It’s not ethical to rank countries without giving us details on the data. It’s not equitable to rank countries on data from wildly different years. Where is the data biography for the new OECD Social Institutions and Gender Index?
I do a lot of work in the area of global domestic violence. I’m always excited to get my hands on new data in order to understand emerging trends and see how society is progressing. The OECD recently released an entirely updated database that includes domestic violence data – complete with rankings and ratings by country. I was so excited I even attended the live release of the data. However, after getting my hands on the data I’ve realized that it’s basically unusable, and that the way the OECD is using it is inequitable and unethical, because they don’t provide a complete and accessible data biography. It is the responsibility of every data producer to include this information and the responsibility of everyone who uses the data to check for it.
Don’t Blindly Accept Information
This is a great example of how NO SOURCE of data should be accepted without important details like the metadata (data biography) and methodology, no matter how large or trustworthy the source appears. This prestigious international institution is using data to assign countries ratings and rankings:
#1: Without giving us any real transparency about what data they’re using. (For example, the links to the information on the who, what, and when of the data are largely missing or lead to blank pages.)
And
#2: Using data that appears to be incredibly old for some countries, and simply not addressing the age of the data for the rest. (For example, ranking countries using 2006 data for some countries, 2010 data for others, and so on, while calling it the 2019 ranking.)
That is not okay for a tool that claims that it “provides a strong evidence base to effectively address the discriminatory social institutions that hold back progress on gender equality and women’s empowerment and allows policy makers to scope out reform options and assess their likely effects on gender equality in social institutions.”
It’s unethical to present a simplified rank or summary rating for countries, one explicitly designed to be used by policy makers, without providing transparency about what data you’re using, when it was collected, and who it was collected from. Details like these can be found in any respectable dataset’s data biography. It’s also unethical and inequitable to build a cross-country comparison on such a large range of years without explicitly stating it. National social attitudes can change a great deal in a decade. Additionally, we know that the time and money spent collecting this data isn’t equitable between countries.
The SIGI homepage.
The Issue of Transparency
The Social Institutions & Gender Index (SIGI), published by the OECD Development Centre, includes four dimensions: discrimination in the family, restricted physical integrity, restricted access to productive and financial resources, and restricted civil liberties. One dimension is “Restricted Physical Integrity” – you can get the detailed definition here. That dimension includes, among other factors, the prevalence of violence against women (VAW), attitudes towards VAW, and laws about VAW. When you seek to understand how these factors are being measured, this is where a data biography would be crucial.
It’s not just academic. The lack of detail about the data has real-world consequences. Policy is created using this information, and the picture is not as simple as it is presented. For example, I’m pretty sure (I have to say pretty sure because there are no actual links to the data being used, so I had to sleuth around) that the data on attitudes towards domestic violence in Nicaragua is from 2006. And I really want to know when the Burkina Faso data is from, because there seems to be a really steep change and then the data stops. How are you justifying giving Burkina Faso a ranking on this data? How can we equitably evaluate a country’s current situation when there is clearly a strong positive trend, with declining rates of acceptance of violence, which then abruptly stops – probably because the data collection stopped, not because the trend stopped?
The data they’re using is a combination of data from many different sources and years. When I try to find out where the data they’re using for each country is coming from, this is what I find:
That’s very broad – a list of four sources across twelve years. How can I tell what data they’re using for which specific country? The source listing for the ‘prevalence of VAW’ data is even worse. There are no sources listed at all:
Here’s the methodology section of the SIGI documents. It helpfully explains how they’re using the data to calculate the rankings and the index, but it says nothing about what data they’re using.
After following many different links, each leading me to various data sections, methodology sections, and even the OECD data website, I click on the ‘get the data’ link and it takes me to the OECD databank here. I am told that if I really need the details of where the data for each country came from, I need to go to the page for each individual country included in the Index. I go to the Canada page and below is what I get. The various factors are listed but everything else is blank. There are no figures, no sense of what the margin of error might be, and no hint as to what the sources of the data are.
Empty metadata categories. *crying face*
The Way Forward
So, how to proceed? There are two parts to this story: what to do if you really want to use this data, and what to do if you are producing this data (or data like this).
If you really want to use this data, you’re going to have to hunt down the sources yourself. Without knowing, at minimum, the year the data was collected and who the data was collected from, you simply can’t use it. I spent some time googling terms like “Canada violence against women data” and “Canada interpersonal violence prevalence data” and “Canada attitudes towards violence against women data” and was able to find some trails I could follow. I also found a lot of discussion about how the rates in Canada differ depending on what type of source you’re using, which further emphasizes the point that in order to use the SIGI I really need to know what data they’re using.
A great example of a Data Biography from the OECD.
I want to emphasize that I’m not just bashing the OECD. Anyone who is collecting and processing data for social good is on the same team as me! The OECD itself provides some much better examples of how to do this. In addition to the SIGI, the OECD also produces something they call the Better Life Index. This one also aggregates lots of data from many sources and ranks countries, and the data biography provided for the background data in this index is much better. For example, for the factor “Housing: homes without acceptable facilities” they provide a detailed description of the data sources, what is being measured, and what year is being used for each country. Looking this up allows us to see that they’re ranking Canada based on 1997 data – which you may question – but at least you have the facts needed to decide for yourself.
Mixed source data isn’t inherently useless, but it needs to be used with extreme transparency. When you have the opportunity to publish a dataset, make sure that the data biography is complete and available. On the flip side, never, ever, ever use, interact with, or subconsciously absorb data that you know nothing about as if it were fact. Find out if the apples are apples and the oranges are oranges.
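For anyone who keeps their own notes on where each country's figure comes from, the minimum check described above (who collected the data, and in what year) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `BiographyEntry` record, the two helper functions, and the sample rows are all hypothetical, and none of them reflects how the OECD stores or publishes SIGI data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BiographyEntry:
    """One hypothetical data-biography row: a single indicator for a single country."""
    country: str
    indicator: str
    source: Optional[str]  # who collected the data (None if unknown)
    year: Optional[int]    # when it was collected (None if unknown)


def missing_details(entries):
    """Return (country, indicator) pairs that lack a source or a collection year."""
    return [
        (e.country, e.indicator)
        for e in entries
        if not e.source or e.year is None
    ]


def year_spread(entries):
    """Return (oldest, newest) collection year among rows with a known year, else None."""
    years = [e.year for e in entries if e.year is not None]
    return (min(years), max(years)) if years else None


# Made-up rows for illustration only; these are not real SIGI sources.
entries = [
    BiographyEntry("Country A", "attitudes towards VAW", "Hypothetical household survey", 2006),
    BiographyEntry("Country B", "attitudes towards VAW", "Hypothetical household survey", 2010),
    BiographyEntry("Country C", "attitudes towards VAW", None, None),
]

print("Rows you cannot use yet:", missing_details(entries))
print("Years being compared:", year_spread(entries))
```

Any row flagged by `missing_details` still needs its source hunted down before the figure can be used, and a wide gap from `year_spread` is a sign that the apples may not all be apples.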