
BACKGROUND: We All Count asked you, our project members, for your experiences – both positive and negative – with situations where you can’t pick the data that you’re asked to work with. We heard about difficulties with poorly designed collection questions; non-representative sample populations; vague, missing and useless metadata; and much more, but the most common problem we’re hearing about is inadequate social construct information or categories. How can we embed equity between groups and individuals within our population when we don’t even have inclusive and accurate information about them?

“I’m wondering if you have strategies for embedding equity in the “data collection” category for projects that get data second-hand. Over 90% of our projects use data that has already been collected (usually government or non-profit) and historically we have had no input on the construction of categories.”


Here’s a situation that happens all the time: A researcher is handed a pile of data that someone else collected and is asked to answer important questions about the people in that data. The researcher looks over the dataset and heaves a heavy sigh; the social construct categories (demographics) are woefully inadequate to extract the kind of equitable answers they need.


What can they do? Should they refuse to use the data based on their commitment to equity? Should they change or abandon their original questions? Can they do anything to improve the data they have, short of going out and collecting it again? Can they report on this data in a way that acknowledges its shortcomings and doesn’t lead to anyone feeling left out or forgotten? The answer is yes. Yes to all of these questions, depending on your situation.


Karina, a health equity professional from New England, shared her experiences trying to create maps of infant and maternal health equity using the publicly available US Medicare data. The data does include a marker for race and ethnicity, and she initially followed other researchers and used this to understand health disparities. However, after doing some work with the Data Equity Framework, she started to build a data biography for the US Medicare data. She realized that the race and ethnicity marker had some serious issues: it was much more accurate at identifying White and Black people than anyone else.


When she did some cross-checking by calling respondents and asking them how they racially identified, the Medicare data matched the first-hand responses at a high rate for ‘White’ and ‘Black’ (97 and 96 percent, respectively). But only 52 percent of Asian, 33 percent of Hispanic or Latino, and 33 percent of American Indian or Alaska Native beneficiaries were correctly identified. So – can she even use this data to explore health equity when equity problems are already so deeply embedded in it?
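
A cross-check like this boils down to a per-group agreement rate: of the people who self-identify with a group, what share did the dataset label the same way? Here is a minimal Python sketch of that arithmetic. The function name and the sample rows are invented for illustration and are not Karina's actual data or method.

```python
from collections import defaultdict


def agreement_by_group(pairs):
    """Return {self_reported_group: share of records whose recorded label matches}.

    `pairs` is an iterable of (recorded_label, self_reported_label) tuples.
    Rates are grouped by the self-reported label, because that is the
    identity we treat as the reference.
    """
    totals = defaultdict(int)
    matches = defaultdict(int)
    for recorded, self_reported in pairs:
        totals[self_reported] += 1
        if recorded == self_reported:
            matches[self_reported] += 1
    return {group: matches[group] / totals[group] for group in totals}


# Made-up illustration rows (assumption): (label in dataset, self-identification).
sample = [
    ("White", "White"),
    ("White", "White"),
    ("White", "Hispanic or Latino"),
    ("Hispanic or Latino", "Hispanic or Latino"),
    ("Unknown", "Hispanic or Latino"),
    ("Other", "Asian"),
    ("Asian", "Asian"),
]

rates = agreement_by_group(sample)
for group, rate in sorted(rates.items()):
    print(f"{group}: {rate:.0%} correctly identified")
```

Putting a table like this in your data biography makes it obvious which groups the existing categories serve well and which they don't.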


Option 1: Don’t use the data.


Ok, here’s the most extreme option and the one we’re least likely to recommend. Sometimes, your raw data is so inequitable that it shouldn’t be used.

Once we were given data from a city-wide phone poll about employment standards violations. The people who collected this data wanted to get answers about differences between men and women answering the calls, but when we dug into the metadata (the information about the data), we found out that the gender of the respondent was assigned by the collector on the other end of the phone, based on how they “sounded.” The problematic assumptions, stereotyping, and lack of self-identification made it impossible to equitably answer their questions, and we told them so. Disappointing, but necessary.


When you don’t use the data, you’re losing whatever useful potential was there, which is unfortunate and often costly. You’re also likely to cause immense workplace tension, up to and including situations where people lose their jobs. Unless it’s absolutely, unsalvageably bad, we recommend against this. On the flip side, if someone is pressuring you to use data that’s absolutely, unsalvageably bad, taking a principled stand against its use is probably the most effective step that we can all take to stop reinforcing oppressive practices and crappy science.


Ok, now for the more common, less scary options we have.


Option 2: Augment the data.


If you’re missing specific demographic categories, you may be able to source additional data directly from the respondents you need, or from data sources that are set up specifically to address these gaps – for example, the exceptional Indigenous data resources available in Canada, Australia, or the USA.


For Karina in our example above, this is the option we recommended. She can supplement the poor race categories in her Medicare data by matching it with US census data that (while still flawed) has at least three more nuanced and more recently updated questions about racial identity, allowing her to get to the heart of that area of equity in her project.
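
We haven't described exactly how the matching is done, so treat the following as one possible sketch rather than Karina's method. It assumes both datasets share a geographic identifier (the hypothetical `area_id`) and attaches area-level census fields to each record. All field names and values are made up. Note that area-level composition describes a neighborhood, not an individual's identity.

```python
def augment_with_area_data(records, area_table, key="area_id"):
    """Attach supplementary area-level fields to each record.

    Returns (augmented_records, unmatched_records) so that records the
    supplementary source could not cover stay visible instead of vanishing.
    """
    augmented, unmatched = [], []
    for record in records:
        extra = area_table.get(record.get(key))
        if extra is None:
            unmatched.append(record)
            continue
        # Copy the record and prefix added fields so their source is never
        # confused with the original data.
        merged = dict(record)
        merged.update({f"census_{name}": value for name, value in extra.items()})
        augmented.append(merged)
    return augmented, unmatched


# Hypothetical example rows (assumption): names and numbers are illustrative only.
claims = [
    {"person_id": 1, "area_id": "A", "race_recorded": "Other"},
    {"person_id": 2, "area_id": "B", "race_recorded": "White"},
    {"person_id": 3, "area_id": "Z", "race_recorded": "Unknown"},
]
census_by_area = {
    "A": {"share_asian": 0.31, "share_hispanic": 0.22},
    "B": {"share_asian": 0.04, "share_hispanic": 0.09},
}

augmented, unmatched = augment_with_area_data(claims, census_by_area)
print(len(augmented), "augmented;", len(unmatched), "unmatched")
```

Returning the unmatched records separately is deliberate: the people the second dataset also misses are exactly the ones you need to know about.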


Option 3: Improve the data.


There are two great ways to do this, depending on the scope of your resources and project type: recollection or statistical blending.


Ok, you got us, recollection of data gets pretty close to being able to pick the data – the opposite of the problem we’re addressing. It’s worth talking about, though, depending on what type of data you’re working with. You don’t have to start the collection again from scratch just because you need to go back and get a little more information in an area that’s missing from someone else’s dataset.


We were working with a large company on their pay equity using HR data. When we got into the dataset, we found that the gender of employees was self-assigned by a hiring intake form, but the form only had two options – male or female – to choose from. This limited, binary gender option didn’t match this company’s equity standards and they agreed that it was outdated and didn’t give their employees the choices they wanted around their gender identities. Rather than scrapping all of their decades of HR data, we were able to simply send around a one-question company-wide survey with more inclusive gender options, making the larger dataset much more useful in answering their gender equity questions.
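
Folding a small follow-up survey back into the existing records might look something like this in Python. This is a sketch under assumptions: the `employee_id` key, the field names, and the answer options are all hypothetical, and a real survey would also need to handle consent and privacy. The idea is that self-reported answers take precedence, and every record notes where its value came from.

```python
def apply_survey_responses(hr_records, survey_responses):
    """Return new records where self-reported survey answers take precedence.

    Every record gets a `gender_source` field so analysts can see which
    values came from the new survey and which still come from the intake form.
    The input records are not modified.
    """
    updated = []
    for record in hr_records:
        new_record = dict(record)
        answer = survey_responses.get(record["employee_id"])
        if answer is not None:
            new_record["gender"] = answer
            new_record["gender_source"] = "survey"
        else:
            new_record["gender_source"] = "intake_form"
        updated.append(new_record)
    return updated


# Hypothetical example data (assumption).
hr = [
    {"employee_id": 101, "gender": "female"},
    {"employee_id": 102, "gender": "male"},
    {"employee_id": 103, "gender": "female"},
]
survey = {101: "woman", 103: "non-binary"}

updated = apply_survey_responses(hr, survey)
coverage = sum(r["gender_source"] == "survey" for r in updated) / len(updated)
print(f"Survey coverage: {coverage:.0%}")
```

Tracking the source lets you report how much of the analysis rests on the new, more inclusive question and how much still relies on the old binary form.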


The second route is statistical blending: rather than simply matching records with an additional dataset the way we talked about in Option 2, you can use statistical methods to blend your data with other data.


Option 4: Be transparent about the data equity issues.


The last option we’ve seen people use successfully is one of those things we’d call “way better than nothing.” If you feel that it’s worth analyzing and reporting on the data as is, it’s important to be truly transparent about who got missed and what equity questions you wanted to answer but couldn’t. Talking about the equity shortcomings of your data in the report or final data product can turn angry people who feel overlooked into supportive allies who feel heard and want to support your intentions to do better next time.
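
One concrete way to be transparent is to put a number on "who got missed." The Python sketch below counts records with no usable value for a demographic field, so the figure can go straight into a limitations note. The field name, the sample rows, and the choice to treat labels like "Unknown" and "Other" as unusable are assumptions that you would adjust for your own data.

```python
from collections import Counter


def limitations_summary(records, field, unusable=("", None, "Unknown", "Other")):
    """Summarize how many records lack a usable value for `field`.

    A record counts as missed if the field is absent or its value is one
    of the `unusable` labels (an assumption to tailor per dataset).
    """
    counts = Counter(record.get(field) for record in records)
    missed = sum(n for value, n in counts.items() if value in unusable)
    total = len(records)
    return {
        "field": field,
        "total": total,
        "missed": missed,
        "missed_share": missed / total if total else 0.0,
    }


# Made-up rows for illustration (assumption).
rows = [
    {"race_recorded": "White"},
    {"race_recorded": "Unknown"},
    {"race_recorded": "Other"},
    {"race_recorded": "Black"},
    {},
]

summary = limitations_summary(rows, "race_recorded")
print(
    f"{summary['missed']} of {summary['total']} records "
    f"({summary['missed_share']:.0%}) had no usable '{summary['field']}' value."
)
```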


We’ve found that, as data audiences and project participants, we can understand project limitations and forgive shortcomings if the data producers talk about them upfront and outline what they’re doing so they can do better next time. There is no “perfect” dataset without equity issues – only people who are working to do better and people who aren’t.