This is part 2 of our examination of Proxy Variables. Take a look at our introduction to the subject here.

So, is race a proxy variable? Should we use race as a proxy variable? It depends. Sometimes.


In working with equity in quantitative data, we work with a lot of data on race. Data on race is extremely important to include in data projects with an equity lens. However, including it in certain ways can mask rather than reveal equity issues.


 Let’s start by looking at the four most common reasons that a race variable is included in a data project: 


  1. It’s the only data you have
  2. You’re trying to understand the effect race is having on a trend or experience
  3. You’re trying to talk about racism
  4. You’re trying to show that race itself actually causes something


Almost all of the time, in cases 1, 2 and 3 you’re using race as a proxy. Often, case 4 is the only one where you’re not using race as a proxy. When you choose to include race as a proxy, it’s super important to be clear in your project about why and how you’re using this information. Let’s break it down a bit more.


It’s the only data you have…

It’s not uncommon* for open data or administrative data to collect race and gender, and that’s pretty much it for demographics. For example, if you’re trying to understand who is using your community recreation programs the most, you might only have basic demographic data on race. It is possible to use this data to get a sense of who is using your programs; however, the potential for incorrect results and poor decisions is considerable. Check out our video on Simpson’s Paradox. In a case like this, you’re really telling your data what to say rather than exploring what it actually says. If we only have data on race, we tend to conclude things like ‘it’s a race thing!’.
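To make the Simpson’s Paradox risk concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the counts are invented, ‘Group A’ and ‘Group B’ are placeholder labels, and the ‘near’/‘far’ split (distance from the recreation facility) stands in for whatever unmeasured factor might sit behind the top-level numbers.

```python
# Hypothetical counts, invented purely for illustration.
# Key: (group, distance from facility) -> (people attending, people enrolled)
counts = {
    ("Group A", "near"): (18, 20),
    ("Group A", "far"): (24, 80),
    ("Group B", "near"): (64, 80),
    ("Group B", "far"): (4, 20),
}


def overall_rate(group):
    """Attendance rate for a group when distance is ignored."""
    attended = sum(a for (g, _), (a, _n) in counts.items() if g == group)
    enrolled = sum(n for (g, _), (_a, n) in counts.items() if g == group)
    return attended / enrolled


def stratum_rate(group, distance):
    """Attendance rate for a group within one distance band."""
    attended, enrolled = counts[(group, distance)]
    return attended / enrolled


for group in ("Group A", "Group B"):
    print(f"{group} overall: {overall_rate(group):.0%}")
    for distance in ("near", "far"):
        print(f"  {distance}: {stratum_rate(group, distance):.0%}")
```

In this made-up data, Group A attends at a lower rate overall but at a higher rate within each distance band, simply because most of Group A lives far from the facility. With only the top-level numbers you would never see that reversal.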


When something like an attendance rate differs by race, race is usually representing other factors. We call this a proxy variable. Because race can reflect so many things (economic, historical, social, political, cultural, colonial, educational, geographical factors, etc.) and because it can seem like relatively easy data to collect, it gets used all the time. The first problem is that by using race as a proxy variable, you can’t say which factor or which complex network of factors race is standing in for. The second is that gathering race data is fraught with issues: race is a nebulous and subjective description compared to, say, height; it’s easy to collect only when people are racialized and trained to know their race; race is sometimes assigned by someone other than the respondent; and when respondents do get to choose, they are often faced with a poorly crafted list of options.


If you only have the top-level ‘race’ data and you want to highlight differences seen by race, then be transparent that you know race is standing in for other factors and that you just don’t have the data to know what they are.


*In the US. It is much less common in Europe, Asia, and Africa where it’s more common to collect data on language or country of birth.


You’re trying to understand the effect race is having on a trend or experience…


When you have data that shows that different races are experiencing something differently, it seems logical to say so. For example, school district data might show that children of color are, on average, getting lower test scores than white children. It is not inherently incorrect to say this. What is important is how you say it. If we have data that shows that children of color are getting lower test scores, race here is a proxy. There is nothing about race that is directly causing the lower test scores. It’s important to make this clear and dig deeper into discovering exactly WHAT race is a proxy for in the data.


In this scenario, unlike the previous one, you might know or be able to determine what you are really getting at when you say ‘children of color’. So this part is tricky. We don’t want to avoid using race as a proxy here because we don’t want to ignore the fact that systems are encouraging different outcomes. At the same time, we don’t want to encourage a false belief that race itself is causing these differences. In this example, research exists that shows that in children’s test scores race is often acting as a proxy for wealth, access to healthcare, and school climate. Saying it exactly like that puts the focus on the real sources of impact and away from the umbrella term of ‘race’.
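One way to start digging into what race is a proxy for is to compare groups within levels of a factor you can actually measure. The sketch below is a hedged illustration only: the records are invented, the group labels are placeholders, and `resource_level` is a hypothetical variable standing in for something like the wealth or school climate factors mentioned above.

```python
from statistics import mean

# Hypothetical records, invented for illustration: (group, resource_level, score)
records = (
    [("Group A", "high", s) for s in (78, 80, 82, 79, 81, 80)]
    + [("Group A", "low", s) for s in (69, 71)]
    + [("Group B", "high", s) for s in (79, 81)]
    + [("Group B", "low", s) for s in (68, 70, 72, 69, 71, 70)]
)


def mean_score(group, level=None):
    """Mean score for a group, optionally restricted to one resource level."""
    return mean(
        score
        for g, lvl, score in records
        if g == group and level in (None, lvl)
    )


# Gap when only the group label is used.
raw_gap = mean_score("Group A") - mean_score("Group B")

# Gap when students with the same resource level are compared.
gap_by_level = {
    lvl: mean_score("Group A", lvl) - mean_score("Group B", lvl)
    for lvl in ("high", "low")
}

print("Raw gap between groups:", raw_gap)
print("Gap within each resource level:", gap_by_level)
```

In this made-up data the raw gap between groups is five points, but it disappears once students with the same resource level are compared. That does not make the gap unimportant. It shows that the group label was standing in for an uneven distribution of resources, which is the thing to name and report.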


You’re trying to talk about racism… 


Lots of times we want to include a race variable in our data project because we want to highlight the effects and impacts of racism. This is another case where things are tricky. By using race as a proxy for racism we can accidentally place the locus of power in the wrong place and encourage more racism. Also by using race as a proxy we mathematically homogenize the experience of an entire group of diverse people, which, of course, is part of racism.


One possibility is, instead of calling the variable “race”, to name exactly what you’re actually measuring (i.e. the reason you think race is relevant), like ‘racialized students’ or ‘colonial trauma survivors’ or ‘members of an economically disenfranchised community’. Another possibility is, instead of using a person’s race as a proxy for racism, to directly measure experiences of oppression and discrimination. None of these are perfect. Being transparent about your choices is really the only way.


You’re trying to show that race itself actually causes something… 


When you’re doing causal data work and you would like to show that race itself is actually causing something, you might no longer be using race as a proxy. In certain medical studies, such as those testing how a medication is metabolized or how to treat certain diseases most successfully, you may want to use race as a direct variable. However, even then it’s good to check whether you could use a more directly measurable piece of data rather than the social construct of race.



When you want to talk about race itself as a factor, know that it’s very rare and usually medical. Even when it’s medical, don’t assume race isn’t a proxy.


When you want to talk about racism, oppression or other factors linked to race, say exactly what you mean and don’t use race as a proxy. It helps to put the onus in the right place.

When you want to talk about the differences in experience between races, look for better variables that ‘race’ might be hinting towards because race is almost never the actual cause of the difference. 


When you only have ‘race’ and can’t dig into what it might be a proxy for, either don’t use it or be transparent that race is probably standing in for something else and that you don’t have the data to find out what.