In my early years of analyzing data, I worked on a project that was designed to improve women’s self-empowerment. We needed to develop some data that would help us understand the potential impact of the project. Measuring self-empowerment is not a straightforward thing to do; it’s not like I can use a thermometer to measure self-empowerment the same way we can to measure a basal temperature.

When you need data on something that you don’t have a direct measurement of, we turn to something called proxies. Proxy variables are data that we hope captures the quantitative essence of a thing without actually measuring a thing. They are used a lot in data projects, mostly because we often don’t have the data we actually want or need.

In my early experience with the women’s self-empowerment project we gathered data on things like a woman’s ability to make spending decisions with money, her ability to travel freely, and her ability to own things she wanted. A key piece of data we collected was whether or not a woman could make domestic financial decisions on her own. It wasn’t until we started having conversations with the women in our study that we realized this was a bad proxy for the level of these women’s self-empowerment. Many of the women told us that they did not want to make these decisions on their own, but that they wanted to make them in partnership with their spouses. Rather than being told to ‘handle’ the day to day finances, they saw it as a better situation to be part of an equal team. Our definition of empowerment and the creation of a proxy variable was not a culturally appropriate choice.

In many ways, proxy variables can be thought of as synonyms; data capturing information that is very similar but not quite the same as the main, original term.

Because proxies are built on assumptions or subjective calculations about what variable might represent another, they are key sticking points for equity issues. Going through the project with a fine tooth comb to identify and verify all the proxies being used is an easy and effective way to make the worldviews embedded in your work transparent and helps you make any necessary changes.

Proxies are everywhere in data projects. Researchers often want to understand how personal economics and wealth affect decisions and outcomes, yet this is almost never data that we directly have. A country’s standard of living at a broad level is often proxied with Gross Domestic Product (GDP). Which is hugely problematic for a long list of reasons like income inequality, the translation of financial wealth into improving lives, and cultural differences in measures of ‘standard of living’. 

The Demographic Health Surveys database is filled with high quality survey data from several dozen countries. The health data in this database is great, however it contains no direct data about income or wealth. And most health equity research needs to include some measure of this. As a proxy, most data projects instead choose to use some of the information about the assets that are owned by the household or the state of the house itself.

The asset variables in the DHS include things like the number of rooms in the house, and indicators for whether the household has a refrigerator, clock or watch, sewing machine, VCR, radio, television, fan, bicycle, car, motorcycle, electric lighting, flush toilet or latrine, and livestock; whether the kitchen is in a separate room in the house; whether the primary, cooking fuel is wood, cow dung, or coal; and whether the drinking and non-drinking water comes from a pump or an open source (as opposed to being piped into the home).

These variables do reflect to some extent the level of wealth in a household, but they also reflect the values and preferences of the people in the house. I might be surprised to find a sewing machine in a millionaire’s mansion, but that doesn’t mean they can’t afford one. Using assets as a proxy for wealth requires very nuanced understanding of the interplay between culture, personality, and assets. 

Proxy measures can be powerful tools for data projects who do not have the exact data they want, but know the outcome they are trying to achieve. Infant mortality rates, for example, are a direct measure of healthcare quality but are also a proxy for the economic and social welfare of a community. The unemployment rate is a direct measure of unemployment, but is also a proxy for the overall state of our economy.

Economists and other social scientists rate proxy variables based on their correlation with the thing being studied. This is methodologically dubious in some cases, as the very reason for using a proxy variable is that the actual value of the thing being studied is not known. A common example is happiness. While there are various indices of happiness, they are all an assembled cocktail of proxy variables. It isn’t that living in a place with low infant mortality causes euphoria, it’s that this is a factor that is generally included in happiness index calculations (partly because it in itself is a proxy for other things, like the availability of good hospital care).

Any time that a proxy variable is being used, a specific world view or set of values is being embedded in the data, the analysis, and the outcomes. There are basic assumptions about how well the proxy variable represents the information you’re looking for, even when mathematically tied to correlations. It’s also impossible to know if there might be a better proxy variable that you couldn’t think of from your perspective as a researcher. 

Whether GDP is a good proxy for standard of living really depends on your place in the current world economic hierarchy.

Whether home value is a good proxy for socioeconomic status really depends on whether you believe in a certain, very specific idea about a ‘correct’ life trajectory. I know many high income long-term renters. 

Whether making decisions independently or jointly is a better proxy for empowerment really depends on your attitudes towards power and relationships.

Whether the number of violent crimes reported to the police or the number of crimes that are self-reported on a survey is a better proxy for community violent crime rate depends a lot on your views of crime and law enforcement.

All of these proxies are currently being used in data projects today.

So, what to do with proxies? You’re still going to need to use them. The way to infuse an equity lens into the process is being very transparent about the fact that you’re using them and explaining your reasoning. You also need to spend time actively deciding whether the proxies you’re using are actually measuring what you need to measure – and from whose point of view.