Facebook Pixel

The following is an excerpt from Heather Krause’s 2022 Keynote: “Who is the Data Privileged Person?”.

So, a statistical model in a data project is simply all of the relationships between variables setup in the way that is the modeller’s best possible description of the world around them. Compared to the infinite variables and relationships in reality, all models are limited, flawed, subjective shadows overlaid on to the real world. As George Box put it: “All models are wrong, but some are useful.” If we want to talk about equity we need to add the line: “useful to whom?”.

Let me give you an example of three equally valid models of the same data that privilege completely different people’s perspectives.

You are a data analyst who has been hired by a tech startup. They make a budgeting app called something silly like… Spnd without an E… Spnd costs $175 dollars for a yearly subscription, and the company wants empirical proof that by using the budget app users save even more than the cost of the subscription. They want to say that it quote: “pays for itself”.

Ok, you say, I need to run a causal analysis to see how app usage (our independent variable) affects our dependent variable: money saved. In order to see the effect of the app apart from other things that might affect how much money we manage to save, I’m going to account for them (sometimes called control for them).

One of the things I think to control for in my model is the user’s disposable income proportion – what proportion of their income is left over after paying for living essentials like food, shelter, heat, electricity, etc.

I’m going to control for it by holding it steady across all users – by locking it in place – and therefore isolate the effect that the budgeting app is having regardless of what portion of their income is disposable. I run this model and it turns out that their app isn’t having much of an effect at all on whether or not people saved at least $175 extra dollars. Sorry guys, I say in the meeting.

Hold on, say the app developers! Your model doesn’t make sense! You’ve controlled for disposable income as a confounder. The whole point of the budget app is that it affects your proportion of disposable income which in turn affects your savings. People who use our budgeting app realize that they are paying too much for things like food or rent and need to cut back, thereby gaining more money that they can save! You know, like, “I used to eat out for half my meals until I started using the app, now I pay attention to my spending and my food costs are way down.” That’s the whole point of budgeting!”

Fair enough. I run the same data through a model where disposable income is a mediator and the results this time suggest that the app works great!

Now, it’s not important to know what a confounder or a moderator or a mediator is in a model. These are just ways to describe the different types of ways to control for a variable in the model, you do not need to remember them. The key thing to know is: there’s more than one way to control for something in a model.

So back with the tech bros, we’re all about to celebrate the app when a woman from customer service raises her hand and says, “that’s all great, but if the app is so effective, why do we have a whole segment of users who don’t renew their subscriptions? She reads us a customer review: “Your app was easy to use and worked well, but I can’t justify the cost. I spent a lot of time writing out all of our bills and rent and grocery expenses but I’m already saving as much as I possibly can with no money left over. I’d rather save the subscription fee.””

You think about this and it makes a lot of sense. You draw a model that looks more like this: what if your proportion of disposable income actually affects the effect of the budgeting app? In data modelling, we call that a moderator. If you have a bunch of expenses or overspending that you can cut down on, or if you’ve got all kinds of bills and subscriptions due at different times you can’t keep track of, this app might be great for your savings.

But if you already live tight; if you know exactly when your landlord is going to knock on your door for the rent which is already as low as you can find it, or if you know which grocery store you need to go to to save a dollar on milk this week, this budgeting app is going to take up your time, cost you 175 bucks and maybe not help much at all. You run this model and the results show that the app work on average only ok, but if you break it out into groups you see why the average is middling: the app works great for users with highly flexible proportions of disposable income and has little to no positive effect for users who don’t.

Now, here’s the important part: none of these models are wrong. They just privilege a different perspective.

This model shows how the app performs assuming you are in a position to adjust expenses, that is the expected mechanism of change. If you say “the app works” based on this model you are privileging those in a position to benefit from how it works. Which in this case is people who have the financial leeway to do so.

If you use this model, you are assuming that the success of the app depends entirely on whether or not you can adjust your disposable income, giving voice to people across the spectrum of that experience, both positive and negative. That’s really important if you care about helping people already budgeted to within an inch of their life.

Sure your app might be a prettier, faster way to do that, but it might not be worth the 175 dollars to someone who is counting all of theirs.

So we have three different models using the same data, providing 3 different answers to the question: “does it work?”. When we build a model to see if something works we have to choose whose experience of the world that model reflects. Will it privilege the user who really needs to ditch two of his 8 TV streaming subscriptions? Or will we build a model that reflects the relationship between variables experienced by someone without any wiggle room at all in their monthly expenses?

The data privileged person has their experience reflected in the model. Every time we build a model, we privilege those whose perspective we consciously or unconsciously adopt when we do the math.