Author’s Note: This is going to be a long piece, but if we can get this concept down we’ll learn a way to embed our equity priorities deep, deep into the mathematical heart of our data work. Let’s go.
The Model: A reflection of the world as the modeller understands it.
“All models are approximations. Essentially, all models are wrong, but some are useful. However, the approximate nature of the model must always be borne in mind.”
-George E. P. Box
When we make a statistical model to explore a causal question, we’re often trying to measure the effect that one thing (variable) has on another (variable). Let’s say we want to know if participation in a math club is improving the likelihood that students in our school board will graduate.
We’re looking to measure a causal relationship (as opposed to predictive or descriptive). Causal analysis is the hardest and most sought-after kind of analysis (though maybe not as financially lucrative as predictive *ahem Amazon and Netflix ahem*) because it aims to describe the effect of the math club on the likelihood of graduation. If we’re a school board trustee, it can help us decide whether to fund more math clubs. If we’re a student, or perhaps a parent of that student, we can use this information to decide whether or not joining a math club is worthwhile.
So, we’ve got our first two variables. (For those of you who are way past these basics, hang in there because it gets interesting and there are equity ramifications that they don’t teach you about in school. Yet!)
In most every model, we need to isolate/account for/control for the effect of other variables that could be affecting our dependent variable. Of course, which variables are even part of your model is a massive equity issue, but variable selection is a subject for another article. Today we’re focusing on what to do with these other variables once we’ve included them.
Let’s say that we want to account for “Time Spent Studying”. It’s pretty reasonable to think that this might affect our main variables.
We want to control for “Time Spent Studying” but guess what? There’s more than one way to do that. Each way reflects a certain worldview or expected relationship between these three variables. And each way will produce a different result.
We could build our model under the assumption that “Time Spent Studying” affects both of the other variables. Maybe we think that how much time a student spends studying affects how likely they are to be in a math club. It could be that we think it makes them more likely because they are huge nerds for math, or it could be that we think students who already study a lot independently won’t want or need a math club.
Regardless, we think that it is affecting math club involvement. We also think that how much time a student spends studying affects their likelihood of graduating independently from whether or not they are in a math club, certainly a view held by millions of nagging parents around the world, myself included.
Modelling this way, we’re treating the variable of “Time Spent Studying” as a CONFOUNDER. (Ok, welcome to the somewhat unhelpful and very specific stats terminology that we think is worth learning even though we don’t love the terms and their confusing implications…) CONFOUNDERS are variables modelled as affecting both the independent variable (here, math club participation) and the dependent variable (likelihood of graduation).
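To make “controlling for a confounder” a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the counts are invented, study time is collapsed into just two levels (“high” and “low”), and the helper names `naive_difference` and `adjusted_difference` are hypothetical. It compares the raw gap in graduation rates with the gap we see when students are only compared to peers at the same study level.

```python
# Hypothetical counts, invented for illustration only:
# (study_level, in_club) -> (number of students, number who graduated)
counts = {
    ("high", True): (80, 76),
    ("high", False): (20, 18),
    ("low", True): (20, 11),
    ("low", False): (80, 40),
}


def grad_rate(rows):
    """Graduation rate pooled over a list of (students, graduates) pairs."""
    students = sum(n for n, _ in rows)
    graduates = sum(g for _, g in rows)
    return graduates / students


def naive_difference(counts):
    """Club vs. no-club graduation gap, ignoring study time entirely."""
    club = [v for (_, in_club), v in counts.items() if in_club]
    no_club = [v for (_, in_club), v in counts.items() if not in_club]
    return grad_rate(club) - grad_rate(no_club)


def adjusted_difference(counts):
    """Club vs. no-club gap within each study level, averaged by stratum size."""
    total = sum(n for n, _ in counts.values())
    levels = sorted({level for level, _ in counts})
    diff = 0.0
    for level in levels:
        n_club, g_club = counts[(level, True)]
        n_none, g_none = counts[(level, False)]
        weight = (n_club + n_none) / total
        diff += weight * (g_club / n_club - g_none / n_none)
    return diff


print(round(naive_difference(counts), 2))     # 0.29
print(round(adjusted_difference(counts), 2))  # 0.05
```

In these made-up numbers the raw gap is 29 percentage points, but within each study level it is only 5. The rest of the gap comes from heavy studiers being both more likely to join the club and more likely to graduate, which is exactly the worldview the confounder model encodes.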
Note an important idea: just because a variable can be modelled as a CONFOUNDER doesn’t mean it is a CONFOUNDER. We’re about to see that almost any variable can be considered in more than one way.
What if we think the relationship works more like this:
In this model, the independent variable (Math Club) is affecting “Time Spent Studying”. Maybe we think that being in a math club inspires additional studying, or reduces the amount someone studies. “Time Spent Studying” is then in turn affecting the likelihood of graduation, in addition to whatever effect Math Club is directly having on the likelihood of graduation (the bottom arrow).
This setup would treat Time Spent Studying as a MEDIATOR.
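Here is a rough sketch of what the mediator view looks like in numbers. Again, the counts are invented and study time is reduced to two levels; `total_effect` and `direct_effect` are hypothetical helper names. In this toy data, joining the club shifts more students into the “high” study level, and the sketch splits the overall gap into a direct part (same study level, club vs. no club) and an indirect part that flows through study time. The simple subtraction only works here because the made-up numbers contain no interaction and no other confounding; real mediation analysis needs much stronger assumptions.

```python
# Hypothetical counts, invented for illustration only:
# (in_club, study_level) -> (number of students, number who graduated)
counts = {
    (True, "high"): (70, 63),
    (True, "low"): (30, 18),
    (False, "high"): (40, 32),
    (False, "low"): (60, 30),
}


def total_effect(counts):
    """Club vs. no-club gap, letting study time change along with club membership."""
    def pooled_rate(in_club):
        rows = [v for (club, _), v in counts.items() if club == in_club]
        return sum(g for _, g in rows) / sum(n for n, _ in rows)

    return pooled_rate(True) - pooled_rate(False)


def direct_effect(counts):
    """Club vs. no-club gap with study level held fixed, averaged over levels."""
    total = sum(n for n, _ in counts.values())
    effect = 0.0
    for level in ("high", "low"):
        n_club, g_club = counts[(True, level)]
        n_none, g_none = counts[(False, level)]
        weight = (n_club + n_none) / total
        effect += weight * (g_club / n_club - g_none / n_none)
    return effect


total = total_effect(counts)
direct = direct_effect(counts)
indirect = total - direct  # the part of the gap that runs through study time

print(round(total, 2), round(direct, 2), round(indirect, 2))  # 0.19 0.1 0.09
```

Notice that holding study level fixed (the same arithmetic we would use for a confounder) reports only the 10-point direct effect and silently drops the 9 points that math club produced by changing how much students study.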
What if we think that the variable “Time Spent Studying” has its greatest impact not on a variable, but on the relationship or effect itself? Then we’re talking about putting it in the model as a MODERATOR like this:
What worldview would lead us to model like this? In this model, we’re not expecting math club to change the time students spend studying. Rather, we’re looking at how the amount students are already studying changes the relationship between our key variables. Maybe we think that students who don’t study very much will benefit greatly from a math club, while those who already spend a lot of time studying won’t see much impact on their likelihood to graduate. Maybe we think the inverse of that: students who spend a lot of time studying will experience a greater effect from math club involvement.
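A moderator shows up in the numbers as an effect that differs from group to group. The sketch below uses invented counts and two study levels, and the helper name `club_effect` is hypothetical. It estimates the math club effect separately for students who study a little and students who study a lot, then takes the difference between those two effects (what a regression model would call an interaction).

```python
# Hypothetical counts, invented for illustration only:
# (study_level, in_club) -> (number of students, number who graduated)
counts = {
    ("low", True): (50, 35),
    ("low", False): (50, 25),
    ("high", True): (50, 46),
    ("high", False): (50, 45),
}


def club_effect(counts, level):
    """Club vs. no-club graduation gap among students at one study level."""
    n_club, g_club = counts[(level, True)]
    n_none, g_none = counts[(level, False)]
    return g_club / n_club - g_none / n_none


# One effect per study level, instead of a single pooled number.
effects = {level: club_effect(counts, level) for level in ("low", "high")}

# How much the club effect changes between the two study levels.
interaction = effects["low"] - effects["high"]

print(round(effects["low"], 2))   # 0.2
print(round(effects["high"], 2))  # 0.02
print(round(interaction, 2))      # 0.18
```

In this toy data, math club adds 20 percentage points for students with little study time and only 2 for those who already study a lot. A single averaged number would hide that difference.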
At this point, this all feels like a trick question. You just want to know which one is correct. Is “Time Spent Studying” a mediator, a moderator, or a confounder? Just tell us the answer, please! Well, there is no “right” answer. All of these models can be technically correct. They just model a different perspective or assumption about how the variables relate to each other.
To throw you another curveball: these relationships and effects aren’t mutually exclusive in the complexity of the real world, but they are pretty much mutually exclusive in the model. You can’t effectively control for a variable as a moderator, a mediator, and a confounder all at once. That’s a bit of an oversimplification, and I don’t want to totally alienate people with the depth of this article, but if you are actually building models I highly encourage you to look into this issue.
Let’s recap: in causal models, there are multiple ways to account for variables and each of these ways reflects a different kind of relationship. Our design is based on assumptions, other people’s work, or the general worldview that we’re basing our model on. That’s true of all science. Every experiment requires a set of assumptions and parameters that you aren’t going to test as part of this particular experiment. If you are in high school science, you usually have to write these assumptions out in a short paragraph at the top of your report. I wish we didn’t abandon that requirement so quickly in a few sectors… anyway, I’m getting off-topic…
Why does this matter for equity? (And how the heck do I choose how to model any given variable?)
It’s not a problem that worldviews are unavoidably part of modelling. It is a problem when that part of the process goes unexamined. Analysis is the dark heart of data science where a small group or often a single individual makes enormously complicated decisions in how to model all the variables. Decisions that other stakeholders in the process don’t even know about.
This conversation isn’t nuanced enough:
Program Director: “Hey, we really care about how much study time is related to this math club program we’re considering funding…”
Analyst: “Oh don’t worry I controlled for it in the model!”
Program Director: “Oh ok, great.”
How did the analyst control for it? We need to know whose perspective and worldview were used to model the variable relationships. If it goes unexamined, then it’s usually the analyst’s. If we’ve got a Motivation Statement, we might already know whose perspective we want to prioritize and that can be applied right down to the model.
Let’s say that we’re trying to center the perspective of students experiencing poverty. We want to know if our Math Club program works, and we’ve set a benchmark for at least a 10% increase in the likelihood to graduate as “worthwhile”. Which way should we model “Time Spent Studying”? Mediator, Moderator, or Confounder?
Experts on our advisory panel, which includes students in poverty, have reminded us that many students in our school board have after-school jobs that are essential sources of income for their households, and they generally have less time to study than their more well-off peers.
If we control for “Time Spent Studying” as a CONFOUNDER, we will neutralize the importance of how much time they have to study, and therefore largely remove from our model any measure of the difference it might cause. In effect, the model acts as though we don’t care that there are disparities in the privilege of time to spend studying.
If we model it as a MEDIATOR, we will reflect the idea that math club’s effect on graduation likelihood depends greatly on whether or not it changes how much time students spend studying. This shows the importance of being able to take advantage of the extra time needed to participate fully in the program and reap the benefits. If increasing study time is the mechanism through which Math Club is effective, then different students will benefit differently.
If we model it as a MODERATOR, we will account for study time by identifying how effective math club participation is depending on how much a student already studies. This reflects a view that only certain students with certain amounts of study time will benefit from math club. It could be that those with the least time to study (students in poverty are among that group) benefit the most from the support and intensive instruction of a math club. It could be that only students with ample study time benefit.
These models are going to have different outcomes and some may generate results that pass our “at least 10% increase” benchmark for continuing the program and some may not. By knowing whose perspective we’re trying to best represent in the model we improve the answer to the question “is math club working?” by changing it to “is math club working, and for whom?”.
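As a small illustration of how the benchmark can land differently depending on the modelling choice, here is a sketch in Python. The effect estimates are hypothetical placeholders rather than results from any real analysis, the labels and the `passes_benchmark` helper are made up, and the “10% increase” is read as 10 percentage points, which is an assumption.

```python
# Assumption: "at least a 10% increase" is read as 10 percentage points.
BENCHMARK = 0.10

# Hypothetical effect estimates, standing in for what different
# modelling choices might report. These are not real results.
estimates = {
    "all students (study time as confounder)": 0.08,
    "students with little study time (moderator)": 0.20,
    "students with ample study time (moderator)": 0.02,
}


def passes_benchmark(effect, benchmark=BENCHMARK):
    """True when the estimated effect meets or exceeds the benchmark."""
    return effect >= benchmark


verdicts = {label: passes_benchmark(effect) for label, effect in estimates.items()}

for label, ok in verdicts.items():
    print(f"{label}: {'meets' if ok else 'misses'} the benchmark")
```

With these placeholder numbers, the pooled estimate would end the program, while the estimate for students with little study time would keep it. Same program, same benchmark, different answer depending on whose experience the model represents.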
This isn’t P-Hacking (which our creative director really likes to pronounce “facking” because he has the comedic taste of an eight-year-old). We’re not suggesting running model after model to get the number you want. In fact, it is ideal to decide whose experience to model on before running the models.
It’s also important to note that we are not saying that causal modelling which uses moderators is always the most equitable form of modelling. In some contexts, one of the other models in this example would have been just as good at centering the values and equity priorities of your project.
It also shouldn’t shake your faith in all data science that we can create equally valid models that have different outcomes. The models may be equally “correct” but some are better than others at representing the experience you care about and are better at answering the question that you are actually asking. Keep pushing to get your equity priorities right into the heart of your data process and you’ll get better equity and better science at the same time!