If we’re going to use p-values as our measure of what’s a meaningful difference versus a difference due to random chance, we need to recognize that p-values are highly dependent on sample size.
This dependency leads to a situation where we call a problem experienced by a large group “real” while we dismiss the very same problem experienced by smaller groups as “chance”.
Let’s say that our problem is systematic underpayment of women employees compared to men.
Is the company systematically paying women less than their male counterparts, or did it just happen to come out like that in the swirling semi-random soup of the human resources process (incoming salaries, education, experience, hiring practices, interviews, raise requests, job performance, etc.)?
In this jurisdiction, not only is gender a protected class for pay equity, but race is too, so when we run the numbers, let’s break them out by race as well:
The difference in pay between employees who are men and employees who are white women is $900 per year with a p-value of 0.02. That’s the kind of p-value that your average stats professor salivates over. Oh boy, that’s not random chance at all! This is strong evidence of a systematic underpayment beyond what is likely to exist by chance. #thestruggleisreal
The difference in pay between men and Black women is $1,200. Uh oh, it looks even worse… but wait… this result has a p-value of 0.19! By the (mostly made-up and almost impossible to properly explain) laws of the almighty p-value, this ought to be covered with the bright red ink of the “NOT STATISTICALLY SIGNIFICANT” rubber stamp those professors keep in their desk drawer. Sorry Black women employees, but it looks like your pay disparity (though measurable) is just the way the cookie crumbles; with a p-value like this, it might just be random chance that you are being paid less than your colleagues.
This is the statistical equivalent of gaslighting. The reason the Black women had a higher p-value (even though the absolute pay difference was larger than for white women) isn’t that they aren’t experiencing systematic pay discrimination. It’s that they’re facing both that discrimination and a company that doesn’t hire many of them. Black women make up only 7% of this company, leading to a small sample size when the pay equity analysis is broken out by race. These women are facing an even more severe problem that likely warrants an even larger payout, but the method we’ve chosen to examine it hides that fact.
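To see the mechanics, here’s a minimal sketch using a simple two-sample z-test. All of the numbers are made up: the $7,000 salary standard deviation and the group sizes are assumptions chosen just to roughly reproduce the p-values above, not real company data.

```python
import math

def two_sample_z_p(diff, sd, n1, n2):
    """Two-sided p-value for a difference in group means,
    using a simple z-test (common SD assumed known)."""
    se = sd * math.sqrt(1 / n1 + 1 / n2)      # standard error of the difference
    z = diff / se
    return math.erfc(abs(z) / math.sqrt(2))   # two-sided tail probability

# Hypothetical firm: 1,000 men, salary SD around $7,000.
# White women are a large group; Black women are only ~7% of staff.
print(two_sample_z_p(900, 7000, 1000, 500))   # large group: p ≈ 0.02
print(two_sample_z_p(1200, 7000, 1000, 62))   # small group: p ≈ 0.19
```

Notice that the *larger* dollar gap gets the larger p-value purely because of the smaller group size: hold `diff` fixed, shrink `n2`, and the p-value climbs.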
The first key takeaway is that a high p-value doesn’t mean a difference isn’t “real”; it simply means we don’t know. While a low p-value is sometimes a good indicator of a meaningful, non-chance-based difference, a high one usually just indicates insufficient evidence. Claiming that these women’s pay inequality isn’t real because of a high p-value is like a detective who, lacking enough evidence of how a murder occurred, declares that the murder never happened. Ridiculous.
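One way to make “we don’t know” concrete is to look at a confidence interval instead of the p-value alone. A rough sketch with the same kind of made-up numbers (salary SD around $7,000; 1,000 men and 62 Black women, all hypothetical):

```python
import math

def ci95(diff, sd, n1, n2):
    """Approximate 95% confidence interval for a difference in group means."""
    se = sd * math.sqrt(1 / n1 + 1 / n2)   # standard error of the difference
    return diff - 1.96 * se, diff + 1.96 * se

lo, hi = ci95(1200, 7000, 1000, 62)   # the hypothetical small group
print(round(lo), round(hi))           # roughly -596 to 2996
```

The interval spans everything from “no gap at all” to a gap more than three times the white women’s. That’s insufficient evidence, not evidence of absence.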
The second thing to understand is that when we rely exclusively on p-values as the measure of “realness”, we privilege large groups while disadvantaging smaller ones. We are being taught to use a method that can deny the experience of those in smaller groups, without being taught that second part at all.
Note: Just to be clear, this is true of any p-values measuring any kind of difference between groups, not just in a pay equity context.
I wish that p-values were the magical thing many people think they are; that they could be the be-all-end-all way to objectively cut through the random chaos of life (or in this case the random chaos of corporate pay structures) and tell us equally well for any size of group what’s meaningful and what isn’t. But they can’t. We need to use p-values within the limits of what they can and can’t do and own those limitations. We also need to seek out other solutions when applying p-values to differences across differently sized groups doesn’t meet our equity goals and standards.