“We are a small population of people because of genocide. No other reason. If you eliminate us in the data, we don’t exist. We don’t exist for the allocation of resources”.

– Abigail Echo-Hawk, Pawnee, Director of the Urban Indian Health Institute and Chief Research Officer of the Seattle Indian Health Board

There is a world of difference between saying “we’re less certain” and “they don’t count.” If you have a small sample size for a certain group of people, what you can say about them might be less “certain” or “reliable” than you want, but that doesn’t mean we should discount their data with terms like “not statistically significant.” It’s not mathematically correct, and it’s dehumanizing and harmful. 

 

Everyone wants data results with clear meanings that can be relied upon. How much weight we can give to any answer does depend on sample size, but it also depends on sample quality, sample similarity, and what we’re measuring. Sample size isn’t everything when it comes to “certainty.” 

 

A long and rigid tradition in statistical science has been to suppress the results from small sample sizes or neutralize them by calling them “not statistically significant.” The fear of having our work discounted because of high levels of uncertainty often leads us to transfer that burden onto the minority groups in our samples—better that they be discounted than our work. 

 

What we should do instead is get more comfortable talking about levels of reliability. The result from a small sample size is still going to have a confidence interval and a standard deviation. If you’re just encountering these ideas, here are the literal CliffsNotes to get you started. At We All Count, we’ve switched to adding brief, plain-language explanations of these concepts when releasing data, and it’s been very successful with our audiences across all levels of data literacy. 
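To make that concrete, here is a minimal sketch of how a standard deviation and a 95% confidence interval could be computed for a very small subgroup. It is an illustration under stated assumptions, not a description of how any particular project produces its numbers: the function name `summarize` and the five responses are hypothetical, and the interval uses the standard Student's t approach, which assumes a roughly random sample from the subgroup.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom (n - 1).
# Only small samples are tabulated here; that is the case this sketch is about.
T_95 = {
    1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
    6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228,
}


def summarize(responses):
    """Return (mean, sample standard deviation, 95% margin of error)."""
    n = len(responses)
    if n < 2:
        raise ValueError("need at least two responses to estimate spread")
    mean = statistics.mean(responses)
    sd = statistics.stdev(responses)  # sample standard deviation (divides by n - 1)
    # Assumption: beyond the table, fall back to the normal value 1.96,
    # which slightly understates the margin for mid-sized samples.
    t_crit = T_95.get(n - 1, 1.96)
    margin = t_crit * sd / math.sqrt(n)
    return mean, sd, margin


# Hypothetical 1-10 approval scores from a subgroup of five respondents.
small_group = [5, 7, 6, 4, 7]
mean, sd, margin = summarize(small_group)
print(f"n={len(small_group)}  mean={mean:.1f}  sd={sd:.1f}  95% CI: +/- {margin:.1f}")
```

Even with five respondents, the calculation returns a usable average along with an honest statement of how wide the uncertainty is. The small sample shows up as a wider interval, not as a missing row.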

 

Can you feel the difference between these two ways of reporting the same data?

Option #1:
| Subgroup | Avg. Response: “How much do you approve of the new law (1-10)?” |
| --- | --- |
| White | 8 |
| LatinX | 7.2 |
| Indigenous | * |

*Indigenous respondents sample size too small, not statistically significant.

Option #2:
| Population Subgroup | Avg. Response: “How much do you approve of the new law (1-10)?” | Standard Deviation* | Confidence Interval** |
| --- | --- | --- | --- |
| White | 8 | 2.2 | +/- 0.3 |
| LatinX | 7.2 | 4 | +/- 1.2 |
| Indigenous | 5.8 | 1.2 | +/- 3.2 |

* The standard deviation for each result reflects how similar all of the responses in this category were. Higher numbers indicate more variation between respondents. A lot of very similar responses increases the likelihood that our results are a good reflection of that group as a whole. In this chart, the LatinX respondents had a comparatively high standard deviation.

** The confidence interval for each result is the range within which that subgroup’s true average is likely to fall. It’s based on a variety of factors, including sample size and standard deviation. A small confidence interval means there’s very little chance that the estimate is way off. We would like to note that we sampled x respondents in the indigenous community, which is why we have a relatively large confidence interval in that group.

Note this phrase in our explanation of the confidence interval: “we sampled x respondents in the indigenous community.” This language places the onus of meaning and reliability on us, the data project producers. The asterisk no longer attacks the indigenous community for being small, and our transparency about the data demonstrates our trust in our audience and their ability to decide for themselves how much weight to give our numbers.

 

Of course, this doesn’t just apply to small samples within racialized or ethnic groups. We can use this technique to talk about uncertainty and reliability in any kind of social construct data (race, sexual orientation, gender, nationality, etc.) and with previously discounted populations in any kind of demographic category, such as income, language, or age.

 

There’s no mathematical reason to discount a smaller sample, especially when we are willing to step up and discuss uncertainty in our numbers. Publishing information on uncertainty can increase everyone’s data literacy, empower audiences to make their own decisions, and make sure we all count, no matter who we are.