We now have unparalleled access to enormous amounts of data, automatically generated and gathered, which represent sample sizes that are nearly impossible to replicate with traditional survey methods. People excitedly tell us that many major equity issues, particularly representation issues, will be a thing of the past now that we can leverage these massive data sets.
At We All Count, we agree that Big Data is a valuable resource, but there are some very important problems that Big Data alone won’t fix. What’s really exciting to us about Big Data is the ability to combine the efficiency and power of large datasets with the intentionality of small, curated data samples.
What is Big Data?
The term ‘Big Data’ gets thrown around a lot with varying definitions. Is the U.S. Census big data? It is a large and comprehensive data set. Are large, international data sets amalgamated from a variety of sources, like U.N. or World Bank datasets, Big Data? Is live data from a mid-sized phone app Big Data, because it has a lot of data points?
For our purposes, we’re going to define Big Data as: datasets that are really huge – measured in terabytes not gigabytes – containing automatically generated data points like online behaviour, purchases, live locations, ‘likes’, searches, etc. Think Google, Facebook, Amazon. Think Chase Bank, Mastercard, Walmart. Think Uber, AT&T, Netflix.
These are the kind of datasets that have massive disruptive implications for our world. They are a shift in what’s possible on the scale of the Industrial Revolution. They also have some major equity issues.
Big Data Power
In data science, the statistical strength of a given analysis is often limited by its sample size. If you want to answer a question about an entire population, you need data from a statistically relevant portion of that population. Large samples are inherently expensive, which makes answers about huge groups of people very difficult to achieve. Much of the focus of modern statistics has been on discovering and refining methods that achieve high reliability with affordable sample sizes. That research has also confirmed that the quality of your sample matters just as much as its quantity.
Imagine that you want to find out whether people in your town prefer shopping at Walmart or online. For the same cost, you can either survey a well-randomized sample of 100 people across your town, or you can camp out in the Walmart parking lot and ask 1,000 respondents. The parking-lot sample is ten times bigger, but by construction it reaches almost exclusively Walmart shoppers. Simply increasing sample size without any regard to equitable representation would dramatically skew your results.
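To make that concrete, here’s a tiny simulation in Python. Every number in it is invented purely for illustration: a town where 60% of people genuinely prefer shopping online, and a parking-lot sample we assume is 90% Walmart shoppers.

```python
import random

random.seed(42)

# A made-up town of 10,000 people: 60% genuinely prefer shopping
# online, 40% prefer Walmart. Purely illustrative numbers.
town = ["online"] * 6000 + ["walmart"] * 4000

def walmart_share(sample):
    """Fraction of respondents who say they prefer Walmart."""
    return sum(1 for answer in sample if answer == "walmart") / len(sample)

# Option A: a well-randomized sample of 100 residents.
random_sample = random.sample(town, 100)

# Option B: 1,000 respondents recruited in the Walmart parking lot.
# We assume (hypothetically) that 90% of the people you meet there
# prefer Walmart, whatever the town as a whole thinks.
parking_lot_sample = random.choices(
    ["walmart", "online"], weights=[0.9, 0.1], k=1000
)

print(f"True Walmart preference:      {walmart_share(town):.2f}")                # 0.40
print(f"Random sample of 100:         {walmart_share(random_sample):.2f}")       # ~0.40
print(f"Parking-lot sample of 1,000:  {walmart_share(parking_lot_sample):.2f}")  # ~0.90
```

The bigger sample gives you a more “precise” answer to the wrong question: it measures the preferences of Walmart’s parking lot, not of your town.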
Along comes Big Data, and we suddenly have enormous sample sizes available to us. Instead of using a tiny fraction of the entire population, Big Data offers us huge slices of our population to use as samples. For example, a national Gallup poll about a presidential election might have 1,500 respondents (keep in mind that these respondents are very carefully selected and the statistical methodology used to interpret the results is very robust), while Facebook has live data on around 244 million Americans. That’s a sample over 100,000 times larger.
The statistical strength you can achieve with such an enormous sample size, paired with the up-to-the-minute nature of much of this data, can make it feel like we can answer statistical questions with almost prophetic certainty. Smaller companies, local governments and NGOs are incredibly eager to harness the power of Big Data, and rightly so: it can offer incredible insight into policy decisions, impact studies and effectiveness. The tricky part is that Big Data is always collected with a specific intention, by a collector with a specific mandate. Nine times out of ten, that mandate is to make money.
Big Data fans see it as a silver bullet for equity issues, thanks to its sheer scale and its inhuman indifference to who is being counted. Amazon’s data collection algorithms are adjusted to maximize profits, not to maximize sales to a certain race or gender. If Walmart discovered its data collection process was ignoring all potential female customers, it would be changed immediately. And with a dataset that might include, say, 30% of all U.S. citizens, it’s easy to feel like the sample is so large that it must include at least some representation of every type of person in the population.
Two Issues with Big Data and Equity
Amazon generates data about Amazon customers. Phone apps generate data about people who own smartphones. Uber has data about people who ride in Ubers. Compared to a well-crafted traditional sample, Big Data has an inherent representation problem: it automatically excludes the people it isn’t concerned with.
Additionally, because the sample sizes are so large, Big Data concentrates the influence of the most prolific data providers. If you shop a lot on Amazon and take a lot of Uber rides, your data is counted far more heavily than that of someone who only has the resources to do those things occasionally, or never.
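One common way to correct this weighting problem is to re-weight records so that each person, rather than each transaction, counts once. Here’s a minimal sketch with made-up ride records (the rider IDs and destinations are invented for illustration):

```python
from collections import Counter

# Hypothetical ride records: (rider_id, destination).
# Rider "a" rides constantly; rider "c" rode once all year.
rides = [
    ("a", "downtown"), ("a", "downtown"), ("a", "downtown"),
    ("a", "downtown"), ("b", "downtown"), ("b", "suburb"),
    ("c", "suburb"),
]

# Raw trip counts: prolific riders dominate the picture.
raw = Counter(dest for _, dest in rides)
print(raw)  # Counter({'downtown': 5, 'suburb': 2})

# Simple correction: weight each ride by 1 / (rides by that person),
# so every *person* contributes equally, however often they ride.
rides_per_person = Counter(rider for rider, _ in rides)
weighted = Counter()
for rider, dest in rides:
    weighted[dest] += 1 / rides_per_person[rider]

print(dict(weighted))  # {'downtown': 1.5, 'suburb': 1.5}
```

Counted by trips, “downtown” looks more than twice as popular; counted by people, the two destinations are tied. Which view is right depends entirely on whether your question is about trips or about people.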
So representation and weight are two challenges to overcome with Big Data. Businesses have a mandate to make money, so those limitations may be acceptable to them; but how can someone who also cares about finding equitable solutions use this data?
The Best of Both Worlds
Let’s pretend you are the local government of the City of Toronto. You want to know where to expand your subway system: what new location makes the most sense for the greatest number of Torontonians? You have a very large dataset from your ‘swipe card’ entry system, so you can see all kinds of data from people on the subway, streetcars and buses. You’d also like to supplement your information with Uber’s massive dataset, which will show you a huge sample of citizens and where they use a form of transportation other than transit.
You know that neither of your Big Datasets represents everyone: you have no information on people who take neither transit nor Ubers, like drivers, pedestrians or people who can’t afford either option. You also know that these datasets concentrate the influence of the most frequent users, and you will have to account for that statistically. You have a limited budget. You could spend it on an expensive but rigorous survey that gets good representation but a smaller sample size, or you could use the money to access Uber’s massive, statistically powerful dataset; even with its equity issues, it might give you a firmer answer to your question.
Or you can harness the power of both. What Big Data offers is amazing efficiency. Yes, it is expensive to operate the massive systems that collect, store and analyse such a huge volume of data, but the cost per data point is many orders of magnitude cheaper than traditional survey methods. Those savings can be used to fill the representational gaps in Big Datasets with additional statistical methods: we can use the money saved to conduct a smaller, more targeted survey that gets answers specifically from the people our Big Data doesn’t represent.
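As one deliberately simplified sketch of what that combination can look like: weight each data source by the share of the population it actually reaches. Every number below is invented, and a real analysis would use proper survey weights and uncertainty estimates rather than this back-of-the-envelope blend.

```python
# From the Big Data sources: where transit and Uber users travel.
# (Invented shares of demand for a new subway stop, by area.)
big_data_demand = {"east_end": 0.50, "north_end": 0.30, "west_end": 0.20}

# From a small, targeted survey of the people the Big Data misses:
# drivers, pedestrians, and people who can't afford either option.
survey_demand = {"east_end": 0.20, "north_end": 0.30, "west_end": 0.50}

# Assume (hypothetically) that 70% of residents show up in the Big
# Data sources and 30% are only reachable through the survey.
covered, uncovered = 0.70, 0.30

combined = {
    area: covered * big_data_demand[area] + uncovered * survey_demand[area]
    for area in big_data_demand
}
print(combined)  # roughly: east_end 0.41, north_end 0.30, west_end 0.29
```

Notice how the west end, nearly invisible in the Big Data alone, moves from a distant third place to a near tie once the people the Big Data misses are counted.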
We can use the transit data and the Uber data, and conduct our own research in a more targeted way, to make sure we’re making the fairest decision for all stakeholders without ignoring the predictive or authoritative power of the large datasets. Ignoring Big Data today is like ignoring steam engines in favor of a horse and cart: the difference in power and efficiency is inarguable. On the other hand, assuming that Big Data will automatically solve equity issues when it wasn’t designed to do so is wishful thinking. By focusing on equity and using the power of rigorous statistical methodology to flesh out and re-weight Big Data, we can have the best of both worlds.