I was once working with a dataset that had a variable called “Group Gender Composition.” The options were “Men”, “Women”, or “Mixed.” I was analyzing group process and productivity broken out by group gender composition. I also had the dataset that listed the members of each group. When I cross-referenced the two, I found that some of the groups marked “Women” in the Group Gender Composition variable included men on the roster, and vice versa. When I spoke with the people who collected the data, they told me that the variable called “Group Gender Composition” reflected the data collector’s opinion about whether the men or the women were contributing the most work to the group. This was one of my earliest wake-up calls that it’s not ethical to use a dataset without taking the time to understand what the data means.

After spending several years developing tools for digging into the background of datasets, it also became clear to me that this was not just an ethics issue but also an equity issue. Once I spent time understanding what was contained in the data, where it came from, how it was measured, who collected it, and much more, I realized that many of the characterizations and results based on datasets were wildly inequitable. For example, datasets that were being used to calculate population-level rates did not include large portions of that population, and the people who were most often excluded were the most vulnerable or marginalized. In other cases, data on attitudes were collected by people who were so much higher in status than the respondents that the answers were clearly skewed towards the dominant view.

Building a data biography, a comprehensive background of the conception, birth, and life of any dataset, is an essential step along the path to equity in data science. This includes all data: data that you collected, data that comes from large trusted sources, data that comes from open data libraries, data that comes from peer-reviewed research. All data.

We have developed two versions of the data biography: a short version and a comprehensive version.

The short version of the data biography covers the basics. It consists of five core questions:

Who:

Who collected the data?

Who owns the data?

How:

What were the methods behind the data collection design and process?

Where:

In what locations was the data collected?

Where is the data stored?

Why:

For what purpose was the data collected?

When:

When was the data collected?
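
For readers who keep their documentation alongside their code, the questions above could also be stored as a small structured record. The following is only a minimal sketch in Python, not the template or the tool described in this post: the class name `ShortDataBiography`, its field names, and the `unanswered` helper are all hypothetical, and the sample answers are placeholders.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ShortDataBiography:
    """Hypothetical record of answers to the short data biography questions."""

    who_collected: Optional[str] = None    # Who collected the data?
    who_owns: Optional[str] = None         # Who owns the data?
    how_collected: Optional[str] = None    # Methods behind the collection design and process
    where_collected: Optional[str] = None  # In what locations was the data collected?
    where_stored: Optional[str] = None     # Where is the data stored?
    why_collected: Optional[str] = None    # For what purpose was the data collected?
    when_collected: Optional[str] = None   # When was the data collected?

    def unanswered(self) -> List[str]:
        """Return the names of questions that are still blank."""
        return [
            f.name
            for f in fields(self)
            if not (getattr(self, f.name) or "").strip()
        ]


# Placeholder answers, for illustration only.
bio = ShortDataBiography(who_collected="Field research team", when_collected="2019")
print(bio.unanswered())
```

Listing the blank questions makes it easy to see what still needs to be tracked down before the dataset is used in an analysis.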

You can download a copy of a free template to get you started here.

We are also developing a much more detailed data biography tool. An interactive online version is currently in beta and you can test it here. If you’d prefer to download a static version you can find that here.