It is tempting, when pursuing a project that requires heavy use of data, to focus just on that particular project: collect or locate the data needed to answer the question at hand, work out the conclusions based on that data, act on the conclusions, and discard the data. But there has been a growing awareness, both in scientific investigation and in business, that this way of thinking about data is obsolete; there is value to data beyond its initial use, and locating and cataloging data is as important as collecting it in the first place. In 48 BC, the famous Library of Alexandria burnt to the ground, destroying for all time priceless knowledge about the ancient world. Today, we are living through our own burning of Alexandria. Each day, countless datasets are lost as we fail to retain them for future use. And just as with the Library of Alexandria, we don’t even have a clear idea of what we have lost.
While scientific funding agencies and data officers in enterprises have been aware of this issue for a while, the importance of retaining data only became clear to the general public at the beginning of the COVID-19 pandemic. At that time, there was considerable confusion about transmission of the disease, its symptoms, its long-term effects, and the dangers it posed to particular sectors of the public. Data about these things was collected by agencies and researchers around the world, and was used to guide public policy about masks, venue closures, treatment management, and more. Public sentiment quickly went from indifference (“Why should I care about reusing data?”) to indignation (“Why can’t we get the data we need now?”). The COVID-19 pandemic is not the only major societal issue that can benefit from usable and reusable data; food scarcity, climate change, and the advancement of underdeveloped countries are all issues that require widespread and systematic reuse of data.
With important issues like these at stake, what can we do about how data is managed on a global scale? Governments and scientific funding agencies already mandate, in many circumstances, that data be made available on a web page somewhere, and many such repositories exist today. But in too many situations, when a project finishes or funding moves on, these websites are shut down, and the data goes the way of the knowledge of Alexandria. The datasets that are published often lack the metadata that would allow them to be found easily with the search engines we use on a daily basis. Is there anything we can do to preserve data so that our future selves – not to mention future generations – will be able to use it?
FAIR Data
In 2016, the concept of FAIR data was introduced in the journal Scientific Data as a response to the “urgent need to improve the infrastructure supporting the reuse of scholarly data.” But the applicability of FAIR data principles goes well beyond scholarly data; these principles provide guidelines for how we can make all data more durable.
The “F” in FAIR stands for “findable.” We are all accustomed to finding web pages using search engines, but many datasets are distributed as spreadsheets, which typically don’t include the sorts of descriptions that search engines can index. A simple action to make data more findable is to include a short description of the spreadsheet, for instance, “Vaccine distributed and administered counts as reported to CDC by US jurisdictions.”
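One common way to make such a description visible to search engines is to publish it as structured metadata alongside the data. The sketch below is a minimal example, assuming a hypothetical dataset name and license: it builds a schema.org Dataset record in Python, of the kind that Google Dataset Search indexes when it is embedded in the page that hosts the spreadsheet.

```python
import json

# A minimal schema.org "Dataset" description. The name and license here
# are assumptions for illustration; the description text is the one
# suggested above.
description = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "COVID-19 Vaccination Counts by US Jurisdiction",  # hypothetical
    "description": ("Vaccine distributed and administered counts "
                    "as reported to CDC by US jurisdictions."),
    "license": "https://creativecommons.org/licenses/by/4.0/",
}

# Embedding this output in the hosting page inside a
# <script type="application/ld+json"> tag makes the dataset findable.
print(json.dumps(description, indent=2))
```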
The “A” in FAIR stands for “accessible.” This is often the most difficult to manage long-term; you can put data onto a web server, but if you have to pay for that server from project funds, it will go away when the project ends. Fortunately, it is easy to host datasets of considerable size for free on my own company’s platform, data.world; a Google account is all you need to sign up, and you can then host original content or mirror content from another source. There’s really no excuse for data to be lost when funding for a project ends.
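To illustrate the payoff of hosting data this way, here is a sketch using data.world’s open-source Python client. The dataset identifier is hypothetical, and the sketch assumes you have installed the datadotworld package and configured an API token.

```python
# pip install datadotworld
# The client reads an API token configured beforehand with `dw configure`.
import datadotworld as dw

# Load a hosted dataset by its "owner/dataset" key (hypothetical here).
dataset = dw.load_dataset("example-org/vaccine-counts")

# Each table in the dataset is exposed as a pandas DataFrame.
for name, frame in dataset.dataframes.items():
    print(name, frame.shape)
```

Because the data lives on a free, shared platform rather than on a project-funded server, a script like this keeps working long after the original project ends.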
The “I” in FAIR stands for “interoperable.” The simplest and most common issue with interoperability is figuring out what columns in a spreadsheet mean. Even something as simple as adding a notation to a column like “two-letter state abbreviation” enhances the interoperability of datasets. If you go further, and add source information for the column, e.g., “two-letter state abbreviation according to ISO 3166-2,” then the data becomes even more interoperable.
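Such column annotations can also be published in machine-readable form. One standard option is the W3C “CSV on the Web” (CSVW) format; the sketch below, with hypothetical file and column names, writes a sidecar metadata file describing each column of a CSV.

```python
import json

# CSVW sidecar metadata for a hypothetical vaccine-counts.csv.
metadata = {
    "@context": "http://www.w3.org/ns/csvw",
    "url": "vaccine-counts.csv",
    "tableSchema": {
        "columns": [
            {
                "name": "state",
                "datatype": "string",
                "dc:description": "Two-letter state abbreviation according to ISO 3166-2",
            },
            {
                "name": "doses_administered",
                "datatype": "integer",
                "dc:description": "Cumulative vaccine doses administered",
            },
        ]
    },
}

# By CSVW convention, the sidecar is published next to the CSV with
# "-metadata.json" appended to the file name.
with open("vaccine-counts.csv-metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```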
Finally, the “R” in FAIR stands for “reusable.” In the FAIR context, this refers to permission to use the data: publish, along with the data, guidelines (usually in the form of licenses) for how it may be used. An upside of publishing data with a very permissive license is that other community members might host copies of the data, improving its accessibility long-term.
Making data FAIR isn’t an easy task; it takes some commitment from someone who is willing to annotate data, host it, and think about policies for how it is to be reused. But as costly as it may be to make data FAIR, it is even costlier not to make it FAIR. Recreating data that is no longer available costs a lot more than making it FAIR in the first place – and FAIR data keeps paying benefits over and over again.
What Can I Do?
A journey of a thousand miles begins with a single step. Making and maintaining FAIR data might seem like a daunting task, but getting started is easy. You can create a dataset and copy in existing data or add your own (don’t forget to include any relevant licensing information as well!). Our data community currently contains over half a million datasets – any of which can interoperate with any other. When you add your data, you have taken your first step toward making the world’s data more FAIR.
In this short piece, I’ve barely scratched the surface of how one can make data FAIR; in fact, there is a bit of a cottage industry of companies that will help an organization go well beyond what we have outlined here. But one of the upsides of the FAIR data principles is that they don’t apply only to data originators and managers. FAIR data practices can also be followed by data users, and even bystanders. As long as the originator of the data provides sufficient permission (the “R” in FAIR), anyone on the web can supply the other three: you can host a copy of the data, you can describe the data, and you can annotate its columns and values. Even these small steps will contribute to the durability of data on the web.
Why would you want to do this? Well, why does anyone contribute to the community they live in? In civic life, we call someone who gives back to their society a good citizen; the FAIR practices provide an outline of how to be a good data citizen. A good citizen doesn’t make contributions exclusively for their own benefit; they make them so that society will be a better place for everyone.
Dean Allemang is the principal solutions architect at data.world.