Reusing data is a fundamental part of artificial intelligence and machine learning. Yet when we collect data for one purpose and use it for another, we may cross both legal and ethical boundaries.
How can we address the ethics of reusing data?
Understand Your Data
Before we address the issue of reuse, we should first seek to understand the original context of our dataset. Why do we have this data? Where did it come from? Who collected it? For what purpose? What else do we know about it?
One tool that can help data scientists answer these questions is Datasheets for Datasets, first proposed by a group of researchers led by Dr. Timnit Gebru. A datasheet records vital information about a dataset, addressing the kinds of social questions we’ve posed as well as technical details about the data. You can create your own datasheet, or you can find an example of one in Ethically Aligned AI’s Ethics Toolkit.
Your organization might also have data governance software tools that can provide some or perhaps all of these details. Knowing your data lineage, data quality, and other pertinent details can help you to make better decisions about the fitness of the data as you assess its usefulness in the context of your new use case.
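To make this concrete, here is a minimal sketch of what datasheet-style metadata might look like if kept alongside a dataset. The field names and the `flag_reuse_questions` helper are hypothetical illustrations, loosely inspired by the questions a datasheet answers; they are not a prescribed schema.

```python
# Hypothetical datasheet-style metadata for a dataset. Field names are
# illustrative only, inspired by the questions a datasheet answers.
datasheet = {
    "name": "customer_feedback_2021",
    "motivation": "Measure satisfaction with support tickets.",
    "collected_by": "Customer Success team",
    "collection_method": "Voluntary post-ticket email survey",
    "contains_pii": True,
    "consented_purposes": ["service_improvement"],
    "collection_date": "2021-06-01",
}

def flag_reuse_questions(sheet, proposed_purpose):
    """Return a list of issues to resolve before reusing the data."""
    issues = []
    if proposed_purpose not in sheet["consented_purposes"]:
        issues.append(f"No recorded consent for purpose: {proposed_purpose}")
    if sheet["contains_pii"]:
        issues.append("Dataset contains PII; check privacy obligations.")
    return issues

print(flag_reuse_questions(datasheet, "ai_training"))
```

Even a lightweight record like this surfaces the provenance and consent questions before a new use case gets underway.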
Don’t Abuse Consent
Many of the legal questions around reuse of data center on whether or not the data contains personally identifiable information (PII). You’ll need to understand and abide by privacy obligations per your local jurisdiction as well as any other relevant regulations for your use case.
An ethically controversial practice is the blanket consent clause: an overly broad clause that essentially gives an organization the ability to use data in a wide variety of contexts that remain unspecified, or only vaguely specified, at the point where consent is obtained. These clauses are often buried in lengthy, difficult-to-understand privacy policies or terms-and-conditions agreements, typically couched in language about “using data to improve all current and future products and services” or “sharing data with affiliates and partners.” Most policies include that kind of language because it serves the interests of the company by providing broad legal cover. However, we need to consider whether that really constitutes meaningful, informed consent.
Respect Purpose Limitations
One important principle in data ethics is purpose limitation: the idea that we should use data only for the purposes for which it was collected. If someone has agreed to provide data to our organization to access a specific benefit, or to allow us to provide a good or service, it’s not reasonable to assume we have carte blanche to use their data however else we see fit. In assessing reuse, we might ask:
- Why do we believe this data is fit for our new purpose?
- Could the purpose for which we intend to reuse this data cause harm to the data subject or to other people?
- How does our new purpose impact or change the accuracy, integrity, or veracity of the data?
- What data do we intend to combine with this data in reusing it? What ethical issues might be raised in joining this data?
- How old is this data? Is it past its shelf-life?
- Might the data subject find this reuse objectionable on moral grounds?
These are just some examples of ethical questions we might ask as we assess the reuse of the data.
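Purpose limitation can also be enforced operationally. The sketch below is a hypothetical illustration, assuming each record carries a `consented_purposes` field recording what its subject agreed to; the record structure and `usable_for` function are not from any specific system.

```python
# Hypothetical records: each row carries the purposes its subject
# consented to when the data was originally collected.
records = [
    {"id": 1, "email": "a@example.com", "consented_purposes": {"billing"}},
    {"id": 2, "email": "b@example.com", "consented_purposes": {"billing", "analytics"}},
]

def usable_for(records, purpose):
    """Keep only records whose recorded consent covers the new purpose."""
    return [r for r in records if purpose in r["consented_purposes"]]

# Only record 2 consented to analytics, so only it survives the filter.
print([r["id"] for r in usable_for(records, "analytics")])  # → [2]
```

Filtering at the record level like this means a new use case defaults to excluding anyone who never consented to it, rather than relying on subjects to notice and object.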
Repurposing Data to Train AI
One new use case driven by generative AI is the reuse of customer data to train AI models. Some organizations have been quietly changing their policies to disclose this new use case. However, the Federal Trade Commission (FTC) has said this could constitute an unfair or deceptive practice:
“It may be unfair or deceptive for a company to adopt more permissive data practices—for example, to start sharing consumers’ data with third parties or using that data for AI training—and to only inform consumers of this change through a surreptitious, retroactive amendment to its terms of service or privacy policy.” (FTC)
There are also many ethical questions raised around the licensing of communally created data being repurposed as training data. Reddit has signed a $60 million/year deal with Google to sell its users’ data to train AI models, while Automattic, owner of Tumblr and WordPress.com, is moving down this pathway with OpenAI. The onus appears to be on individual users to “opt out” if they don’t want to take part, but is it practical or reasonable to expect people to do so? This is a use case that would not have been clear to anyone who launched a Tumblr blog in 2007!
Ethics of the Data Broker Economy
Some ethicists have taken a very strong stance against the data broker economy itself, saying it should be abolished. Spanish philosopher Carissa Véliz covers this in her book “Privacy is Power”:
“The logic of the data economy is a perverse one: mine as much personal data from people as possible, at any cost. And we are paying too steep a price for digital tech. We are no longer treated as equals; we are each treated according to our data. We don’t see the same content, we don’t pay the same price for the same product, and we are not offered the same opportunities. The data economy is undermining equality.” (Véliz, 2020)
She isn’t alone. Other academic researchers, such as Nick Couldry, Wendy H. Wong, and Elizabeth Renieris, have raised similar concerns.
What About Your Organization?
Companies sitting on vast amounts of customer and employee data might also be thinking about using it for a range of other purposes, including to train AI models. Before doing that, however, it’s a good idea to seek legal advice as to whether that is permissible given your prior privacy policies. Repercussions for getting this wrong could include class-action lawsuits and penalties from regulators, up to and including algorithmic disgorgement: the enforced destruction of your model.
Ethically, consider consulting directly with your stakeholders about your intentions to use their data for this new purpose (or any other new purpose). Requiring an “opt in” for a new use provides more agency to your stakeholders, but at minimum having an “opt out” allows them to exercise some level of control. Providing meaningful mechanisms to exert agency and seek redress is a core part of good data ethics practices and responsible AI.
It’s also worth noting that going the ethical route will likely also address your legal obligations because you’d be obtaining new express consent to use the data. If you don’t get stakeholder buy-in, that is also good information. It’s an opportunity to reflect on and consider whether or not your proposed use is actually beneficial for your stakeholders and community.
In thinking about whether you should repurpose data, and how to do it ethically, there is no one-size-fits-all answer. This is why having an AI and/or data ethics committee in place to help guide your ethical deliberations is a strongly recommended best practice.
Send Me Your Questions!
I would love to hear about your data dilemmas or AI ethics questions and quandaries. You can send me a note at hello@ethicallyalignedai.com or connect with me on LinkedIn. I will keep all inquiries confidential and remove any potentially sensitive information, so please feel free to keep things high level and anonymous as well.
This column is not legal advice. The information provided is strictly for educational purposes. AI and data regulation is an evolving area and anyone with specific questions should seek advice from a legal professional.