By Brian Platz
From securing hackable medical devices to combating fake news, data provenance is growing in importance. In addition to enabling trust and security, data provenance creates efficiencies for data scientists and opens up new lines of business. In a business environment where online trust is low, regulations such as GDPR are forcing compliance, and AI is more important than ever, provenance plays a foundational role.
What Is Data Provenance?
Provenance is defined by W3C as “information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability, or trustworthiness.” By knowing the history of data, in other words, you know whether to trust it. Provenance comes in the form of metadata that details a data packet’s lineage: its origin and changes made to it, with timestamps. Users can more easily track down errors, inaccuracies, flaws, and fraud, leading to better data-analytics outcomes.
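To make the idea of lineage metadata concrete, here is a minimal sketch of a provenance record that stores an origin plus a timestamped list of changes. The class names, field names, and sample values are hypothetical, chosen only for illustration; they are not taken from the W3C specification or from any particular product.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Change:
    """One modification to the data: what was done, by whom, and when."""
    activity: str
    agent: str
    timestamp: datetime


@dataclass
class ProvenanceRecord:
    """Lineage metadata for a single piece of data (hypothetical schema)."""
    entity_id: str
    origin: str
    created_at: datetime
    changes: list[Change] = field(default_factory=list)

    def record_change(self, activity: str, agent: str) -> None:
        # Every change is appended with a UTC timestamp; nothing is overwritten.
        self.changes.append(Change(activity, agent, datetime.now(timezone.utc)))

    def lineage(self) -> list[str]:
        # Human-readable history: the origin first, then each change in order.
        steps = [f"{self.created_at.isoformat()} created at {self.origin}"]
        steps += [
            f"{c.timestamp.isoformat()} {c.activity} by {c.agent}"
            for c in self.changes
        ]
        return steps


# Hypothetical sensor reading that passes through two processing steps.
record = ProvenanceRecord("reading-001", "field-sensor-17", datetime.now(timezone.utc))
record.record_change("unit conversion", "etl-job")
record.record_change("outlier check", "analyst")

for step in record.lineage():
    print(step)
```

Because changes are only ever appended, the record doubles as an audit trail: anyone reading it can see where the data came from and what happened to it along the way.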
In our world flooded with data, provenance generates widespread benefits. Here are five ways that data provenance can transform your business:
1. Maximize metadata
Quality metadata is essential for reusing and repurposing data sets, yet its full value is rarely realized. Data teams spend large amounts of time cleaning and organizing data. Once used, that same data is often stuffed and forgotten in online storage.
Data provenance, by tracking the history of every piece of data, essentially automates part of the metadata creation process, cutting time spent cleaning and organizing data. Lineages also make it easier to access metadata, reuse old data, and combine data sets in novel ways. Machine learning applications, for example, can be trained on data sets that are pre-verified to be clean and quality-assured, making model building faster and easier for data scientists.
2. Create new lines of business
By adding sensors to everything from farm equipment to the watches on our wrists, the Internet of Things (IoT) is creating a flood of data. The flood is dirty: Most organizations find themselves with data of questionable origin, making it hard to create viable business models.
Up to 80% of location data could be fake, according to Digiday. Malfunctioning sensors, anomalies, and outliers add to messy feeds. Identifying and removing fake or odd data by hand takes so much time that it slows the ability to launch data-driven products and services. Data provenance helps organizations quickly identify fake or odd data, creating a faster way to clean data and create actionable business models.
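As a rough illustration of how provenance can speed up cleaning, the sketch below separates readings that trace back to a registered source and fall in a plausible range from those that do not. The sensor names, the registry, the field names, and the valid range are all assumptions made up for this example.

```python
# Hypothetical registry of sensors the organization actually deployed.
KNOWN_SENSORS = {"tractor-12", "tractor-15"}

# Hypothetical feed: each reading carries the source it claims to come from.
readings = [
    {"source": "tractor-12", "soil_moisture": 0.31},
    {"source": "unknown-device", "soil_moisture": 0.29},  # untraceable origin
    {"source": "tractor-15", "soil_moisture": 7.5},       # implausible value
]


def split_by_provenance(readings, known_sources, valid_range=(0.0, 1.0)):
    """Return (accepted, flagged) lists based on origin and plausibility."""
    low, high = valid_range
    accepted, flagged = [], []
    for reading in readings:
        traceable = reading.get("source") in known_sources
        plausible = low <= reading["soil_moisture"] <= high
        if traceable and plausible:
            accepted.append(reading)
        else:
            flagged.append(reading)
    return accepted, flagged


accepted, flagged = split_by_provenance(readings, KNOWN_SENSORS)
print(len(accepted), "accepted;", len(flagged), "flagged for review")
```

Flagged readings are set aside rather than deleted, so a person can still review them; the point is only that the sorting no longer has to be done by hand.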
3. Secure systems
Pacemaker hacks, one of the most terrifying examples of a breach in data security, also illustrate how provenance can be life-saving. A certain model of pacemaker was recently found to be hackable. Someone within Bluetooth range of the patient could break into the pacemaker’s unencrypted communication protocol, which didn’t need authentication, and create heart abnormalities or stop the pacemaker from working. The pacemaker has since been recalled, but the need to secure medical devices is clear.
Data provenance would make it immediately noticeable when new code enters a system. Cybersecurity strategies must therefore include data provenance checks to mitigate the injection of “weaponized” data and, ultimately, to reject “poisoned” data.
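One simple form a provenance check could take is sketched below: a payload is accepted only if its SHA-256 fingerprint is already registered against the source it claims to come from. The registry, the source name, and the payloads are hypothetical. A production system would more likely rely on digital signatures than on a plain lookup table, so treat this as an illustration of the principle rather than a recommended design.

```python
import hashlib


def fingerprint(payload: bytes) -> str:
    """Return the SHA-256 hex digest that identifies this exact payload."""
    return hashlib.sha256(payload).hexdigest()


# Hypothetical registry mapping known-good fingerprints to their publisher.
TRUSTED_RELEASES = {
    fingerprint(b"firmware v1.2"): "vendor-release-server",
}


def check_provenance(payload: bytes, claimed_source: str) -> bool:
    """Accept a payload only if its fingerprint is registered to that source."""
    registered_source = TRUSTED_RELEASES.get(fingerprint(payload))
    return registered_source is not None and registered_source == claimed_source


# An untouched release passes; a tampered one, or one from an unknown
# source, is rejected before it reaches the device.
print(check_provenance(b"firmware v1.2", "vendor-release-server"))
print(check_provenance(b"firmware v1.2 + injected code", "vendor-release-server"))
print(check_provenance(b"firmware v1.2", "unknown-laptop"))
```

Any change to the payload, even a single byte, produces a different fingerprint, which is what makes injected code stand out.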
4. Restore credibility
From bogus scientific research results to questionable news sites, data has a trust problem. Compliance regulations are growing in response. Once-promising business models are having a reckoning. For example, the entire advertising industry is, as Digiday puts it, “star(ing) down the barrel of big fines for collecting and processing data illegally.” Data-driven advertising suddenly looks a lot less reliable amid revelations of fake data and tightening regulation.
Data provenance can restore credibility by allowing consumers of information (readers, subscribers, customers, etc.) to easily track data to its source. It can also enable new kinds of badges or certifications of authenticity. With provenance, data is also more easily searchable and traceable, so businesses facing compliance demands can quickly, well, comply.
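Tracking data to its source amounts to following "derived from" links backward until there is nothing left to follow. The sketch below assumes each item records at most one parent; the item names and the dictionary layout are hypothetical.

```python
# Hypothetical lineage: each item maps to the item it was derived from.
derived_from = {
    "published-chart": "summary-table",
    "summary-table": "cleaned-survey",
    "cleaned-survey": "raw-survey",
    "raw-survey": None,  # an original source has no parent
}


def trace_to_source(item, links):
    """Follow parent links from an item back to its original source."""
    path = [item]
    seen = {item}
    while links.get(path[-1]) is not None:
        parent = links[path[-1]]
        if parent in seen:
            # A loop means the lineage is corrupt and cannot be trusted.
            raise ValueError(f"cycle detected at {parent!r}")
        path.append(parent)
        seen.add(parent)
    return path


print(" <- ".join(trace_to_source("published-chart", derived_from)))
```

The same walk could back a compliance request: given any published figure, it lists every intermediate data set between that figure and the original collection.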
5. Improve AI
Artificial intelligence only lives up to its name when it ingests quality data sets. Fed poor data, AI fails. Provenance helps ensure that data is relevant, complete, and traceable. AI, in turn, can use the right data at the right time, draw on statistically complete data sets, and show its work at every step of the way. That last capability lets engineers inquire into the AI’s reasoning, averting “black box” outcomes such as the social media algorithms that prioritize extremist content.
As automation continues to tear through the economy, data provenance could even provide new sources of income. Provenance enables new business models that compensate the rightful owners of data for sharing.
Anyone who shares their personal data with a company has a right to be compensated for it, but this is not a widespread practice. With provenance, social media users, for example, could be paid to share derivatives of their identity, instead of being nudged into doing so in exchange for the privilege of using a platform. Farmers who share sensor data with crop analytics platforms could be similarly compensated.
In other words, making data traceable through a lineage opens up all kinds of new opportunities. It is a core element of Web3: the coming version of the internet in which data and machine-to-machine communication play a central role. Clearly, provenance has a prominent future.