Click to learn more about author Bob Vecchione.
The inspiration for sharing this perspective came from a recent data security initiative with a top 10 North American financial institution that has operationalized its Data Lakes.
Due to enhanced visibility of the data on the business side, and at the behest of the data security team, this particular data marketplace team was challenged to provide assurance that certain data attributes would be identified and “handled” by the intelligence within the Data Marketplace. The client team set out to solve this problem by creating a “short listed” Proof of Concept, challenging technology vendors to demonstrate automated identification and action on potentially sensitive data.
In short, the tests included ingesting a provided sample data set, and then demonstrating the platform’s ability to radically simplify and accelerate the way this organization could manage, prepare and deliver self-service, business-ready data INCLUDING identification and governing of potentially sensitive data.
Information Governance covers a spectrum of topics including the accuracy, integrity, consistency, accessibility, privacy, and security of information across the enterprise. We will not discuss every aspect of Data Governance here, but instead focus on data security and what that means in this context.
Spotlight PII: Personally Identifiable Information
By design, data marketplaces deliver high levels of self-service data to business users, helping drive faster insights within the business. Easier access to more data only makes it more important for corporate security teams to properly secure all this newly available information.
If you are in the process of building out a well-designed and effectively implemented Data Marketplace, one of the many challenges you are likely facing is how to secure your data. All of the standards apply: AD, ACLs, Kerberos, encryption, and so on. However, an often-overlooked aspect of data security is: How do you even know what data to secure? Some take the approach of locking everything down, which restricts access for the very users who need the data and sharply diminishes the data’s value. Others define only the attributes that they “think” are sensitive, which leaves unknown attributes unsecured when they should in fact be protected.
“Through 2018, 90% of deployed data lakes will be useless as they are overwhelmed with information assets captured for uncertain use cases.*”
To help remedy this, choose your solution wisely. There are a few exemplary providers in the industry that do PII detection, and do it well. Look to those who have developed a process and technique as a part of data on-boarding, detecting patterns that may be sensitive early on, and at a field level. And it’s not just for security reasons. Many challenges that plague data lake enterprises have been solved or mitigated by managing all data, including data access rights, from the point of ingestion.
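To make field-level detection at on-boarding concrete, here is a minimal sketch in Python. It is not how any particular vendor does it: the pattern set, the `detect_pii_fields` helper, the 80% match threshold, and the sample records are all assumptions chosen for illustration, and a real deployment would need a far broader and carefully vetted pattern library.

```python
import re

# Hypothetical, deliberately small pattern set (illustration only).
PII_PATTERNS = {
    "email": re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "us_phone": re.compile(r"^\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}$"),
}


def detect_pii_fields(rows, threshold=0.8):
    """Tag each field whose sampled values mostly match a PII pattern.

    Returns {field_name: pii_type} for fields where at least `threshold`
    of the non-empty values match one of the patterns above.
    """
    tags = {}
    fields = list(rows[0].keys()) if rows else []
    for field in fields:
        values = [
            str(r.get(field)).strip()
            for r in rows
            if r.get(field) not in (None, "")
        ]
        if not values:
            continue
        for pii_type, pattern in PII_PATTERNS.items():
            hits = sum(1 for v in values if pattern.match(v))
            if hits / len(values) >= threshold:
                tags[field] = pii_type
                break  # first matching pattern wins in this sketch
    return tags


# Made-up sample records; note that the field names give no hint of sensitivity.
sample_rows = [
    {"cust_ref": "123-45-6789", "contact": "ann@example.com", "city": "Boston"},
    {"cust_ref": "987-65-4321", "contact": "bob@example.org", "city": "Austin"},
]

detected = detect_pii_fields(sample_rows)
print(detected)
```

The point of scanning values rather than column names is that a field called `cust_ref` would never appear on a hand-written list of sensitive attributes, yet its contents match a Social Security number pattern.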
Data Professionals at All Levels: You Need to Own It.
CIOs, CTOs, Chief Data Officers and those involved in, or responsible for, information governance and compliance practices are feeling the heat more than ever. Why? Because most Data Lake implementations are focused on storing and processing data, not governing it.
Statistics also suggest that, more likely than not, silos exist within IT organizations, often staffed by individuals with varying levels of governance skill, adding further vulnerability and risk to the business at large.
“Fewer than 10% of Data Lake organizations have formalized their approach to governance.*”
The Blend: Metadata, Governance & Security
Over the last two to three years, many organizations have either built out, or are in the process of building out, a data lake. The appeal of going into the lake is strong: it promises self-service, on-demand access to all data regardless of where it lives. But too often, IT managers soon realize a self-service Data Management model is not achievable because internal data security requirements have not been met.
This means the well-intentioned Data Lake has become more of a liability than a shareable repository, with a single approved user, or only a few, left to provide secured data on demand. This defeats the purpose of the lake to begin with, and it is expensive, a resource drain, and a downer for information seekers.
Big Data requires us to rethink Data Governance from the ground up. Instead of physically separating sandbox and production data, Big Data Governance logically controls access and usage as data matures from “raw” to “ready.” How can you tell if data is ready for production? Look to the metadata. Any Big Data platform supporting production usage data must have metadata tracking the lifecycle of data ingestion, security, validation, preparation, and use.
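As a rough illustration of lifecycle metadata, the Python sketch below records each step a data set takes on its way from “raw” to “ready.” The stage names, the `DatasetMetadata` class, and the rule that stages cannot be skipped are assumptions made for this example, not features of any specific platform.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical stage names, loosely following ingestion, security,
# validation, preparation, and use.
LIFECYCLE = ["raw", "secured", "validated", "prepared", "ready"]


@dataclass
class DatasetMetadata:
    name: str
    stage: str = "raw"
    history: list = field(default_factory=list)

    def advance(self, new_stage, actor):
        """Move to the next stage and record who did it and when."""
        current = LIFECYCLE.index(self.stage)
        # index() raises ValueError for an unknown stage name.
        if LIFECYCLE.index(new_stage) != current + 1:
            raise ValueError(
                f"cannot move {self.name!r} from {self.stage!r} to {new_stage!r}"
            )
        self.history.append(
            (self.stage, new_stage, actor, datetime.now(timezone.utc))
        )
        self.stage = new_stage

    def is_production_ready(self):
        return self.stage == "ready"


trades = DatasetMetadata("trades")
for next_stage in LIFECYCLE[1:]:
    trades.advance(next_stage, actor="onboarding_job")
```

Because sandbox and production data are separated logically rather than physically, the answer to “is this ready for production?” is a lookup against the metadata (`trades.is_production_ready()`), and the `history` list doubles as an audit trail.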
Metadata = Better Data
Metadata needs to manage data access rights, capture data profiling results, and record commentary by data developers and end users. Metadata stores the policies that define production readiness, and is able to enforce them. Without metadata, the lake becomes virtually unusable and a significant security risk.
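One way to picture metadata that both stores and enforces access rights is the short Python sketch below. Everything in it is hypothetical: the sensitivity levels, the role names, the `apply_policy` helper, and the choice to mask values rather than drop them are illustrative assumptions only.

```python
# Hypothetical sensitivity levels, ordered from least to most sensitive.
LEVELS = ["public", "confidential", "restricted"]

# Field-level metadata, e.g. populated by profiling during on-boarding.
FIELD_METADATA = {
    "cust_ref": {"sensitivity": "restricted"},
    "contact": {"sensitivity": "confidential"},
    "city": {"sensitivity": "public"},
}

# Highest sensitivity level each (made-up) role may see.
ROLE_CLEARANCE = {
    "analyst": "public",
    "steward": "confidential",
    "security_officer": "restricted",
}


def apply_policy(row, role):
    """Return a copy of `row` with fields above the role's clearance masked."""
    clearance = LEVELS.index(ROLE_CLEARANCE[role])
    masked = {}
    for name, value in row.items():
        # Fields with no metadata default to the most restrictive level.
        sensitivity = FIELD_METADATA.get(name, {}).get("sensitivity", "restricted")
        visible = LEVELS.index(sensitivity) <= clearance
        masked[name] = value if visible else "***"
    return masked


record = {
    "cust_ref": "123-45-6789",
    "contact": "ann@example.com",
    "city": "Boston",
    "notes": "call back Tuesday",
}
analyst_view = apply_policy(record, "analyst")
steward_view = apply_policy(record, "steward")
```

Treating untagged fields as restricted is a deliberate choice in this sketch: an attribute nobody has classified yet stays hidden until the metadata says otherwise, without locking down the fields already known to be safe.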
Few people actually know how to map a company’s internal security policies to data in a meaningful way. Let today’s commercial technologies remove the manual guesswork, as no one can possibly know the content of every field across every source. Leveraging rich metadata with automation through pattern recognition serves as the foundation for driving big strategic initiatives throughout the business.
As you strategize about your journey toward self-service nirvana, security that is driven by metadata and governed through well-defined processes is the way to not only mitigate risk but also deliver on the true promise of going beyond the traditional Data Lake to a self-service data marketplace.