Team Data, and data architects in particular, whatever their official titles may be, has a job to do: take more aggressive steps to protect data assets on behalf of customers and constituents.
The responsibility falls to data architects because few in most organizations are thinking about protecting data from the data point of view, so to speak. The lawyers and the security group think about it from the standpoint of data breaches, applications, SQL injections, and so on.
Everything starts with a data catalog for classification and data integration. It revolves around “the identification of data elements — privacy, security, confidentiality, usability, context, allowed uses, or any other data protection-like requirements,” said Karen Lopez, Senior Project Manager and Architect at InfoAdvisors. She was speaking on the topic at the DATAVERSITY® Enterprise Data World Conference during her presentation titled Data Categorization for Data Architects.
A data catalog may live under other names, such as data classification, data sensitivity, data characterization, data inventory analysis, and data categorization. Data curation — the process of assembling, organizing, managing, and ensuring the usability of a collection of data assets with the goal of sharing — is also something worth implementing.
“What should a data architect be doing to assist their organizations in classifying data?” Lopez asked. And equally important, why should they be doing it? The General Data Protection Regulation (GDPR), the Personal Information Protection and Electronic Documents Act (PIPEDA), the California Consumer Privacy Act (CCPA), and the Family Educational Rights and Privacy Act (FERPA), for instance, all require classifying data in some way. “Data architects have to understand the nature of a piece of data, and a row of data, and an instance of data, and how it can be used together,” she said.
GDPR has been the real spur to date, given the potential fines for non-compliance. Senator Ron Wyden’s Consumer Data Protection Act proposal — a draft of which was released in early November — could take things even further. It suggests penalties of 10 to 20 years in jail for executives whose companies don’t adhere to the rules. The possibility of prison is a big motivator to take aggressive steps.
“Right now, you should be going to executives and saying, ‘I need six million dollars to keep your butt out of jail.’ This is our time to be asking for assets to do cataloging and categorization of data,” Lopez said.
The urgency isn’t just that data scientists spend 80 percent of their time sourcing, cleansing, and prepping data; it’s that serious privacy violations will cause serious harm to the company, the shareholders, the board of directors, the C-level people, and the customers.
Room for AI in Data Classification
Data classification generally happens by pointing a tool at a database, reverse-engineering it, and guessing what’s in that database based on the column names. That’s how you wind up with unwieldy column names like “retail transaction line item modifier event,” which only grow longer once developers get their hands on them.
“The attribute could become ‘retail transaction line item modifier event reason code,’ and you’ve got to keep track of the reasons someone overrode the line item on every retail transaction,” Lopez pointed out.
What she wants to see are semantic-based ways that will make it easy to tell what a column means in context; for example, that CST means “cost” in one context, and “customer” in another. “If you had to go back to your organization and do a data discovery of every piece of data that your enterprise uses and find the definition of it, how long would that take you?” Lopez asked.
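The name-based guessing Lopez describes, plus the context-dependent disambiguation she wants, can be sketched in a few lines. Everything here is hypothetical — the abbreviation table, the keyword list, and the schema contexts are made up for illustration, not drawn from any real tool:

```python
# Hypothetical sketch: classify columns by name, resolving ambiguous
# abbreviations (like CST) based on the schema context they appear in.
ABBREVIATIONS = {
    "CST": {"BILLING": "cost", "CRM": "customer"},  # context -> meaning
}
SENSITIVE_KEYWORDS = {"ssn", "salary", "credit_card", "email", "customer"}

def expand(column: str, context: str) -> str:
    """Expand an abbreviated column name using its schema context."""
    meanings = ABBREVIATIONS.get(column.upper())
    return meanings.get(context, column) if meanings else column

def classify(columns: dict[str, str]) -> dict[str, bool]:
    """Flag columns whose expanded name suggests sensitive data."""
    result = {}
    for column, context in columns.items():
        meaning = expand(column, context)
        result[column] = any(k in meaning.lower() for k in SENSITIVE_KEYWORDS)
    return result

print(classify({"CST": "CRM", "ORDER_TOTAL": "BILLING"}))
# {'CST': True, 'ORDER_TOTAL': False} -- CST in the CRM schema means "customer"
```

The same column name flips from non-sensitive to sensitive depending on context, which is exactly why a purely name-based scan falls short.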
An AI-based tool would be ideal, where column names and content can be determined by algorithms; for instance, identifying that three digits, a dash, two digits, a dash, and four digits is a Social Security number. Along with that, she’d like an agent that automatically alerts on risky behavior — say, a user creating a spreadsheet full of credit card and Social Security numbers — so it can be prevented.
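The pattern-matching half of that idea is straightforward to sketch with regular expressions. This is an illustrative toy only — the patterns are deliberately loose (a real scanner would validate check digits and handle many more formats) — but it shows the kind of content-based detection Lopez describes:

```python
import re

# Illustrative sketch: detect likely Social Security and credit card
# numbers in free-form cell values, the kind of check an automated
# agent could run before a spreadsheet leaves the building.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")    # e.g. 078-05-1120
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){13,16}\b")  # loose 13-16 digit run

def scan_cells(cells):
    """Return labels for sensitive patterns found in the given values."""
    findings = set()
    for value in cells:
        if SSN_PATTERN.search(value):
            findings.add("ssn")
        if CARD_PATTERN.search(value):
            findings.add("credit_card")
    return findings

print(sorted(scan_cells(["notes", "078-05-1120", "4111 1111 1111 1111"])))
# ['credit_card', 'ssn']
```

A scanner like this runs against data values rather than column names, which is what lets it catch sensitive data hiding in ad hoc spreadsheets and free-text fields.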
Cheaper Ways to Do Data Cataloging
Many companies still aren’t using the current generation of data cataloging tools because of high costs. One way around that, Lopez suggested, is to pursue a lightweight catalog tool and start at the physical level, reverse-engineering the databases that will give the most bang for the buck. And certainly, don’t try to catalog everything in one big project; work incrementally.
One of the catalogs she works with is available from Microsoft Azure (Lopez gave full disclosure that she is a Microsoft MVP). The cloud service requires no licensing, only the download of a small web utility for connecting to data sources. Architects point it at a source, and it reverse-engineers the tables, columns, and any metadata in the database; they can then add tags and comments to the data.
Right now this supports Microsoft SQL Server, SQL Data Warehouse, SQL DB, Oracle, Hadoop HDFS and Hive, Teradata, MySQL, SAP HANA, and Salesforce. There’s also a generic ODBC driver just like those in data modeling tools. It keeps track of the source, what type of object it is, when the catalog was last updated, and when the data was registered in the catalog.
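The fields the article says the catalog tracks for each asset can be pictured as a simple record. This sketch is not Azure Data Catalog’s actual schema — the class and field names are invented for illustration — it just shows what a minimal catalog entry carrying those fields might look like:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Illustrative only -- not a real Azure Data Catalog schema. The fields
# mirror what the article says the catalog tracks per registered asset.
@dataclass
class CatalogEntry:
    source: str             # e.g. "SQL Server", "Oracle", "Hive"
    object_type: str        # e.g. "table", "view", "column"
    name: str
    registered_at: datetime  # when the asset entered the catalog
    last_updated: datetime   # when the catalog entry was last refreshed
    tags: list = field(default_factory=list)  # user-added annotations

now = datetime.now(timezone.utc)
entry = CatalogEntry("SQL Server", "table", "dbo.Customer", now, now, ["PII"])
print(entry.name, entry.tags)
# dbo.Customer ['PII']
```

Keeping registration and update timestamps separate matters in practice: a stale `last_updated` is a signal that the catalog has drifted from the source database.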
Microsoft also has announced the Azure Data Catalog managed service. The vendor describes this as a way for any user to discover, understand, and consume data sources. Data Catalog includes a crowdsourcing model of metadata and annotations. It is a single, central place for all of an organization’s users to contribute their knowledge, and build a community and culture of data.
“We tried to do all this data cataloging and data classification stuff in the ’80s, the ’90s, the aughts, and it always got cut from funding,” Lopez said. But now, she thinks, “you are going to have people sending you money, asking you how you’re going to use it to keep them out of jail, right?”
Steps to Take
Data Modeling, she says, is very similar to data cataloging because it’s about reverse-engineering to understand what assets you have. Lopez says data architects should ask vendors in this market to provide richer features for keeping track of PII and other sensitive data, so that they can start classifying that data correctly in their data models. “This is where it should be, right? This is important metadata,” she said. “I think it should be much more complex. It should be multi-dimensional, meaning it can be both PII and GDPR.”
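The multi-dimensional classification Lopez asks for amounts to letting each column carry a set of labels rather than a single flag. A minimal sketch, with column and label names invented for illustration:

```python
# Sketch of multi-dimensional classification: each column carries a *set*
# of labels, so one column can be both PII and GDPR-relevant at once.
classifications = {
    "email_address": {"PII", "GDPR"},
    "birth_date":    {"PII", "GDPR", "HIPAA"},
    "order_total":   set(),
}

def columns_tagged(label: str) -> list:
    """All columns carrying a given classification label, sorted by name."""
    return sorted(c for c, tags in classifications.items() if label in tags)

print(columns_tagged("GDPR"))
# ['birth_date', 'email_address']
```

With sets as the unit of classification, a query like “every column subject to GDPR” or “every column that is both PII and HIPAA-regulated” is a one-liner — the kind of lookup a modeling tool with rich classification features would support natively.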
Question whether vendors support the most recent privacy and security features of their target DBMSs, too. Tell them you are looking for a tool that captures modern security and privacy features in the data models rather than leaving that work to the DBA.
Finally, anyone on Team Data should ask for data-related GDPR training.
“We have a role here, and I’d like to inspire you to go get training, go ask your modeling tool vendors to do rich features, and get playing with a catalog,” she said.
Check out Enterprise Data World at www.enterprisedataworld.com