Speaking at the DATAVERSITY® Enterprise Data World 2016 Conference, Jeremy Posner, Senior Director of Data Management and Strategy at Synechron, offered the Enterprise Data Catalog as a useful tool in three use-cases: eDiscovery, Records Management, and Data Sourcing Integration Points. Posner says that the temporal database of an Enterprise Data Catalog can serve as a “time machine” to show the state of your data at any point in time, and this provides multiple benefits.
What is an Enterprise Data Catalog?
Posner defines an Enterprise Data Catalog as an inventory of data assets. He says it is a corporate resource where data can be found and a repository of Metadata about data stores, complete with locators to find the data and information about who is responsible for it and who has access to it. “This is not a Data Dictionary. It’s not a field-level technical Metadata Repository. It’s a catalogue of the data assets at the container level.”
What are the Data Assets Involved?
Data stores are not the hardware or the software but “are actually containers – containers of your firm’s information,” he says. They include databases – structured or unstructured, file systems, including HDFS, content management systems, virtual data rooms, email servers, voice archives, Cloud stores, and physical stores. He says that most of the data within the corporation is unstructured. This includes data on all devices and everywhere that data can exist within your organization, including BYOD data, in the Cloud, in your pocket, and in the Enterprise Data Center.
What Can an Enterprise Data Catalog Offer?
While actual implementation may vary, Posner says the Data Catalog serves as a single place for an organization to inventory its data assets. Key functions include certification of data assets, an ability to see which hardware and software support the data, which retention schedules are applied, where copies, backups and archives exist, any legal holds on the data, support for “defensible disposal” requirements, and authoritative sourcing. He says that the Catalog can be extended to provide business-level (not technical) data lineage and access management information.
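As a rough illustration of what a container-level catalog record might hold, here is a minimal Python sketch. The class name, field names, locator format, and the disposal rule are all assumptions made for this example; the talk does not describe a concrete schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DataStoreEntry:
    """One container-level catalog record (hypothetical schema)."""
    name: str
    locator: str                      # where the data store can be found
    business_owner: str               # responsible on the business side
    technology_owner: str             # responsible on the technology side
    retention_schedule: Optional[str] = None
    legal_holds: List[str] = field(default_factory=list)          # open case IDs
    copies: List[str] = field(default_factory=list)               # copies, backups, archives
    supporting_systems: List[str] = field(default_factory=list)   # hardware and software

    def may_dispose(self) -> bool:
        # Assumption: disposal needs a retention schedule and no open legal hold.
        return self.retention_schedule is not None and not self.legal_holds


crm = DataStoreEntry(
    name="customer-crm",
    locator="db://crm-prod/customers",   # made-up locator
    business_owner="Head of Sales Operations",
    technology_owner="CRM Platform Team",
    retention_schedule="7-years",
    copies=["backup://vault-1/crm"],
    supporting_systems=["crm-db-server-01"],
)

print(crm.may_dispose())            # True: schedule set, no holds
crm.legal_holds.append("CASE-001")
print(crm.may_dispose())            # False: a legal hold blocks disposal
```

Note that the record describes the container, not its fields or rows, which matches the distinction Posner draws between a catalog and a Data Dictionary.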
What is the Need for an Enterprise Data Catalog?
Posner focuses on legal and regulatory drivers for knowing “where your data is, which legal jurisdiction it’s governed by, and how it’s classified.” He says that companies should be able to prove that they know:
- Location of data
- Legal constraints on the data
- Data taxonomy
- Who is responsible, both on the business side and the technology side
- If data is a golden source, a copy, or an approved distributor of golden data
- What retention policies exist and that policies are being applied correctly with respect to archiving and disposal
- Where the data is used and its lineage
- Who has access to it and when it was accessed
Uses for the Data Catalog in Records Management and eDiscovery
“Generally, the technology organization creates the application or the device, and the business unit creates the content, and eventually the content is archived and disposed, so it’s like a life cycle,” he says. The Enterprise Data Catalog tracks when the content is created and when it’s removed.
Using a slide showing the interplay among various components of a typical system, he illustrated how, in response to a legal query, business units can use the Data Catalog to quickly find needed information about the creation of content, archiving and disposal policies, history, and who has had access to that data.
The typical flow for an eDiscovery query is:
- A regulator asks for some information, and the request goes to legal.
- Legal opens a case, then asks IT operations to put what they call a “legal hold” on the information.
- A legal hold means the business doesn’t dispose of the information until the investigation is over.
The key point about the eDiscovery use case is that it requires historical information. This is very important because these cases often go back six months, twelve months, maybe even two to three years, so an organization needs to know that the information in its Data Catalog and data environment was correct at a certain point in time.
He says that having “the ability to do time travel,” through a database of temporal information that makes clear who had access to the data and when, is very important in the eDiscovery process. He says that Records Management also needs to know that data has a retention policy attached to it.
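The “time travel” idea can be sketched as an as-of query over records that carry a validity period. This is only a minimal Python illustration under assumed names (`AccessGrant`, `who_had_access`, and the sample data are hypothetical), not the design Posner’s team built.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class AccessGrant:
    """An access right with the period during which it was in effect."""
    user: str
    data_store: str
    valid_from: date
    valid_to: Optional[date] = None   # None means the grant is still in effect

    def active_on(self, day: date) -> bool:
        # Half-open interval: valid_from is included, valid_to is excluded.
        return self.valid_from <= day and (self.valid_to is None or day < self.valid_to)


def who_had_access(grants: Iterable[AccessGrant], data_store: str, as_of: date) -> List[str]:
    """Return the users who could access data_store on the given day."""
    return sorted({g.user for g in grants
                   if g.data_store == data_store and g.active_on(as_of)})


grants = [
    AccessGrant("alice", "trade-archive", date(2014, 1, 1), date(2015, 6, 1)),
    AccessGrant("bob", "trade-archive", date(2015, 3, 1)),
    AccessGrant("carol", "hr-files", date(2013, 1, 1)),
]

print(who_had_access(grants, "trade-archive", date(2015, 4, 15)))  # ['alice', 'bob']
print(who_had_access(grants, "trade-archive", date(2016, 1, 1)))   # ['bob']
```

The design choice that matters here is that a grant is closed by setting `valid_to` rather than being deleted, so a question about a date two or three years back can still be answered.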
Uses for the Data Catalog in Data Sourcing
Posner showed how the Enterprise Data Catalog can be used as a registration point for data authority:
“Large organizations have a big problem with managing their information. There’s far too much data copied around the organization. The big problem we have is, where do you go to get the right information, the golden source? Where should I go to get information on customers [when] it exists in 20 different databases? So the challenge here is identifying which data store is authoritative for each kind of information, and keeping it updated.”
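One way to picture the catalog as a registration point is a small registry that allows exactly one authoritative store per kind of information and lists known copies separately. The Python sketch below is an assumption-laden illustration; `SourceRegistry`, its methods, and the store names are invented for the example.

```python
from typing import Dict, List, Optional


class SourceRegistry:
    """Tracks which data store is authoritative for each kind of information."""

    def __init__(self) -> None:
        self._golden: Dict[str, str] = {}
        self._copies: Dict[str, List[str]] = {}

    def register_golden(self, data_kind: str, store: str) -> None:
        # Assumption: only one golden source per kind; conflicts are rejected.
        existing = self._golden.get(data_kind)
        if existing is not None and existing != store:
            raise ValueError(f"{data_kind!r} already has golden source {existing!r}")
        self._golden[data_kind] = store

    def register_copy(self, data_kind: str, store: str) -> None:
        self._copies.setdefault(data_kind, []).append(store)

    def golden_source(self, data_kind: str) -> Optional[str]:
        return self._golden.get(data_kind)

    def copies(self, data_kind: str) -> List[str]:
        return list(self._copies.get(data_kind, []))


registry = SourceRegistry()
registry.register_golden("customer", "crm-prod")
registry.register_copy("customer", "marketing-warehouse")
registry.register_copy("customer", "billing-db")

print(registry.golden_source("customer"))   # crm-prod
print(registry.copies("customer"))          # ['marketing-warehouse', 'billing-db']
```

A consumer who needs customer data would ask the registry first instead of picking one of the many copies.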
Data Catalog Challenges
Getting all the information into the catalog is a challenge, he says. It requires hardware assets, and firm-wide data policies need to be in place at the highest level early in the process.
“It really all starts with policy. Unless you’ve got a policy in place at the firm’s highest level, then people won’t be incentivized to keep this information up to date, to be the stewards of this information, and ultimately the quality will suffer.”
The governance model can also be a challenge. Posner says that companies need to decide how to keep the information within the catalog up to date, and to create policies and processes based on different responsibilities for different parts of the Data Catalog. He says, “The guys in Records Management will be interested in making sure that the retention policies are up to date, but the people within our operations – that’s not their area. So you need to have a shared responsibility governance model.” Regulation can drive these processes, and usually requires that the Enterprise Data Catalog is fresh, maintained, and complete.
Certification of Data Assets
Every six to twelve months, it’s important to certify that the information in the Data Catalog is correct, he says. This can be done using a diary that shows all assets, how they’re related, what servers and hardware they live on, and what software is used. “It’s relating the hardware and the software to the actual data assets.” Posner adds that it’s also important to know which retention schedules are applied. He stresses that frequency of re-certification should be based on the value of the data: “The more critical information that you have in your catalog, then the more often that you’ll want to make sure that the re-certification takes place.”
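The rule that more critical information is re-certified more often can be sketched as a simple schedule. In the Python example below, the two criticality levels and their intervals (roughly six and twelve months) are assumptions chosen to match the range mentioned in the talk, and the function names are hypothetical.

```python
from datetime import date, timedelta

# Assumed intervals: about six months for critical assets, twelve for the rest.
RECERT_INTERVAL_DAYS = {"high": 182, "normal": 365}


def next_certification(last_certified: date, criticality: str) -> date:
    """Return the date by which an asset should be re-certified."""
    return last_certified + timedelta(days=RECERT_INTERVAL_DAYS[criticality])


def is_overdue(last_certified: date, criticality: str, today: date) -> bool:
    """True if the re-certification date has already passed."""
    return today > next_certification(last_certified, criticality)


last = date(2016, 1, 1)
print(next_certification(last, "high"))              # 2016-07-01
print(next_certification(last, "normal"))            # 2016-12-31
print(is_overdue(last, "high", date(2016, 9, 1)))    # True
print(is_overdue(last, "normal", date(2016, 9, 1)))  # False
```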
Architecture and Delivery Options
Although Posner’s team decided to have a custom three-tier application built (Web, Java, and database), there are off-the-shelf Data Repositories that can be bought, configured, and extended. He says the key requirement is a temporal database that is accurate enough to withstand a legal query, and most of the products he has seen so far don’t offer that option. Because there are non-technical users of the system, he says a good user interface is also important, as is robust integration capability.
As Posner says, companies hoping to invest in a time machine can now do so – but to be safe, he recommends calling it an Enterprise Data Catalog.