Successful Data Management needs a Data Inventory and a Data Catalog. Operating a business without these is like running a restaurant without a menu, where the waiters and waitresses ask each patron to choose a platter from whatever vegetables, meats, and fruits happen to be on hand. An establishment like this would not last long.
Big Data has become the bread and butter of business, yet many companies do not know what data they have or where it is located. The result is frustrated business users and regulators, who are handed raw data (the meats, fruits, and vegetables) and left to do their jobs (eat a dinner) with it. This pattern needs to change. As April Reeve noted in her Enterprise Data World 2017 Conference presentation, titled The Data Catalog – The Key to Managing Enterprise Data Big and Small, a Data Catalog must be implemented as an additional piece of the distributed data environment.
Companies have tried taking stock of and indexing their data in the past. Drawing on her 25-plus years of work as an Enterprise Architect and program manager, Reeve acknowledged that Data Catalogs and Data Inventories have been attempted, sometimes successfully and other times less so. She said that in the “1980s and 1990s, the movement towards Data Catalogs as well as Metadata” had not resulted in a lot of successful projects. As a result, Data Catalogs fell by the wayside.
According to Reeve, Principal at Reeve Consulting, organizations now must have a Data Catalog to keep up with managing Big Data and Enterprise Data, and must therefore take a different approach from what has been done before.
Metadata in the Data Catalog and Data Inventory
Metadata elements make up the Data Catalog and the Data Inventory. Reeve defined Metadata as “the information we have to collect concerning the data that we are collecting.” She broke Metadata down into three types:
- Business Metadata: Provides the meaning of data by defining terms in “every-day language without regard to technical implementation,” said Reeve. Business Metadata, she noted, answers the questions “Where does [the data] fit in with our business? What is the relationship of [data] to other data in the organization?” This forms the flesh and bones of the Data Catalog, a repository of what data can be found in a company.
- Technical Metadata: Provides information on the format and structure of the data as needed by computer systems. As Reeve said:
“When people talk about Metadata, they tend to be talking about Technical Metadata. Any kind of environment [Databases, Data Modeling Tools, programming environments] [are] going to have a [technical] Metadata Repository that is attached to that tool.”
Business Intelligence, Artificial Intelligence and humans use technical Metadata. Technical Metadata provides a basis for a Data Inventory.
- Operational Metadata: Provides an “audit trail of information about where the data came from and who created it,” defined Reeve. Every heavily regulated industry, from finance to pharmaceuticals and healthcare, must have Operational Metadata, “generated and captured when a process executes.” Reeve stated that “Operational Metadata allow administrators to manage the system and ensure [processes] run smoothly.” Operational Metadata forms a critical component of the Data Inventory.
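The three Metadata types above can be pictured as three kinds of record attached to the same data. The sketch below is only an illustration: the class names, fields, and sample values are assumptions made for this example, not structures Reeve presented.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class BusinessMetadata:
    """Meaning of the data in everyday language (feeds the Data Catalog)."""
    term: str
    definition: str
    related_terms: list[str] = field(default_factory=list)


@dataclass
class TechnicalMetadata:
    """Format and structure as needed by computer systems (feeds the Data Inventory)."""
    system: str
    table: str
    columns: dict[str, str]  # column name -> data type


@dataclass
class OperationalMetadata:
    """Audit trail captured when a process executes."""
    source: str
    created_by: str
    executed_at: datetime


# Hypothetical sample values describing one "Customer" data set.
customer_meaning = BusinessMetadata(
    term="Customer",
    definition="A person or organization that has purchased from us.",
    related_terms=["Order", "Account"],
)
customer_structure = TechnicalMetadata(
    system="warehouse",
    table="dim_customer",
    columns={"customer_id": "INTEGER", "full_name": "TEXT"},
)
customer_audit = OperationalMetadata(
    source="crm_export",
    created_by="nightly_load_job",
    executed_at=datetime(2017, 4, 3, 2, 0, tzinfo=timezone.utc),
)
```

Keeping the three records separate reflects the point that each type answers a different question: what the data means, how it is stored, and where it came from.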
Data Catalogs and Data Inventories: Now Critical Data Management Components
Reeve asked, “How can you say you are managing your data if you don’t have an inventory of what data you have?” She explained:
“A fundamental concept in terms of doing Data Management is that you actually have to have a list of what data you have, where it is, and what it means.”
Unfortunately, according to Reeve, newer open source technologies, most notably Hadoop and Hive, do not have inherent capabilities to handle Business, Technical, and Operational Metadata requirements. Firms cannot afford this gap as they confront a variety of technologies for Big Data storage, noted Reeve. It makes it difficult for Data Managers to know where the data lives.
The increasing popularity of open source and Hadoop technologies has:
“Led analyst organizations, including Gartner Group and Forrester, to [predict a] lack of built-in Metadata repositories in Big Data technology will be the biggest risks in success for Big Data projects,” said Reeve.
Reeve noted that “through 2018, 80 percent of Data Lakes will not include effective Metadata Management capabilities, making them inefficient.”
Reeve said that if “Hadoop is in the center of [all the Data then] companies need additional capabilities to handle Metadata requirements.” Data Managers “need to answer that problem of where so we get the Metadata and fill in the gap in the Data Architecture.”
According to Reeve, this means addressing not only the traditional Metadata questions (where to capture the Metadata, business meaning, technical structure and format, and audit trails), but also automating Metadata capture to handle high volumes and quickly consolidating Metadata from a variety of sources.
Reeve’s mantra stated: “[Know] the structure of data before running it in production.” She acknowledged that:
“When data is first brought into the analytic environment, a Data Discovery process needs to go on. First thing, actually look at the data and see what’s in it and if it might meet a [business] need.”
However, do not operationalize Big Data without understanding its structure.
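A Data Discovery pass of this kind can start very simply: look at a sample and record which fields appear, what types they hold, and how often values are missing. The following is a minimal sketch under that assumption; `profile_records` and the sample rows are hypothetical and stand in for whatever profiling tool an organization actually uses.

```python
from collections import defaultdict


def profile_records(records):
    """Summarize, per field, the observed value types and the count of None values."""
    profile = defaultdict(lambda: {"types": set(), "missing": 0})
    for record in records:
        for name, value in record.items():
            if value is None:
                profile[name]["missing"] += 1
            else:
                profile[name]["types"].add(type(value).__name__)
    return dict(profile)


# Hypothetical sample pulled from newly landed data.
sample = [
    {"id": 1, "amount": 9.5, "region": "EU"},
    {"id": 2, "amount": "12", "region": None},
]
profile = profile_records(sample)
for name, stats in profile.items():
    print(name, sorted(stats["types"]), "missing:", stats["missing"])
```

In this sample the `amount` field holds both a number and a string, which is exactly the sort of structural surprise worth finding before the data reaches production.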
A Data Catalog provides a required data solution. “People struggle to get access to data and need to be enabled to get the data they want, appropriately.” A Data Catalog provides a place to find out “what data is available, what does it look like, and how is the data formatted,” explained Reeve.
Data Management needs:
“Processes for data consumers to identify available data, request access, receive appropriate review and approval, and have data provisioned, without taking huge amounts of time and IT effort.”
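The process Reeve describes (identify, request, review and approve, provision) can be read as a small state machine. The sketch below is an assumption-laden illustration: the `Status` values, the `AccessRequest` class, and the sample names are invented here and do not come from the presentation.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    REQUESTED = "requested"
    APPROVED = "approved"
    DENIED = "denied"
    PROVISIONED = "provisioned"


@dataclass
class AccessRequest:
    """A data consumer's request for one item chosen from the catalog."""
    user: str
    catalog_item: str
    status: Status = Status.REQUESTED

    def review(self, approved: bool) -> None:
        # Only a pending request can be approved or denied.
        if self.status is not Status.REQUESTED:
            raise ValueError("only a pending request can be reviewed")
        self.status = Status.APPROVED if approved else Status.DENIED

    def provision(self) -> None:
        # Data is handed over only after an approval.
        if self.status is not Status.APPROVED:
            raise ValueError("data is provisioned only after approval")
        self.status = Status.PROVISIONED


request = AccessRequest(user="analyst_1", catalog_item="Customer Orders")
request.review(approved=True)
request.provision()
```

Encoding the allowed transitions keeps the review step from being skipped while still letting the whole flow run without manual IT effort.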
A Data Inventory provides the first step toward creating a Data Catalog.
Differences Between Data Inventories and Data Catalogs
Reeve stated that:
“A Data Inventory is a physical list of what data you have and where it is located. It tends to be more on the Technical Metadata side, but it also may say what the business meaning of that data [is].”
She contrasted this with the Data Catalog: a business view of the data someone may have access to, which may even drill down to a technical description. “A Data Catalog is a menu of what data is available from which a user selects and, if access approved, data is provisioned.”
Given that Data Inventories and Data Catalogs describe different dimensions, “an item in the catalog can be multiple places in the inventory and vice versa, an item in the inventory can be referred to multiple places in the catalog,” reasoned Reeve. The Metadata aspects presented by the Data Inventory and Data Catalog are different.
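The relationship Reeve describes is many-to-many, which can be modeled as a list of links indexed in both directions. This is a minimal sketch; the catalog item names and inventory locations are hypothetical.

```python
from collections import defaultdict

# (catalog item, inventory location) pairs; all names are hypothetical.
links = [
    ("Customer", "crm.customers"),
    ("Customer", "warehouse.dim_customer"),
    ("Customer Orders", "warehouse.dim_customer"),
    ("Customer Orders", "warehouse.fact_orders"),
]

# Index the same links from both sides of the relationship.
locations_by_item = defaultdict(set)
items_by_location = defaultdict(set)
for item, location in links:
    locations_by_item[item].add(location)
    items_by_location[location].add(item)

# One catalog item found in several inventory locations...
print(sorted(locations_by_item["Customer"]))
# ...and one inventory location referenced by several catalog items.
print(sorted(items_by_location["warehouse.dim_customer"]))
```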
Creating the Data Inventory and the Data Catalog
She stressed that “integrating Metadata is a very hard thing to do.” Businesses face “multiple sources of information that they want to bring together and connect relationships between that Metadata to one another.”
Data Integration vendors and ETL tools, such as those from Informatica or IBM, attempt to solve these Data Inventory Management issues through specific technology suites. By focusing heavily on the Technical Metadata, these systems become:
“Very focused on the technology of doing the extracts and integration, and lose a lot of the business point of putting a catalog on top of that inventory providing a service,” said Reeve.
To successfully create a Data Inventory and Data Catalog, “focus on the end goal, automate as much as possible, but don’t just get caught up in the technology,” recommended Reeve. Reliance on handcrafted Metadata needs to be minimized.
Reeve said, “we can’t handcraft this stuff anymore. It is just not possible. The volumes are just too big. The variety is too much. Integrating Metadata together should be as automated as possible.”
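One way to see what automated collection looks like is to read Technical Metadata straight from a database's own system catalog instead of typing it in by hand. The sketch below uses an in-memory SQLite database purely as a stand-in (the presentation does not name SQLite); `harvest_technical_metadata` and the sample tables are hypothetical.

```python
import sqlite3


def harvest_technical_metadata(conn):
    """Return {table: [(column, declared_type), ...]} read from SQLite's own catalog."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    inventory = {}
    for table in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        inventory[table] = [(col[1], col[2]) for col in columns]
    return inventory


# Hypothetical tables standing in for a real source system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER, total REAL)")
inventory = harvest_technical_metadata(conn)
for table, columns in inventory.items():
    print(table, columns)
```

Because a function like this can run on a schedule, the inventory stays current without anyone maintaining it manually, which is the kind of periodic automatic update Reeve calls for.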
Given this, “Technical Metadata must be updated on a periodic automatic basis, but Business Metadata may not be updated as frequently.” As a final thought, she said:
“There needs to be processes for data consumers to identify available data, request access, receive appropriate review and approval and have data provisioned without taking [massive] amounts of time and IT effort.”
Check out Enterprise Data World at www.enterprisedataworld.com