I was privileged to deliver a workshop at Enterprise Data World 2024. Publishing this review is a way to express my gratitude to the fantastic team at DATAVERSITY, and to Tony Shaw personally, for organizing this prestigious live event. Part 1 of this article covered the key data governance takeaways from Enterprise Data World 2024. Part 2 explores the subject area of data architecture and modeling.
Data Architecture and Modeling
Data architecture and data modeling are among the least consistently defined concepts in the data management community worldwide.
The TOGAF® Standard, the most well-known framework for enterprise architecture, considers data architecture to be one of four types of enterprise architecture. These four types are business, data, application, and technology architecture. It defines data architecture as “A description of the structure of the enterprise’s major types and sources of data, logical data assets, physical data assets, and data management resources.” In this view, data modeling deliverables are a part of data architecture.
At the same time, DAMA-DMBOK2 has separated data architecture from data modeling.
In the data management community, people often take yet another approach: they replace three of these types (data, application, and technology architecture) with the single term “data architecture.”
Before reading further, you should settle on your own definition of data architecture. Considering the number of unaligned definitions in the data management community worldwide, I advise my customers to create a data management glossary for their companies to ensure clear communication. I hope you will follow this advice as well.
Data architecture and modeling were among the most discussed subjects at EDW 2024. One reason is that enterprise architecture significantly impacts all other data management capabilities, such as data governance, data quality, and analytics. Another is that “data mesh,” “data fabric,” and “data product” have become buzzwords over the last several years.
Let’s have a deeper look at the topics related to these two subject areas.
Data Modeling
According to DAMA-DMBOK2, data modeling is “a process of discovering, analyzing, and scoping data requirements, and then representing and communicating these data requirements in a precise form called the data model. This process is iterative and may include a conceptual, logical, and physical model.”
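To make these three levels concrete, here is a minimal, hypothetical sketch: the conceptual level names business concepts and their relationships, the logical level adds attributes and keys while staying technology-neutral, and the physical level renders the design for one concrete platform. The Customer/Order example and all names are illustrative assumptions, not material from the conference.

```python
# Hypothetical illustration of the three classic model levels.

# Conceptual: business concepts and how they relate (no attributes yet).
conceptual = {
    "concepts": ["Customer", "Order"],
    "relationships": [("Customer", "places", "Order")],  # one-to-many
}

# Logical: entities with attributes and keys, still technology-neutral.
logical = {
    "Customer": {"attributes": ["customer_id (PK)", "name", "email"]},
    "Order": {"attributes": ["order_id (PK)", "customer_id (FK)", "order_date", "total"]},
}

# Physical: the logical model rendered for one concrete platform (here, SQL DDL).
physical_ddl = """
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    email       TEXT
);
CREATE TABLE "order" (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    order_date  TEXT,
    total       REAL
);
"""
```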
There are several takeaways to pay attention to:
- Data modeling practices are undergoing several significant changes driven by industry trends. According to Steve Hoberman, the key trends are “Mainstream NoSQL projects, Knowledge graphs getting the spotlight, and growing AI.” These trends require new skills from data modelers, including leveraging AI. Steve thinks that at some point, “AI can become a data modeler.”
- According to the classical approach, an enterprise data model includes three types of data models: conceptual, logical, and physical. However, in practice, many companies move away from this approach and tend to document and link semantic and physical data models.
- According to the classical approach, an organization should develop one enterprise data model. However, this is hardly realizable in organizations with complex business models and domains. The reality and the new types of data architecture (business domain and data mesh) lead to the necessity of developing data models per data domain and then linking these multiple models with each other, usually at the semantic layer.
- According to Pascal Desmarets, domain-driven data modeling has several benefits: it helps “reconcile business and IT through a shared understanding of the context and meaning of data,” its “smaller data models are the blueprints for complex applications,” and it “can be derived in any technology.”
- Developing conceptual and semantic models is highly recommended. However, you should be aware of some challenges associated with these two types of models.
The first challenge is the relationships between these models. DAMA Dictionary defines a semantic model as “a conceptual data model that provides structure and defines meaning for non-tabular data, making that meaning explicit enough that a human or software agent can reason about it.” Thus, we may conclude that a semantic model is a type of conceptual model that includes semantic information.
The second challenge is that DAMA-DMBOK2 and the TOGAF® Standard disagree on the key components of the conceptual/semantic model. DAMA-DMBOK2 treats “business subject areas,” “concepts,” or “phenomena” as its core components. The TOGAF® Standard treats data entities as the core components of a conceptual model, whereas in the DAMA-DMBOK2 approach data entities belong to a logical model.
Steve Hoberman, a data modeling guru, defines this type of model as a “business terms model” that captures business concepts, their definitions, and relationships.
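As an illustration, here is a minimal, hypothetical Python sketch of such a business terms model: each term carries a definition plus named relationships to other terms, which is precisely the semantic information that lets a human or a program reason over the model. The terms and relationships are invented for the example.

```python
from dataclasses import dataclass, field

@dataclass
class BusinessTerm:
    """One entry in a business terms (semantic) model."""
    name: str
    definition: str
    # Named relationships to other terms, e.g. ("is a", "Party").
    relationships: list[tuple[str, str]] = field(default_factory=list)

glossary = {
    "Party": BusinessTerm("Party", "A person or organization of interest to the business."),
    "Customer": BusinessTerm(
        "Customer",
        "A party that has purchased, or may purchase, our products or services.",
        relationships=[("is a", "Party"), ("places", "Order")],
    ),
    "Order": BusinessTerm("Order", "A request by a customer to purchase products or services."),
}

# A human or a program can follow the relationships and reason about them.
for verb, target in glossary["Customer"].relationships:
    print(f"Customer {verb} {target}")
```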
In any case, developing conceptual models, according to Laurel Sturges and Kasi Anderson, has multiple benefits such as:
- Aligning and simplifying business terminology and facilitating communication across different business domains
- Identifying business scope and needs for data-related initiatives
- Empowering collaboration between diverse data users
Generative AI will significantly impact data modeling. According to Nvidia CEO Jensen Huang, “Generative AI will be more of an operating system, and humans can tell computers in plain language to create applications.” Huang also said that large language models (LLMs) will help humans run their ideas through computers.
Different Types of (Data) Architecture
I have already discussed the challenge of defining data architecture at the beginning of Part 2 of this article.
The key topic discussed at EDW 2024 regarding data architecture was analyzing and comparing different types of integration architecture for various business needs.
Several key takeaways follow:
- So far, companies have implemented the following data integration architectures: data warehouse, data lake, data lakehouse, data mesh, and data fabric.
Each of these architectures has its advantages, disadvantages, and areas of applicability.
- Data warehouse architecture stores data from multiple sources in a central repository to be used for historical and trend analysis. It ensures a single version of the truth.
- A data lake holds raw data in its native format without modeling. It ensures quicker access to data, improves performance, and keeps all historical data.
- A data lakehouse combines the functionality of a data warehouse and a data lake; the sketch below contrasts the two underlying approaches.
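The practical difference between the first two patterns is often summarized as schema-on-write versus schema-on-read. The following hypothetical Python sketch contrasts them using standard-library stand-ins: a SQLite table plays the warehouse (structure enforced at load time), and a directory of JSON files plays the lake (structure applied only at read time); a lakehouse aims to offer both behaviors over one store.

```python
import json, sqlite3, pathlib

# "Warehouse": schema-on-write -- structure is enforced at load time.
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, amount REAL NOT NULL)")
dw.execute("INSERT INTO sales VALUES (?, ?)", (1, 99.50))  # must fit the schema

# "Lake": schema-on-read -- raw records land as-is, in native format.
lake = pathlib.Path("lake")
lake.mkdir(exist_ok=True)
(lake / "event1.json").write_text(json.dumps({"sale_id": 2, "amount": "120.00", "note": "raw"}))

# Structure (types, field selection) is applied only when the data is read.
for f in lake.glob("*.json"):
    record = json.loads(f.read_text())
    print(record["sale_id"], float(record["amount"]))
```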
- According to Gartner, data fabric is “an emerging data management design for attaining flexible, reusable and augmented data integration pipelines, services and semantics.” Data fabric adds further technologies (e.g., metadata management) on top of the data lakehouse.
However, a data fabric is not a product you can buy. Implementing one requires integrating different technologies, products, and functionalities.
- Data mesh is a decentralized approach to managing data that covers different layers of architecture: data, application, and technology. The approach rests on several core principles: domain ownership, data as a product, self-service infrastructure, and federated computational governance (sketched below). However, according to James Serra, many concerns regarding data mesh remain, such as “no standard definition of data mesh, huge investments required for its implementation, performance problems of combining data from different domains, etc.”
Efrain Rodriguez demonstrated the progress the U.S. Department has made in applying the data mesh concept to its practices.
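To make the four mesh principles tangible, here is a minimal, hypothetical Python sketch: domain teams publish their data products through a shared, self-service registry, while one federated rule, agreed across domains, is checked computationally at publication time. All names and the contract rule are illustrative assumptions.

```python
# Self-service platform stand-in: a shared registry that domain teams publish to.
registry: dict[str, dict] = {}

def publish(domain: str, name: str, product: dict) -> None:
    """Federated computational governance: one shared rule, enforced in code."""
    if "contract" not in product:           # "data as a product" minimum bar
        raise ValueError(f"{domain}.{name}: a data product must ship with a contract")
    registry[f"{domain}.{name}"] = product  # domain ownership: publisher is accountable

publish("sales", "orders", {"owner": "sales-team", "contract": "v1"})
publish("logistics", "shipments", {"owner": "logistics-team", "contract": "v1"})
print(sorted(registry))  # ['logistics.shipments', 'sales.orders']
```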
- According to John O’Brien, the choices regarding the required data integration architecture must be made based on aligning data architecture principles and best practices, business needs in analytics solutions, and technology and vendor products. As James Serra said, “No one-size-fits-all architecture” exists.
- Data architecture for operational systems differs from data integration architecture.
According to Dave Wells, the problem with the current approach is that all data integration architectures discussed above focus on analytical data, while operational systems have complex, many-to-many connections. Examples of operational systems architecture are operational data stores and hubs.
Data architecture for operational systems must ensure proper data integration and interoperability. Data interoperability enables the sharing and exchange of operational data, the reconciliation of master data, and the preparation of analytical data.
According to Dave Wells, interoperability across all data types requires a standard language for clear, unambiguous communication. A semantic layer is the means to create such communication. Establishing a semantic layer requires corresponding semantic data modeling and graph technologies: data ontologies and taxonomies form the basis for building knowledge graphs (see the sketch below).
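As a minimal illustration of that idea, the hypothetical sketch below uses the third-party rdflib package to express a tiny taxonomy and one relationship as triples, then asks a SPARQL question against the resulting graph; a real semantic layer would put far richer ontologies behind the same mechanism. The vocabulary is invented for the example.

```python
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.com/vocab/")
g = Graph()

# A tiny taxonomy: Customer is a kind of Party.
g.add((EX.Party, RDF.type, RDFS.Class))
g.add((EX.Customer, RDFS.subClassOf, EX.Party))
g.add((EX.Customer, RDFS.label, Literal("Customer")))

# One relationship and one instance, forming a minimal knowledge graph.
g.add((EX.places, RDF.type, RDF.Property))
g.add((EX.alice, RDF.type, EX.Customer))
g.add((EX.alice, EX.places, EX.order42))

# Any system that shares the vocabulary can ask unambiguous questions.
query = """
SELECT ?who WHERE { ?who a <http://example.com/vocab/Customer> . }
"""
for row in g.query(query):
    print(row.who)  # -> http://example.com/vocab/alice
```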
- Progress in developing data mesh architecture led to the concepts of data products, data contracts, and APIs (application programming interfaces) for data (product) sharing.
As with data mesh, the concept of a “data product” has no single agreed definition in the data management community. According to Dave Wells, a data product can consist of components from four levels (a sketch follows the list):
- Data (logical and/or physical level)
- Corresponding metadata of different types
- Processing (e.g., data mapping, security, and access authorization)
- Interface (e.g., a UI for humans and an API for software)
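A hypothetical Python sketch of those four levels might look as follows; the field names and the example product are assumptions for illustration, not a standard structure.

```python
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    # Level 1 -- data: where the logical/physical data lives.
    data_location: str
    # Level 2 -- metadata: descriptive, structural, administrative, etc.
    metadata: dict = field(default_factory=dict)
    # Level 3 -- processing: mappings, security, access authorization.
    processing: dict = field(default_factory=dict)
    # Level 4 -- interface: how humans (UI) and software (API) consume it.
    interface: dict = field(default_factory=dict)

orders = DataProduct(
    data_location="warehouse.sales.orders",
    metadata={"owner": "sales-team", "update_frequency": "daily"},
    processing={"access": "role:analyst", "mapping": "crm_orders -> orders"},
    interface={"api": "https://example.com/api/orders", "ui": "orders dashboard"},
)
print(orders.metadata["owner"])
```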
Data products developed by ExxonMobil also have complex structures, including data pipeline orchestration, data processing, modeling and analysis, data quality, data storage, mapping on the service layer, etc.
A data contract acts as a service-level agreement for sharing and delivering data products from data providers to data consumers. It covers data structure, format, semantics, quality, and terms of use. Data contracts can be in human- or machine-readable format; a minimal machine-readable example is sketched below.
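A machine-readable contract can be as simple as a structured document that both sides validate against. The hypothetical sketch below encodes the elements just listed as a Python dictionary and performs a trivial conformance check; all field names and rules are invented for illustration.

```python
# Hypothetical machine-readable data contract covering the elements above.
contract = {
    "product": "sales.orders",
    "schema": {"order_id": "int", "amount": "float"},   # structure
    "format": "parquet",                                # format
    "semantics": {"amount": "Order total in EUR, tax included"},
    "quality": {"completeness": "order_id must never be null"},
    "terms_of_use": "Internal analytics only; refresh daily by 06:00 UTC",
}

def conforms(record: dict, contract: dict) -> bool:
    """Trivial check: the record carries exactly the agreed fields."""
    return set(record) == set(contract["schema"])

print(conforms({"order_id": 7, "amount": 19.99}, contract))  # True
print(conforms({"order_id": 7}, contract))                   # False: missing field
```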
- Data architecture projects must deliver multiple artifacts to ensure project success.
According to Clay Rehm, documentation must include data flow and conversion diagrams, system integration diagrams, source-to-target documentation, conceptual, logical, and physical data models, etc.
It is worth noting that these artifacts apply to both current and target data architectures. A minimal source-to-target mapping sketch follows.
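Source-to-target documentation, for instance, can be kept in a simple machine-readable form. The hypothetical sketch below records, for each target column, where the value comes from and how it is transformed; the mappings are invented for illustration.

```python
# Hypothetical source-to-target mapping for one target table.
source_to_target = [
    # (source system.column,    transformation,           target column)
    ("crm.cust.CUST_ID",        "cast to integer",        "dw.customer.customer_id"),
    ("crm.cust.FNAME + LNAME",  "concatenate with space", "dw.customer.name"),
    ("crm.cust.EMAIL",          "lowercase, trim",        "dw.customer.email"),
]

for source, rule, target in source_to_target:
    print(f"{source:26} --[{rule}]--> {target}")
```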
- Data architecture has strongly impacted modernizing data analytics and data science.
Many data analytics projects have failed because of complexity. According to Mark Madsen, there are three types of complexity: technical (too many tools), usage (moving from reporting to data science and AI), and data (lack of proper data management and governance). He mentioned several approaches to reducing complexity, such as decomposing the technical architecture into subsystems that change at different rates, separating applications from the infrastructure, etc.
In Part 3, I will share key takeaways on data science and other data management capabilities.