Complex data models have become the norm. A single stream of data can travel through many hubs and many different technologies: the front end, APIs, Kafka pub/sub systems, Lambda functions, ETL pipelines, data lakes, data warehouses, and more. Riding within this stream of data is the schema, and each schema carries its own technology-specific terminology and syntax, as well as its own data types and life cycle. As a result, the job of data modelers has become significantly more difficult over time.
Another complicating factor is that data modelers must also understand the target technology. Data is now consumed in many more ways, including machine learning, natural language processing, artificial intelligence, and blockchain. The world of data has become dramatically more complex than it was in the past, and the range of choices among NoSQL databases and communication protocols adds yet another layer of complexity to Data Modeling.
Physical Data Models Get Complicated
Physical data models present an image of a data design that has been implemented, or is going to be implemented, in a database management system. A physical data model is database-specific, representing relational data objects (columns, tables, primary and foreign keys) as well as their relationships. Physical data models can also generate DDL (data definition language) statements, which are then sent to the database server.
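To make this concrete, here is a minimal sketch in Python of how a simple table definition might be rendered into a DDL statement. The entity, column names, and rendering logic are hypothetical illustrations, not output from any particular modeling tool.

```python
# Hypothetical sketch: rendering part of a physical model as DDL.
# The table, columns, and types are illustrative, not taken from any real model.

columns = [
    ("customer_id", "INTEGER", "PRIMARY KEY"),
    ("email", "VARCHAR(255)", "NOT NULL"),
    ("created_at", "TIMESTAMP", "NOT NULL"),
]

def render_create_table(table_name, cols):
    """Build a CREATE TABLE statement from (name, type, constraint) tuples."""
    body = ",\n    ".join(f"{name} {sql_type} {constraint}"
                          for name, sql_type, constraint in cols)
    return f"CREATE TABLE {table_name} (\n    {body}\n);"

print(render_create_table("customer", columns))
# CREATE TABLE customer (
#     customer_id INTEGER PRIMARY KEY,
#     email VARCHAR(255) NOT NULL,
#     created_at TIMESTAMP NOT NULL
# );
```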
Implementing a physical data model requires a good understanding of the characteristics and performance parameters of the database system. For example, when working with a relational database, it is necessary to understand how columns, tables, and the relationships between them are organized. Whatever the type of database (columnar, multidimensional, or something else), understanding the specifics of the DBMS is crucial to implementing the model. According to Pascal Desmarets, Founder and CEO of Hackolade:
“Historically, physical Data Modeling has been generally focused on the design of single relational databases, with DDL statements as the expected artifact. Those statements tended to be fairly generic, with fairly minor differences in functionality and SQL dialects between the different vendors. But at the scale used by large enterprises today, these models get complex.”
Nowadays, enterprises have embraced modern IT architectures based on APIs and microservices, including complex communication protocols with message queuing, remote procedure calls, and more. “They perform polyglot persistence using different types of databases, with specialty databases from a variety of NoSQL vendors,” said Desmarets. Each of these databases has a very different storage model. Data is also consumed in many more ways, with machine learning, natural language processing, artificial intelligence, blockchain, and others, so the environment is getting drastically more complex than in the past.
“It used to be that they were just generating DDLs, and that was fairly simple in terms of target technology. Now the data modelers need to understand and integrate the characteristics of each technology so the physical data model can truly leverage the respective benefits.”
Polyglot Data Models
A “polyglot” is someone who knows and speaks many languages. The expression “polyglot data model” means that multiple database technologies are used to store and access specific types of data. With a polyglot data model, data services can use and interact with different database technologies, offering multiple ways of handling and accessing data.
However, many organizations are working with traditional logical models that fall short of this goal. There is a need for new data models that represent both data at rest and data in motion. Modern data contains complex nested data types and can be polymorphic, so translating a traditional logical model into each of the very different physical schemas required by different technologies takes far more effort.
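To see why the translation takes more effort, consider a hedged example of the kind of nested, polymorphic document a modern application might store (the field names below are hypothetical). A document database can persist this structure as-is, while a relational physical model would typically split it into several related tables.

```python
# Hypothetical example of a nested, polymorphic document.
# A relational physical model would typically split this into customer,
# address, and payment_method tables; a document store keeps it whole.

customer_document = {
    "customer_id": 42,
    "name": "Ada Lovelace",
    "addresses": [                      # nested array of sub-objects
        {"type": "billing",  "city": "London"},
        {"type": "shipping", "city": "Oxford"},
    ],
    "payment_method": {                 # polymorphic: shape depends on "kind"
        "kind": "card",
        "last4": "4242",
        # an alternative shape might be {"kind": "iban", "iban": "GB33..."}
    },
}
```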
Desmarets commented on polyglot data models, saying:
“We see that companies have accumulated conceptual and logical models to describe their business and the information systems of the enterprise. That’s a big investment they’ve made. Obviously, information architecture departments want to leverage that investment, even though the technology is evolving and becoming more complex.”
Based on feedback from his clients, he added:
“We think that there’s a need to expand the definition of a logical model. While it remains technology agnostic, a logical model should not just be the least common denominator of data definitions with the risk of making compromises to fit the most constraining technology.”
Complexity and Scale
The greater the variety of technologies in use, the more complex the organization and its physical data models become. Different departments within an organization can be viewed as links in a chain, with some using different technologies. Consequently, not all the links in the chain can be changed with one simple command, nor at the same time. Desmarets offered this comment:
“The ability to handle complexity and scale is another challenge. It seems that companies are having hundreds, sometimes thousands, if not tens of thousands, of APIs and microservices, and these are handled using dozens of different technologies. Their schemas are flying around, each of them having its own lifecycle.”
The number of microservices and APIs dictates the scale needed to operate efficiently. A few decades ago, when monolithic applications built on a three-tier architecture were popular, scale wasn’t much of a concern. Today’s systems, however, use a wide variety of services, and scaling the system to match those services is mandatory.
The Future of Data Modeling
The use of Data Modeling will become more and more important as the need grows to understand how a system works and how to manipulate it. Metadata (the data tags used to find data) will become a bigger priority for Data Modeling in 2020, due in part to its importance during the research process. Including metadata in the data model makes the data easier to visualize and establishes metadata’s importance in managing data.
When asked about the future of Data Modeling, Desmarets remarked:
“Our roadmap is made of two main tracks. One is for adding features that every data modeler expects and needs out of an application to perform data modeling, even for NoSQL, and schema design. At the same time, we’re adding support for target technologies, to satisfy the growing needs of our clients: more NoSQL databases, JSON in relational databases, big data analytics platforms, storage formats, cloud databases, communication protocols, etc.”
Hackolade is currently focused on creating this polyglot data model, which allows modelers to define a structure once and then conveniently generate schemas for each of these different technologies. Customers are facing this new challenge, so Hackolade has abstracted it into a lineup of upcoming features.
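As a rough illustration of the “define once, generate for many targets” idea (a hypothetical sketch, not Hackolade’s actual implementation or API), a single technology-agnostic entity definition could be rendered both as relational DDL and as a MongoDB-style JSON Schema validator:

```python
# Hypothetical sketch of "define once, generate for several targets".
# The entity, type mappings, and functions are illustrative only.

entity = {
    "name": "customer",
    "fields": [
        {"name": "customer_id", "type": "int",    "required": True},
        {"name": "email",       "type": "string", "required": True},
        {"name": "signup_date", "type": "date",   "required": False},
    ],
}

SQL_TYPES  = {"int": "INTEGER", "string": "VARCHAR(255)", "date": "DATE"}
BSON_TYPES = {"int": "int",     "string": "string",       "date": "date"}

def to_ddl(e):
    """Render the entity as a relational CREATE TABLE statement."""
    cols = ",\n    ".join(
        f"{f['name']} {SQL_TYPES[f['type']]}" + (" NOT NULL" if f["required"] else "")
        for f in e["fields"])
    return f"CREATE TABLE {e['name']} (\n    {cols}\n);"

def to_mongo_validator(e):
    """Render the entity as a MongoDB-style $jsonSchema validator document."""
    return {"$jsonSchema": {
        "bsonType": "object",
        "required": [f["name"] for f in e["fields"] if f["required"]],
        "properties": {f["name"]: {"bsonType": BSON_TYPES[f["type"]]}
                       for f in e["fields"]},
    }}

print(to_ddl(entity))
print(to_mongo_validator(entity))
```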
“Another project we’re working on,” he said, “maybe more tactical, is the ability to infer the schema of JSON stored in blobs of relational databases, leading to a more complete data model of semi-structured data.”
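A simplified sketch of what such inference involves is shown below, assuming the JSON blobs have already been read out of a relational column; the merging rules here are deliberately naive and are not Hackolade’s algorithm.

```python
import json

# Simplified sketch of schema inference over JSON blobs pulled from a relational column.
# Real tools handle nesting, arrays, and type conflicts far more carefully.

def infer_field_types(json_blobs):
    """Collect the set of Python type names observed for each top-level field."""
    schema = {}
    for blob in json_blobs:
        doc = json.loads(blob)
        for field, value in doc.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema

rows = [
    '{"order_id": 1, "total": 19.99, "notes": "gift wrap"}',
    '{"order_id": 2, "total": 5, "coupon": "SPRING"}',
]
print(infer_field_types(rows))
# e.g. {'order_id': {'int'}, 'total': {'float', 'int'}, 'notes': {'str'}, 'coupon': {'str'}}
```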
The intention is not at all to compete with the established data modeling tools, he remarked, but to complement them and add value for customers. Hackolade’s focus is on solving new challenges for organizations while leveraging their existing investments in these traditional tools.