Following the discussion of Data Architecture and Data Structures, the next questions are “What’s the difference between data schemas and data structures?” and “How are data schemas, data structures, and data models formally named?”[i]
Numerous attempts have been made over the years to formally name specific data structures and the data models containing those data structures. Most of those names are related in some manner to the Zachman Framework for Information Systems. However, confusion arises between naming the type of data structure, the subject area covered by the data structure, and the data model containing the data structure. The result is an array of confusing data structure and data model names that contribute to a disparate data resource.
A much better approach is to formally name each data schema, formally name data structures based on the data schema name and the subject area, and then formally name data models based on the data structure contained in the data model.
A schema is a diagrammatic representation; a structured framework or plan; an outline. A data schema is a diagrammatic representation of the data structure. A data schema is simply a type of data structure. A data structure is a representation of the arrangement, relationships, and contents of data in an organization’s data resource. It is directly related to formal data names, comprehensive data definitions, and precise data integrity rules, and must be documented. A data structure is a manifestation of a specific data schema for a specific purpose within an organization’s data resource.
As databases emerged in the mid-1900s, two data schemas were identified by database professionals. The internal data schema represented the way that data were stored in databases and the external data schema represented the way that data were used by applications outside the database. Note the name orientation toward physical databases.
Since the internal and external data schemas were often quite different, and multiple external data schemas could be developed for relatively few internal data schemas, a third data schema, the conceptual data schema, was defined as the common denominator between the internal and external data schemas. That term has been used, misused, and abused to the point that it is unclear today what a conceptual data schema actually represents.
To bring more meaning and understanding to these three data schemas, the internal data schema was renamed to a physical data schema, the external data schema was renamed to a business data schema, and the conceptual data schema was renamed to a logical data schema. These more meaningful names also established the sequence of business-to-logical-to-physical data schema development that should be followed in all formal data resource design.
As more business professionals became involved in data resource design and began using data normalization techniques, many questions arose as to what data were actually being normalized. The traditional technical and mathematical approaches to data normalization were not providing an adequate understanding for business professionals. Those questions were answered with an explanation that the business data schema were being normalized to logical data schema.
However, data normalization of the business data schema did not lead easily or directly to development of the logical data schema, because data normalization essentially split data apart. No formal techniques were available to put similar data together, such as all employee data together, all student data together, and so on. That lack of formal techniques was the beginning of the massive data disparity seen today in many public and private sector organizations.
The problem was resolved with the addition of a data view schema, which was the result of data normalization. The individual data view schemas were then combined as necessary to ensure that like data, such as employee, student, and so on, were grouped together. That data optimization process was intended to prevent much of the disparate data that were being developed. The resulting sequence was business data schema normalized to data view schema, which are optimized to logical data schema, which are denormalized to physical data schema.
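The grouping step described above can be sketched loosely in code. This is only an illustration under stated assumptions: the two data views, their subjects, and their data items are hypothetical examples, and the function name `optimize` is invented here to stand for the data optimization process rather than taken from any formal method.

```python
from collections import defaultdict

# Hypothetical data view schemas. Each maps a data subject to the
# data items that one normalized business data schema uses.
payroll_view = {"employee": {"name", "pay grade"}}
training_view = {
    "employee": {"name", "hire date"},
    "student": {"name", "course"},
}


def optimize(data_views):
    """Combine data views so each data subject appears only once,
    carrying the union of the data items found across all views."""
    logical = defaultdict(set)
    for view in data_views:
        for subject, items in view.items():
            logical[subject] |= items
    return dict(logical)


if __name__ == "__main__":
    for subject, items in optimize([payroll_view, training_view]).items():
        print(subject, sorted(items))
```

The point of the sketch is that employee data arriving from two separate data views end up in a single employee grouping, which is the property the data view schema was added to make possible.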
The four-schema sequence worked well until distributed data processing emerged, which caused confusion about how data were distributed and denormalized. That confusion was resolved with the addition of a deployment data schema between the logical data schema and the physical data schema. The logical data schema are deployed, through a process called data deoptimization, to the deployment data schema.
The resulting sequence is business data schema, normalized to data view schema, optimized to logical data schema, deoptimized to deployment data schema, and denormalized to physical data schema, as shown in the diagram below. These are the five basic data schema that are used, or should be used, today in formal data resource design.
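The five-schema sequence can also be written down as a small lookup table. The following is a minimal sketch, not a formal notation: the schema and process names come from the text, while the Python names `DetailSchema`, `NEXT_STEP`, and `design_path` are hypothetical.

```python
from enum import Enum


class DetailSchema(Enum):
    """The five detailed data schemas, in design order."""
    BUSINESS = "business data schema"
    DATA_VIEW = "data view schema"
    LOGICAL = "logical data schema"
    DEPLOYMENT = "deployment data schema"
    PHYSICAL = "physical data schema"


# Each schema maps to (process applied to it, schema that results).
NEXT_STEP = {
    DetailSchema.BUSINESS: ("normalized", DetailSchema.DATA_VIEW),
    DetailSchema.DATA_VIEW: ("optimized", DetailSchema.LOGICAL),
    DetailSchema.LOGICAL: ("deoptimized", DetailSchema.DEPLOYMENT),
    DetailSchema.DEPLOYMENT: ("denormalized", DetailSchema.PHYSICAL),
}


def design_path(start: DetailSchema) -> list[str]:
    """Describe every step from `start` down to the physical data schema."""
    steps = []
    current = start
    while current in NEXT_STEP:
        process, target = NEXT_STEP[current]
        steps.append(f"{current.value} {process} to {target.value}")
        current = target
    return steps


if __name__ == "__main__":
    for step in design_path(DetailSchema.BUSINESS):
        print(step)
```

Starting from the business data schema yields four steps, one per process; starting from the physical data schema yields none, since it is the end of the sequence.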
These five basic data schema worked well for the detail design of an organization’s data resource. However, the lexical challenge regarding the meaning and use of the conceptual data schema still persisted. Conceptual data schema could be the business schema, a generalization of the business schema, a generalization of the logical schema, anything that is an excuse to forge ahead with brute-force physical database development, and so on. Since data management was beginning to move to the business and the term conceptual data schema had little meaning, a term needed to be defined that was meaningful to business professionals.
The situation was resolved with the definition of a strategic data schema and a tactical data schema. These terms are very meaningful to business professionals, since they readily understand strategic and tactical activities. Simply, a strategic data schema represents an executive level perspective and a tactical data schema represents a management level perspective.
The strategic and tactical data schema are logical in nature and are based on the organization’s perception of the business world in which it operates. They are placed over the logical data schema, as shown in the diagram below. The strategic data schema can be developed in more detail to produce the tactical data schema, which can be developed in more detail to produce the logical data schema in a process known as data schema specialization. Similarly, logical data schema can be generalized to tactical data schema, which can be generalized to strategic data schema in a process known as data schema generalization.
The diagram shows two broad divisions of data schema. The general data schema include the strategic and tactical data schemas which are general in nature. The detailed data schema include the business, data view, logical, deployment, and physical data schemas which are detailed in nature. The complete arrangement is referred to as the three-tier five-schema concept.
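The two divisions, together with the specialization and generalization chain described above, can be sketched as follows. This is an illustrative sketch only: the names `CHAIN`, `GENERAL`, `DETAILED`, `specialize`, `generalize`, and `division` are hypothetical, and the sketch models only the strategic-tactical-logical chain that the text defines.

```python
# Illustrative sketch only; these Python names are hypothetical.

# The chain described in the text, from most general to most detailed.
CHAIN = ["strategic", "tactical", "logical"]

GENERAL = {"strategic", "tactical"}
DETAILED = {"business", "data view", "logical", "deployment", "physical"}


def specialize(schema: str) -> str:
    """Data schema specialization: step one level toward more detail."""
    if schema not in CHAIN or schema == CHAIN[-1]:
        raise ValueError(f"no specialization defined for {schema!r}")
    return CHAIN[CHAIN.index(schema) + 1]


def generalize(schema: str) -> str:
    """Data schema generalization: step one level toward less detail."""
    if schema not in CHAIN or schema == CHAIN[0]:
        raise ValueError(f"no generalization defined for {schema!r}")
    return CHAIN[CHAIN.index(schema) - 1]


def division(schema: str) -> str:
    """Return the broad division ('general' or 'detailed') of a schema."""
    if schema in GENERAL:
        return "general"
    if schema in DETAILED:
        return "detailed"
    raise ValueError(f"unknown data schema {schema!r}")
```

For example, `specialize("strategic")` returns `"tactical"` and `generalize("logical")` returns `"tactical"`, while asking to generalize the physical data schema raises an error because no such step is defined in this sketch.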
The question always arises whether there should be eight more general data schema representing the business, data view, deployment, and physical data schemas at the tactical and strategic levels. The answer is that they would probably be meaningless or less than useful as formal data schemas. Any generalization of the business data schema would be related to business subject areas or business functions. Any generalization of the physical data schema would be related to hardware and system software management, such as clusters of data files or databases. Generalizations of the data view schema and the deployment data schema would have no meaning. Therefore, these data schemas are not defined or developed.
Five detailed and two general data schemas have been formally named and defined above. These seven formal data schema names can be prefixed with the subject area to provide formal data structure names, such as facilities strategic data structure, vehicle tactical data structure, human resource business data structure, employee logical data structure, student physical data structure, and so on. Using these formal data structure names helps both business professionals and data management professionals properly design, develop, and manage the organization’s data resource.
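The naming convention just described amounts to a simple rule: subject area, then schema name, then the words "data structure". A minimal sketch of that rule follows. The function name `data_structure_name` and the tuple `SCHEMA_NAMES` are hypothetical, and how the data view schema should read inside a structure name is an assumption, since the text gives no example for it.

```python
# The seven formal data schema names from the text.
SCHEMA_NAMES = (
    "strategic", "tactical",                                     # general
    "business", "data view", "logical", "deployment", "physical",  # detailed
)


def data_structure_name(subject_area: str, schema: str) -> str:
    """Build a formal data structure name: subject area + schema name."""
    if schema not in SCHEMA_NAMES:
        raise ValueError(f"{schema!r} is not one of the seven data schema names")
    return f"{subject_area} {schema} data structure"


if __name__ == "__main__":
    print(data_structure_name("employee", "logical"))
    print(data_structure_name("facilities", "strategic"))
```

Rejecting any schema name outside the seven keeps informal labels, such as a conceptual data structure, from entering the names.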
A data model is more than just the data structure. A data model must include formal data names, comprehensive data definitions, and precise data integrity rules. When the data structure is combined with these other three components, the resulting data model is formally named using the same naming conventions. For example, the corresponding data model names would be facilities strategic data model, vehicle tactical data model, human resource business data model, employee logical data model, and student physical data model. Note that even though the strategic and tactical data models have less detail, they still contain the data name, data definition, and data integrity rule components.
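The idea that a data model is a data structure plus three further components can be sketched as a small record type. This is a hedged illustration, not a prescribed format: the class `DataModel`, its field names, and the `missing_components` check are hypothetical, and the field types are placeholders for whatever documentation an organization actually keeps.

```python
from dataclasses import dataclass, field


@dataclass
class DataModel:
    """A data structure combined with names, definitions, and rules."""
    subject_area: str
    schema: str                      # e.g. "logical" or "strategic"
    data_structure: str              # placeholder for the structure itself
    data_names: list[str] = field(default_factory=list)
    data_definitions: dict[str, str] = field(default_factory=dict)
    data_integrity_rules: list[str] = field(default_factory=list)

    @property
    def name(self) -> str:
        """Formal data model name: subject area + schema name."""
        return f"{self.subject_area} {self.schema} data model"

    def missing_components(self) -> list[str]:
        """List the required components that are still empty."""
        required = {
            "data names": self.data_names,
            "data definitions": self.data_definitions,
            "data integrity rules": self.data_integrity_rules,
        }
        return [label for label, value in required.items() if not value]


if __name__ == "__main__":
    model = DataModel("employee", "logical", "employee structure diagram")
    print(model.name)
    print(model.missing_components())
```

The same check applies to strategic and tactical data models: they carry less detail, but a model with any of the three components empty is, under this sketch, only a data structure.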
Data management professionals must formally name the data schema, the data structures, and the complete data models if they ever hope to properly design, develop, and manage an organization’s data resource. They must develop the formal data structure and data model names within a single organization-wide data architecture. They must use formal processes to normalize, optimize, deoptimize, and denormalize the data. They must use formal processes for data schema generalization and data schema specialization. To do otherwise leads to confusion, increased data disparity, and a data resource that does not adequately support an organization’s business information demand.