This is the second of a two-part series about data model quality. This blog discusses quality for data warehouses. The prior month’s blog discussed quality for operational applications.
Data warehouses have a much different architecture and different business motivations than operational applications. For example, operational applications manage the day-to-day data needed to support the business. In contrast, data warehouses restructure operational data and place it in a format amenable to data mining and deep analysis. Operational applications rapidly read and write transactions with small amounts of data. In contrast, users only read from data warehouses, often with extensive queries over large data sets that run for multiple minutes.
Here are some critical factors for a high-quality data warehouse data model.
- Adherence to the data warehouse metaphor. Generally, a warehouse should place descriptive data in dimensions and numeric measures of business performance in facts. Business queries can then summarize, slice, and dice the fact data (see the schema sketch after this list).
- Use of conformed dimensions. A conformed dimension has a consistent meaning throughout a data warehouse. It integrates comparable dimensions from different applications into a single concept. Conformed dimensions increase data consistency and enable queries across facts.
- Use of dimension roles. Define a dimension once and distinguish multiple references with role names. Do not define a new dimension for each reference to a concept or thing. For example, a fact table can have an order date and a pick-up date, both referring to the same Date dimension.
- Minimal snowflaking. A snowflake is a network of tables that partially or fully normalize a dimension. Normalization is a virtue with operational applications, but often a vice with data warehouses. Snowflakes run counter to the data warehouse principle of denormalizing dimensions to simplify data structure and improve performance. A snowflake can reduce disk space, but the savings are usually negligible because most of a warehouse's disk space is consumed by facts. Consider snowflakes only for monster dimensions with many records that consume 10 percent or more of the warehouse's disk space.
- Inclusion of the atomic grain. The atomic grain is the lowest level of detail possible for a fact. Always include the atomic grain, even when summary data is also stored to speed performance. Data at the atomic grain can logically answer any business query; aggregate grains are purely a performance optimization (see the roll-up query after this list).
- Fully documented codes. Make sure that encoded dimension attributes are supplemented with verbose attributes containing meaningful text. Dimension codes can be important to some business staff, but they are often cryptic and incomprehensible to others. You can have it both ways by storing codes in dimensions along with verbose attributes that spell out their meaning.
- Consistent use of surrogate keys. A surrogate key is a unique, meaningless identifier that serves as a primary key. A warehouse should define a surrogate key for each dimension. We also recommend that each fact have a surrogate key.
Surrogate keys differ from natural keys. A natural key is the identifier of a table from an operational application. A data warehouse stores natural key attributes but treats them as ordinary data. A data warehouse has surrogate keys in addition to natural keys because dimension data can change over time. For example, a person’s address can change, leading to multiple dimension records, and thus multiple surrogate key values, for the same natural key. It is a mistake to base the primary key of a warehouse table on a natural key.
- Selective use of slowly changing dimensions. A slowly changing dimension stores a history of values. Use slowly changing dimensions only for dimensions whose data actually changes over time. You complicate a warehouse and slow performance if you define slowly changing dimensions where they are not needed (see the sketch after this list).
- Reasonable number of dimensions per fact. Be wary of facts with too few or too many dimensions. Ralph Kimball advises that a data warehouse should normally have 5–15 dimensions per fact. This is a guideline, not a hard and fast rule. Too few dimensions may indicate that a warehouse lacks detail. Too many dimensions is often a sign that some dimensions are not completely independent and could be combined into a single dimension.
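To make several of these points concrete, here is a minimal star-schema sketch. It is written as SQLite DDL run from Python so it is self-contained and runnable; all table and column names (fact_order_line, dim_date, dim_customer, and so on) are illustrative assumptions rather than schemas from a real warehouse. The sketch puts descriptive data in dimensions and numeric measures in the fact, references a single Date dimension in two roles, gives every table a surrogate key, and stores natural keys and coded attributes as ordinary, documented data.

```python
import sqlite3

# In-memory database so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")

conn.executescript("""
-- Date dimension: defined once, referenced by the fact in two roles.
CREATE TABLE dim_date (
    date_key      INTEGER PRIMARY KEY,  -- surrogate key
    calendar_date TEXT    NOT NULL,     -- natural key, stored as ordinary data
    month_name    TEXT    NOT NULL,
    year          INTEGER NOT NULL
);

-- Customer dimension: surrogate key plus natural key, and a coded
-- attribute supplemented with a verbose description.
CREATE TABLE dim_customer (
    customer_key  INTEGER PRIMARY KEY,  -- surrogate key
    customer_id   TEXT NOT NULL,        -- natural key from the source system
    customer_name TEXT NOT NULL,
    segment_code  TEXT NOT NULL,        -- e.g. 'RTL'
    segment_desc  TEXT NOT NULL         -- e.g. 'Retail customer'
);

-- Fact table at the atomic grain: one row per order line.
CREATE TABLE fact_order_line (
    order_line_key  INTEGER PRIMARY KEY,  -- surrogate key for the fact
    order_date_key  INTEGER NOT NULL REFERENCES dim_date (date_key),   -- role 1
    pickup_date_key INTEGER NOT NULL REFERENCES dim_date (date_key),   -- role 2
    customer_key    INTEGER NOT NULL REFERENCES dim_customer (customer_key),
    quantity        INTEGER NOT NULL,     -- numeric measures of performance
    sales_amount    REAL    NOT NULL
);
""")
```

Because order_date_key and pickup_date_key both reference dim_date, queries can slice the fact by either role without defining a second, redundant date dimension.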
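The surrogate-key and slowly-changing-dimension points work together. Continuing the same in-memory connection, here is a hedged sketch of one common history pattern (often called Type 2), again with made-up names: when a customer's address changes, the warehouse expires the old row and inserts a new one, so the same natural key ends up with multiple surrogate key values.

```python
conn.executescript("""
-- Slowly changing dimension: each change inserts a new row with a new
-- surrogate key; the natural key repeats across versions.
CREATE TABLE dim_customer_scd (
    customer_key INTEGER PRIMARY KEY,  -- surrogate key, one per version
    customer_id  TEXT NOT NULL,        -- natural key, same across versions
    address      TEXT NOT NULL,
    valid_from   TEXT NOT NULL,
    valid_to     TEXT,                 -- NULL marks the current version
    is_current   INTEGER NOT NULL DEFAULT 1
);
""")

# Customer C-100 moves: expire the old version, then insert the new one.
conn.execute("INSERT INTO dim_customer_scd VALUES (1, 'C-100', '12 Elm St', '2023-01-01', NULL, 1)")
conn.execute("UPDATE dim_customer_scd SET valid_to = '2024-06-30', is_current = 0 WHERE customer_key = 1")
conn.execute("INSERT INTO dim_customer_scd VALUES (2, 'C-100', '9 Oak Ave', '2024-07-01', NULL, 1)")
```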
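Finally, because the fact is stored at the atomic grain, any summary can be derived by rolling it up. Here is a sketch of such a roll-up against the illustrative schema above; an aggregate table holding the same result would only be a performance optimization.

```python
# Roll the atomic-grain fact up to yearly sales per customer segment.
rows = conn.execute("""
    SELECT d.year, c.segment_desc, SUM(f.sales_amount) AS total_sales
    FROM fact_order_line AS f
    JOIN dim_date     AS d ON d.date_key     = f.order_date_key
    JOIN dim_customer AS c ON c.customer_key = f.customer_key
    GROUP BY d.year, c.segment_desc
""").fetchall()
```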
In Summary
All data models should be built with quality in mind, be they operational, data warehouse, or other kinds of models. A data model must not only satisfy business requirements but should express them in a manner conducive to fast development, efficient execution, and the ability to respond to business changes.