Click to learn more about author Michael Blaha.
Identity is the means for denoting individual entities, so that they can be found. Identity is a prominent concern for databases because developers must have some way for referring to things. There are two aspects of identity – how to connect records together (internal identity) and how to obtain data in responding to queries (external identity).
Internal Identity
Internal identity deals with connections among records. Primary keys identify records within tables and foreign keys bind records together. A primary key is a unique combination of one or more attributes that is used to preferentially access the records in a table. A foreign key is a reference to a primary key. There are two basic approaches to primary keys in a data model.
- Surrogate keys. A system-generated attribute is added to each entity type table and made the primary key. This attribute is often an artificial number. The primary key for each relationship table consists of identifiers from the related entity types.
Surrogate keys have the advantage of being a single attribute, small, and uniform in size. Most relational database managers provide ID generators and allocate identifiers efficiently. Since the primary key is synthetic, it is immutable and there are no updates to foreign key references.
However, surrogate keys can make a database more difficult to read during debugging and maintenance. IDs also complicate database mergers; ID values may contend and need to be reassigned. Surrogate keys should only be used internally in applications and not be displayed to users. Unique combinations of application attributes can still (and should) be enforced — that is the purpose of alternate keys.
- Natural keys. A unique combination of application attributes is used to identify each entity. The primary key for each relationship table consists of primary keys from the related entity types.
Natural keys have different trade-offs. Primary keys have intrinsic meaning, making it easier to debug the database. A downside is that the value of application attributes can change — such changes must propagate to foreign keys. Some entity types lack unique application attributes. Some models have a series of dependent entity types that lead to unwieldy multi-attribute primary keys.
We recommend that you use surrogate keys for relational databases with more than 30 entity types. The resulting uniformity and simplicity outweighs any additional debugging effort. Both surrogate keys and natural keys are viable options for small applications. Data warehouses should always use surrogate keys.
External Identity
External identity is the ability to start from outside a database, specify the value of one or more attributes, and find data. Alternate keys are important because they are unique application values. Queries often start with alternate keys, natural keys, or sometimes non-unique values, and navigate the connections of internal identity to find desired data.
In SQL queries most joins traverse the connections of internal identity, binding foreign keys to their primary keys. In contrast where clauses specify external identity – the starting points of attributes for traversal – and other constraints.
In Conclusion
Pay close attention to identity as you construct data models. Normally, you should use a surrogate key as the primary key of each entity type table. It’s also important to define referential integrity so that your foreign key references really do exist. Otherwise you may have dangling references that cause you to lose data in queries. It’s also important to define alternate keys. Alternate keys provide entry points to a data model so that users can specify application values and obtain the data that they need.