Click to learn more about author Akshay Pore.
In my previous blogs I outlined a framework to perform a fit analysis for choosing the right NoSQL database for your application:
- NoSQL Data Architecture & Data Governance: Everything You Need to Know
- How to Choose the Right NoSQL Database for Your Application?
After choosing a NoSQL database, the next task is to design a schema for the selected database. Designing a schema or developing a data model for NoSQL is a topic of immense debate. While data architects try to design NoSQL through a traditional data modeling paradigm (Conceptual > Logical > Physical Models), developers on the other hand, argue that one of their motivations in choosing NoSQL was that it is schema-less, which is not true. On the contrary, design is of out-most importance when it comes to NoSQL; as these technologies do not provide one-stop schema solution for all use cases such as RDBMS. Hence, there is a real need to develop a standard, yet customized approach to design all types of NoSQL databases by taking into consideration their similarities & differences.
In this article, I attempt to provide a standard approach to design schema for diverse NoSQL data model types. All of these NoSQL types are key: value at its core, distributed & designed for a certain set of use cases. This commonality provides us the basis for standardization, in a sense that all of them should be designed based on application query patterns. This is departure from RDBMS where we used to design for optimizing storage, in case of NoSQL we design for access. Hence, NoSQL follows ‘Query Driven Design’.
Query Driven Design
1. Application Workflows & Query Patterns
Since, NoSQL databases are ‘Schema for Read’, eliciting detailed application query patterns is a critical first step in the Query Driven Design for NoSQL. Schema design for NoSQL usually involves designing Keys, Indexes & Denormalization of attributes, all of which are inter-dependent on the application queries & workflows.
The query requirements elicitation should include following specifications at a minimum:
- Business Data Entities
You can leverage any existing Conceptual Models or Domain Driven Designs to elicit Data Entities & Domains and use them to group User Stories & Query Patterns. This enables us to anchor all query requirements by Entities & their attributes and helps in Data Governance later on.
- User Stories
Specify the business requirement for application which has implications for database.
- Query Patterns
Specify Query Patterns in simple English or pseudo-SQL if you are migrating from RDBMS to NoSQL and have requirement expressed as SQL statements or stored procedures. Pay special attention to the elicitation of query predicates as it provides lot of information on how we approach denormalization & key design. If you see same attribute appearing repeatedly in multiple query patterns, then it can be shortlisted for primary key consideration. A combination of attributes appearing in query patterns indicates possible indexing opportunity.
I recommend leveraging Agile practices such as requirements elicitation & analysis through User Stories for this purpose. A set of user stories should be elicited explicitly for NoSQL design & cross-referenced with application user stories that specify code development for the same business requirement. This helps us in coupling NoSQL schema design with that of the application query patterns & workflows. Since schema design in NoSQL is query driven & queries can change based on requirements, NoSQL design must be revisited and modified iteratively as needed. Hence, Agile approach should be adopted for NoSQL design.
2. Steps to Design Schema
The query patterns, conceptual model & list of attributes elicited above should give us all the ingredients to start with the schema design. Although a physical schema will vary widely based on the NoSQL type & database chosen, we can still standardize the approach based on following steps.
- Design Containers
As different NoSQL databases have custom names to table-like structures, I am calling them containers here. Such containers can be physical constructs such as Collections in case or MongoDB or Document Kinds in case of Couchbase.
The first step is to identify containers based on Query Patterns. A good rule of thumb is to build your containers based on an aggregated set of business rules enabling single functionality. This way, we align the container with a module/class of code written to implement a single function which helps in iterative development & maintenance. The key here is to strike a balance between designing a container for each query vs designing one container to satisfy multiple queries. Hence, Denormalization is of prime importance while designing NoSQL.
- Denormalize Data
You may have read time and again that denormalization is the key to NoSQL design, but how to denormalize data amongst containers is part art and part science. Here, I will try to provide some guidelines to approach this problem.
Denormalization can implemented using techniques such as Embedding/Flatting and Referencing.
- Embedding:For Document Type NoSQL, Embedding refers to denormalizing two entities by nesting one entity into another as a sub-document. The sub-document can be an object, array or an array of objects based on the cardinality between the two entities under consideration.
For Column Oriented NoSQL, such denormalization is carried out by designing attributes from the second entity as a flat list of columns (optionally grouped by Column Families) into the primary entity. The multi-cardinality attributes are designed using special data-types such as lists, sets, maps & optionally embedded JSON structures.
- Referencing:This design pattern is more normalized as it refers to storing primary key of one entity as a reference into another entity. But since most NoSQL does not support database joins, this reference is not same as PK-FK relationship as there are no constraints.
The purpose of the reference here is to enable application side joins, where the application shall query the second entity using the primary key of the first entity. This usually is a multiple-step process without any transaction benefits.
These Embedding & Referencing techniques can be used to implement core design patterns as follows:
- 1-1 Relationship
- Identify primary entity based on elicited query pattern
- Denormalize by embedding child entity in the parent entity as a sub-document object or a flat list of attributes
- 1-N Relationship
- Identify the primary entity: 1-side or N-side; based on the query pattern
- Choose Embedding if there are FEW attributes on the N-side entity & are FREQUENTLY accessed together:
- If 1 is the primary entity, embed attributes from N-side entity as an array of sub-docs (Doc-Type NoSQL) or using multi-cardinality data types (Column Oriented NoSQL)
- If N is the primary entity, embed attributes from 1-side as a sub-document object or a flat list of attributes
- Choose Referencing if there are MANY attributes in N-side entity & are RARELY accessed together:
- If 1 is the primary entity, include references to the primary key of N-side entity as an array or multi-cardinality data type like map
- If N is the primary entity, include reference to the primary key of 1-side entity as an attribute
- M-N Relationship
- Identify primary entity based on the query pattern
- Choose embedding or referencing based on guidelines from 1-N pattern
- If the app requires data both ways M-N and N-M, create 1 container for each pattern
- Design Primary Keys
As I mentioned above, NoSQL databases are key: value at their core and they are distributed, which means primary key tends to be the fastest way to locate your data on the cluster & access it. Additionally, NoSQL have limited to no support for database joins, views & stored procedures and secondary indexes come with some cost. Hence, designing efficient Primary Keys is even more important in context of NoSQL.
Due to differences in NoSQL database architectures, it is imperative to understand how a particular database implements its Primary Keys. Knowing underlying mechanisms is critical in efficient design.
For Example, Column Oriented NoSQL databases such as Cassandra & HBase have Row Keys as Primary Keys. But the Primary Key for Cassandra takes the form of a composite key comprising of a Partition Key (K) + Clustering Column (C) where K locates the data on the cluster and is stored as a hash-map whereas C is a sorted Map providing uniqueness to the entire Primary Key. Hence, the key is actually implemented as nested sorted map of maps.
Another example is Couchbase which is a Document Type database, it does not have a concept of physical table, documents are stored directly in a bucket (like a database) but can be grouped using logical Document Kinds identified by value of a type field. The Document Key is designed using a concatenated string of values of type & business key fields and is stored as a separate hash-map on the cluster. Hence, understanding the implementation of Primary Keys is the first step in designing them efficiently.
I have listed a few design considerations for NoSQL Primary Keys as follows:
- Since Primary Keys are fastest way to access NoSQL data, use business relevant attributes as Primary Keys. Using Surrogate Keys or Timestamps should be minimized as you may not know the value of such key during reads to do a K: V lookup.
- Analyze your query pattern requirements and shortlist attributes that appear frequently as predicates. Examine such attributes to determine if they would provide uniqueness, if yes, consider them as candidates for Primary Keys.
- Categorize you query patterns into high SLA/critical queries and secondary/data mining queries and design Primary Keys to satisfy the critical queries.
- In cases where joining Containers is unavoidable, design Primary Keys to enable application side joins. For Example, instead of using heavy indexes, design lookup container storing references for multiple entities and then using Primary Keys to access individual entities. The application can then access the lookup Container using PK access, extract the list of references and use them as PKs to access other Containers in a multi-threaded fashion. Such access will still be faster than one Container with multiple complex Indexes.
- Design Indexes
Most of NoSQL databases provide secondary Indexes with some exceptions such as HBase. Different NoSQL database provide different kind of Indexes such as regular B-Tree, Array, Geo-spatial & Full Text Search/inverted Indexes. Hence, finding what indexes are available for the NoSQL database you are designing helps. Regardless of the type, Indexes should be designed to implement non-critical/data mining-like queries.
Here are some of Index design considerations:
- Examine the query patterns and create a consolidated list of attributes listed as predicates for queries belonging to same container. Use this list to design efficient composite covering Indexes. Try to avoid creating individual indexes for each query. This is especially important in context of iterative/agile nature of NoSQL modeling as more query patterns tend to emerge during consecutive sprints, and instead of re-designing existing Indexes, more individual Indexes are added overtime. Such practice builds a lot of pressure on Indexers and database performance degrades over the course of the project.
- Design Array Indexes only when all or most of the attributes in the array are required for the query. If you need only one or two attributes in the array, consider moving such attributes outside of the array or duplicate such attributes outside the array and design a simple/composite Index of them. Try to keep size of the array to a minimum if you plan to Index it.
- If you require special Indexes such as FTS (Full Text Search), try to avoid such Indexes on attributes with complex datatypes such as arrays. Such Indexes take 2-3x space so pay attention to the Index size as well. If you require FTS enabled on multiple attributes For Example Address, City, State etc., design a derived attribute with a concatenated value from all of the above attributes such as Full Address and create FTS on it. Derived attributes can also be designed for other cases as well, such as attributes requiring FTS as well as value searches or aggregations etc.
In summary, designing NoSQL is very different that designing RDBMS as the later has a set of standard database agnostic steps in form of Normalization & Data Modeling paradigms to guide the design process. This is my attempt to provide some standard guidelines to look for while designing different types of NoSQL databases but in-short when asked How to design schema your NoSQL? the answer for the most part remains…It depends!
*NOTE: Thoughts expressed in this blog are the author’s and not his present or previous employer’s.