Data Lakes are consolidated, centralized storage areas for raw, unstructured, semi-structured, and structured data, taken from multiple sources and lacking a predefined schema. Data Lakes were created to save data that “may have value.” The value of data, and the insights that can be gained from it, are unknown in advance and vary with the questions being asked and the research being done.
Without a screening process, Data Lakes can support “data hoarding.” A poorly organized Data Lake is referred to as a Data Swamp.
Data Lakes allow Data Scientists to mine and analyze large amounts of Big Data. Big Data, which had been used for years without an official name, was labeled by Roger Magoulas in 2005. He was describing a large amount of data that seemed impossible to manage or research using the traditional SQL tools available at the time. Hadoop, which became a top-level Apache project in 2008, provided the distributed storage and processing framework needed for locating and working with unstructured data at massive scale, opening the door for Big Data research.
In October 2010, James Dixon, founder and former CTO of Pentaho, coined the term “Data Lake.” Dixon argued that Data Marts come with several problems, ranging from size restrictions to narrow research parameters. Describing his concept of a Data Lake, he said:
“If you think of a Data Mart as a store of bottled water, cleansed and packaged and structured for easy consumption, the Data Lake is a large body of water in a more natural state. The contents of the Data Lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”
Data Marts
In the early 1970s, ACNielsen offered their clients a Data Mart to store information digitally and enhance their sales efforts. A “Data Mart” is an archive of stored, normally structured data, typically used and controlled by a specific community or department. It is normally smaller and more focused than a Data Warehouse and, currently, is often a subdivision of a Data Warehouse. Data Marts were the first evolutionary step toward the physical reality of Data Warehouses and Data Lakes.
At present, there are three basic types of Data Marts:
- Independent Data Marts are not part of a Data Warehouse and are very similar to the original Data Mart offered by ACNielsen. They are typically focused on a single line of business or subject area. Data can be taken from both external and internal sources; it is then translated, processed, and loaded into the Data Mart, where it is stored until needed.
- Dependent Data Marts are built into an existing Data Warehouse. A top-down approach is used: all data is stored in a centralized location, and a clearly defined section of it is then selected for research (see the sketch after this list).
- Hybrid Data Marts combine data taken from a Data Warehouse with data from “other” sources. This can be useful in a variety of situations, including providing ad hoc integration for a new group or product that has been added to an organization. Hybrid Data Marts are well-suited for multiple-database environments and provide fast implementation turnaround. These systems make data cleansing easy and work well with smaller data-centric applications.
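To make the distinction concrete, here is a minimal sketch of a dependent Data Mart in Python. The file names and columns (region, date, product, revenue) are hypothetical, chosen only to illustrate the top-down selection of a clearly defined slice of centralized warehouse data for one department.

```python
import pandas as pd

# Centralized warehouse: the single store of all data (hypothetical file).
warehouse = pd.read_csv("warehouse_sales.csv")

# Top-down selection: the marketing department's dependent mart holds only
# its own region and the columns it actually analyzes.
marketing_mart = warehouse.loc[
    warehouse["region"] == "EMEA", ["date", "product", "revenue"]
]

# The mart is stored separately, smaller and more focused than the warehouse.
marketing_mart.to_csv("marketing_mart.csv", index=False)
```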
Data Silos
Data Silos are similar to Data Marts, but much more isolated. Data Silos are insulated management systems that cannot work with other systems. A Data Silo contains fixed data that is controlled by one department and cut off from other parts of the organization. Silos tend to form within large organizations because individual departments have different goals and priorities, and they also form when departments compete with one another instead of working as a team toward common business goals.
A few decades ago, storing a customer’s data in a silo was considered a good idea. At the time (the late 1980s and early 1990s), silos were evolving alongside each new departmental technology, and the additional security of near-total isolation seemed reasonable.
Data Silos often store “incompatible data” that is considered important enough to translate later. (Data Marts, by contrast, often contain only translated data.) For many organizations, a significant amount of data was stored this way for later translation. Eventually, Data Silos became useful as a data source for Big Data processing.
The Business Dictionary describes a “silo mentality” as a mindset that exists when departments or sectors within an organization decide they do not want to share their information with the rest of the organization. The results of this behavior are generally considered to have a negative impact on organizations. Two in-house silos storing the same data may hold differing content, raising questions about which version is accurate and how current the data in each silo is. While a silo mentality can provide excellent security, Data Silos have been criticized for impeding productivity and negatively impacting data integrity.
Data Warehouses
Though Bill Inmon presented the concept of Data Warehousing in the 1970s, the Data Warehouse’s architecture wasn’t developed until the 1980s. Data Warehouses are centralized repositories of information that can be researched to support better-informed decisions. The data comes from a wide range of sources and often arrives unstructured, being cleansed and organized before analysis. Data is accessed through business intelligence tools, SQL clients, and other Analytics applications. A Data Warehouse is often hosted on an organization’s mainframe server or located in the Cloud.
The standard Extract, Transform, and Load (ETL)-based Data Warehouse employs staging, integration, and access layers in its key functions. The staging layer stores raw data taken from the different data sources. The integration layer merges the data by translating it and moving it to an operational data store database. This data is then moved to the Data Warehouse database, where it is organized into hierarchical groups (called “dimensions”), facts, and aggregate facts. The access layer lets users retrieve the translated and organized data.
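As a concrete illustration of these layers, the sketch below walks raw rows through staging, integration, and access using Python’s built-in sqlite3 module. The source file, table names, and schema are hypothetical, chosen only to show the shape of the ETL flow described above.

```python
import sqlite3

def extract(path):
    """Staging layer: pull raw rows from a source file as-is."""
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n").split(",")

def transform(rows):
    """Integration layer: translate raw rows into a consistent shape."""
    for region, product, amount in rows:
        yield region.strip().upper(), product.strip(), float(amount)

def load(rows, db_path="warehouse.db"):
    """Warehouse: organize data into a dimension and a fact table."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS dim_region (name TEXT PRIMARY KEY)")
    con.execute(
        "CREATE TABLE IF NOT EXISTS fact_sales (region TEXT, product TEXT, amount REAL)"
    )
    for region, product, amount in rows:
        con.execute("INSERT OR IGNORE INTO dim_region VALUES (?)", (region,))
        con.execute("INSERT INTO fact_sales VALUES (?, ?, ?)", (region, product, amount))
    con.commit()
    # Access layer: users retrieve the organized data, e.g. aggregate facts.
    return con.execute(
        "SELECT region, SUM(amount) FROM fact_sales GROUP BY region"
    ).fetchall()

totals = load(transform(extract("sales.csv")))
```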
Data Lakes and the Cloud
“The Cloud” is a term describing hosted services available over the internet. The Cloud allows organizations to use computing resources as a utility, much like electricity, rather than building and maintaining in-house computing infrastructure.
At present, Data Lakes can be deployed in a wide variety of environments, including the Cloud. As the use of Cloud-based data services has grown, Cloud-based Data Lakes have come to look very much like their in-house counterparts. The benefits of transferring an in-house Data Lake to the Cloud can include:
- Processing and storage services within the Cloud can easily be scaled up or down, allowing customers to expand storage without physically adding more hardware.
- A pay-per-use model combined with the ability to scale up and down means resources can be added as needed during peak loads, and then scaled back during slower times.
- Infrastructure management and maintenance costs are reduced dramatically by transferring to a Cloud-based service.
Most hosted Cloud storage uses an object-storage architecture. Examples include Amazon Web Services S3 (March 2006), Rackspace Cloud Files (whose code was donated to the OpenStack project in 2010 and released as OpenStack Swift), and Google Cloud Storage (May 2010).
Object stores are a decades-old technology, but they have scalability advantages and are very effective at storing diverse data types. Object stores have traditionally been used for Big Data storage and often hold unstructured data (pictures, movies, music).
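A brief sketch of the object-store pattern may help here. It uses the Amazon S3 API via the boto3 library; the bucket name, key, and file are hypothetical, and the same put/get pattern applies to other object stores (such as OpenStack Swift or Google Cloud Storage) through their own clients.

```python
import boto3

s3 = boto3.client("s3")

# Objects live flat under a key rather than in a filesystem hierarchy,
# which is part of what lets object stores scale to diverse data types.
with open("concert.mp4", "rb") as f:  # hypothetical unstructured file
    s3.put_object(Bucket="example-data-lake", Key="raw/video/concert.mp4", Body=f)

# Retrieval uses the same key.
obj = s3.get_object(Bucket="example-data-lake", Key="raw/video/concert.mp4")
video_bytes = obj["Body"].read()
```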
The Cloud’s storage and data services are continuously upgraded to meet the needs of modern Data Lake architecture, and it is reasonable to expect the number of Cloud-based Data Lakes to grow. The next challenge will be finding new ways to gain insights from them.