One organization’s Data Lake may very well be someone else’s Data Swamp. The difference lies in how data is curated. What does this mean? For starters, a Data Lake is a place where vast amounts of data of various types and structures can be ingested, stored, assessed, and analyzed. Specifically, Data Lakes make it easy for Data Scientists to mine and analyze data, require minimal transformation if any, facilitate automated pattern identification, and serve as a good online archive.
A Data Swamp, in contrast, has little organization and no system. Data Swamps lack curation: there is little to no active management throughout the data life cycle, and little to no contextual metadata or Data Governance. As a result, Data Swamps are frustrating and of little or no use.
As Kimberly Nevala notes in her DATAVERSITY® article, In Defense of the Data Swamp, users drown in Data Lakes because they are “uniformly uniform.” Blindly clean up a Data Lake, and you risk its failing and floundering. However, choose not to clean up a Data Lake, and you risk missing out on business analysis, Machine Learning, and analysis of Dark Data. In either case, an organization’s Data Lake can become a time and money sink. In Can Failed Data Lakes Succeed as Data Marketplaces, Dan Woods states:
“With the Data Lake, while companies can store massive amounts and varieties of data, they have been unable to effectively manage that data and allow a large number of people with moderate expertise levels to explore the data, come up with useful queries, extract the signal through some regular production process that becomes part of the way a business runs.”
To move forward with useful Data Lakes, companies need to have a different dialog about Data Lakes vs. Data Swamps — one that revisits the similarities and differences between the two.
Similarities of Data Lakes and Data Swamps: NoSQL
Peer into a Data Lake or a Data Swamp and you will see NoSQL, the technology responsible for the benefits of a Data Lake. These advantages include easy collection and ingestion, great variety, and faster data preparation. While NoSQL cannot be easily queried and is not advisable for daily, set operations, it does contain diamonds in the rough. NoSQL results from “a complex configuration of data handling tools including Hadoop or other Data storage system, cluster services, data transformation and data integration.” By default, Cloud migration and streaming become an ingest point for data and, initially, a de facto Data Lake. Advantages for NoSQL, and thus the Data Lake, include:
- Schema on Read Architecture: Organizes data when it is queried rather than when it is ingested. This allows consideration of multiple data attributes at once and the flexibility to ask ambiguous business-driven questions.
- Low Cost Storage: Data Lakes attract businesses because they are an affordable way to store a growing amount of Big Data, especially with machine data becoming prolific from a variety of sensors (IoT).
- Scalability and Agility: Data Lakes can take advantage of distributed file systems for storage and are thus highly scalable. The use of open source technologies also reduces storage costs. Based on structure, they are inherently more flexible and Agile.
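To make the schema on read idea more concrete, here is a minimal Python sketch. It assumes raw records are stored untouched as JSON lines and that a schema is only applied when someone asks a question. The sample records, the field names, and the `read_with_schema` helper are all hypothetical illustrations, not part of any particular NoSQL product.

```python
import json

# Raw events land in the lake exactly as produced (hypothetical sample data).
RAW_EVENTS = [
    '{"device": "sensor-1", "temp_c": "21.5", "ts": "2024-01-01T00:00:00"}',
    '{"device": "sensor-2", "temp_c": 19, "site": "plant-a"}',
    '{"user": "alice", "action": "login"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema at read time.

    Keeps only records that contain every field named in the schema,
    and casts each of those fields with the function given for it.
    """
    for line in raw_lines:
        record = json.loads(line)
        if all(name in record for name in schema):
            yield {name: cast(record[name]) for name, cast in schema.items()}


# Two different questions, two different schemas, same untouched raw data.
temperature_schema = {"device": str, "temp_c": float}
activity_schema = {"user": str, "action": str}

temperatures = list(read_with_schema(RAW_EVENTS, temperature_schema))
activity = list(read_with_schema(RAW_EVENTS, activity_schema))

print(temperatures)
print(activity)
```

Nothing about the stored data changes between the two reads; only the question does. That is the flexibility, and also the risk, since nothing at ingestion time stops undocumented records from piling up.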
But all these advantages can also describe a Data Swamp. A badly designed, inadequately documented, poorly maintained Data Swamp also has schema on read architecture, low-cost storage, scalability, and agility, with the drawback that Data Scientists and other users cannot analyze and exploit this data effectively.
However, getting too fixated on the drawbacks of a Data Swamp and over-cleaning a Data Lake negates the Data Lake’s advantages. Enacting very rigid Data Governance over who or what can add to the Data Lake may make it inaccessible to people, so that little data is ever supplied to it. Add an overly granular metadata structure with a very detailed catalog, and you lose data variety to cumbersome cataloging and waste Data Scientists’ time wading through metadata. At that point, wouldn’t a Data Warehouse be better than the clearest Data Lake?
Essentially, ignoring or not making use of NoSQL advantages results in a Bottlenecked Data Lake. The Data Lake is treated as a next-generation Data Warehouse technology rather than a new approach to data. Companies then try to operationalize the Data Lake, using a technology not meant for mission-critical computing infrastructure, and expect it to deliver value the same way a Data Warehouse does.
Differences: Oversaturation vs. Organization
While Data Lakes and Data Swamps share the flexibility of NoSQL, Data Swamps become oversaturated with data. As departments and various users offload non-curated data into the Data Lake, a series of “standing pools” of data emerges. These cesspools grow murky with an unknown number of data types that can’t readily integrate with each other to produce insights. As the data disconnection continues, with an increasing number of projects and tools, data becomes lost and the Data Lake becomes an abandoned headache. In How to Prevent a Data Lake from Becoming a Data Swamp, Andrew Brust compares this kind of oversaturation and lack of organization to a non-intuitive personal file system in which users cannot locate what they need through Windows Explorer or Apple Finder. Too much data without the appropriate organization characterizes the Data Swamp.
Luckily, many Data Swamps can be cleaned by using Data Curation and Data Governance to organize data sets, but not to the point of over-organization that results in bottlenecks. Structuring a Data Lake appropriately includes:
- Prioritization through Data Governance: According to Shannon Fuller, the Director of Data Governance at Carolinas Health Care System, knowing what your priorities are is the key to implementing an efficient structure for the Data Lake through Data Governance. By understanding what the organization was trying to do with the Data Lake, including the focal points and desired results, Fuller implemented a Data Lake that worked for the Carolinas Health Care System, protecting patient information and intellectual property while ensuring a common repository and innovation.
The Data Lake created, maintained, and used by the Carolinas Health Care System may be considered a Data Swamp by the company Decision Resources Group, which supplies reports on companies, drugs, and diseases. In this case, tracking pharmaceutical usage for off-label use has more value to the Group than does the Carolinas Health Care System’s need to explore how doctors tend to prescribe medications. Since Decision Resources Group operates in a different context, it would find segments of oversaturated data as well as bottlenecks in that Data Lake.
- Curating Contextual Metadata: As Kathy Rondon states, Data Curation is about “contextual metadata.” To that end, data assets themselves never need to be centralized, stored, or accessed in a single repository, such as a Data Lake. Rondon recommends a Federated System with smaller units that can maintain Authority over smaller repositories. She talks about the Open Archival Information System (OAIS) as a template.
The OAIS model chunks multiple data files and packages them logically and physically as an entity (similar to a book or a folder on a computer). Creating metadata about these chunks or containers of files, such as a primary schema, allows for effective Metadata Management of a particular business’s data sets without getting bogged down in systematic details and individual data files that potentially bottleneck a Data Lake. Every piece of data may not be known in these data subsets, but the contextual metadata about the set gives a Data Analyst enough information to know in general what is inside. While this means assuming some level of data saturation where exploring outside the business context may be difficult, the Data Lake retains scalability and flexibility advantages within the business context.
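The package-level approach described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `DataPackage` class, its fields, the `find_packages` helper, and the sample catalog entries are all hypothetical, and they do not reproduce the archival model’s actual packaging format. The point is that metadata is recorded once per container of files, not once per file.

```python
from dataclasses import dataclass, field


@dataclass
class DataPackage:
    """A container of related files, described once at the package level."""
    name: str
    owner: str
    description: str
    primary_schema: dict  # column name -> type label (hypothetical convention)
    files: list = field(default_factory=list)


def find_packages(catalog, keyword):
    """Return the names of packages whose contextual metadata mentions keyword.

    Only the description and the primary schema's column names are searched;
    the individual files are never opened.
    """
    keyword = keyword.lower()
    return [
        package.name
        for package in catalog
        if keyword in package.description.lower()
        or any(keyword in column.lower() for column in package.primary_schema)
    ]


# Hypothetical catalog with two packages.
catalog = [
    DataPackage(
        name="prescriptions_2016",
        owner="pharmacy-analytics",
        description="Medication orders by prescribing doctor",
        primary_schema={"doctor_id": "string", "medication": "string"},
        files=["orders_q1.csv", "orders_q2.csv"],
    ),
    DataPackage(
        name="sensor_dump",
        owner="facilities",
        description="Raw building sensor readings",
        primary_schema={"device": "string", "reading": "float"},
        files=["dump_001.json"],
    ),
]

matches = find_packages(catalog, "medication")
print(matches)
```

An analyst can tell in general what each package holds without cataloging every file inside it, which keeps the metadata burden small enough to avoid the bottleneck discussed earlier.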
Conclusion
Businesses need a Data Lake that retains the advantages of schema on read, low cost storage, flexibility, and agility. These characteristics can be found in both Data Swamps and Data Lakes. However, an oversaturated Data Lake does not lend itself to insight and analysis. A Data Lake needs Data Governance and Data Curation organizations able to prioritize data and to create the best business context for exploring data sets. So, simply avoiding a Data Swamp will not necessarily make for a useful Data Lake. Knowing how a Data Lake fits into the Data Strategy (i.e., focal points and desired results), in addition to providing enough contextual metadata, will help limit the risk of the Data Swamp aspects of a Data Lake while allowing the flexibility for data analysis and new insights.