
What Is a Data Lake?


A data lake is an environment where vast amounts of data, of various types and structures, can be ingested, stored, assessed, and analyzed. Data lake technologies can scale to massive volumes of data, and because the data is stored in a relatively raw form, datasets are easy to combine.

A data lake architecture can centralize data over distributed storage, providing a scalable, fast, secure, and economical solution.

Data lakes serve many purposes, including:

  • An environment for data scientists to mine and analyze vast amounts of raw, structured, and unstructured data
  • A central storage area for raw data, with minimal (if any) transformation
  • Alternate storage for a detailed historical data warehouse
  • An online archive for records
  • An environment to ingest streaming data with automated pattern identification
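As a minimal sketch of the "central storage area for raw data, with minimal (if any) transformation" purpose above, the following Python stores incoming bytes unchanged and records only descriptive metadata next to them. The directory layout, the `land_raw` name, and the sidecar fields are hypothetical choices made for illustration, and a local folder stands in for distributed or cloud storage.

```python
import hashlib
import json
from datetime import date
from pathlib import Path


def land_raw(lake_root: Path, source: str, filename: str, payload: bytes,
             ingest_date: date) -> Path:
    """Store payload unchanged under <lake_root>/raw/<source>/<YYYY-MM-DD>/.

    Hypothetical layout: nothing is parsed or transformed. A small JSON
    sidecar records where the bytes came from, their size, and a checksum.
    """
    target_dir = lake_root / "raw" / source / ingest_date.isoformat()
    target_dir.mkdir(parents=True, exist_ok=True)

    target = target_dir / filename
    target.write_bytes(payload)  # bytes are written exactly as received

    sidecar = {
        "source": source,
        "ingest_date": ingest_date.isoformat(),
        "size_bytes": len(payload),
        "sha256": hashlib.sha256(payload).hexdigest(),
    }
    sidecar_path = target_dir / (filename + ".meta.json")
    sidecar_path.write_text(json.dumps(sidecar, indent=2))
    return target
```

Keying the path by source and ingest date is only one possible convention. The point of the sketch is that the original file survives untouched, so later consumers can decide how to interpret it.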

Other Definitions of a Data Lake Include:

  • “A collection of storage instances of various data assets additional to the originating data sources.” (Kelle O’Neal)
  • A technology that “allows raw, structured, and unstructured data to reside in one repository and enables comprehensive analysis of big and small data from a single location” (Paramita (Guha) Ghosh)
  • “A pool of unstructured and structured data, stored as-is, without a specific purpose in mind.” (Amber Lee Dennis)
  • “A storage repository that holds a vast amount of raw data in its native format until it is needed for analytics applications.” (TechTarget)
  • A place where “unstructured/prestructured data resides.” (Harvard Business Review)
  • An affordable way to “store big data in near limitless amount.” (Forbes)

Data Lake Use Cases Include:

  • Providing a data system to “support innovation and insights in health care service delivery”
  • Sharing data across discrete corporate divisions to “increase research and operational efficiency, escalate output, and accelerate drug research”

Businesses Use Data Lakes to:

  • Find and act on business opportunities
  • Stimulate innovation
  • Lower infrastructure and maintenance costs 
  • Store data on the cloud
  • Pipe different data from one storage area to another
  • Provide a central Data Management system for big data over distributed storage
  • Deal with complex and diversified data
  • Meet business demands for more insight, agility, and flexibility
  • Store different types of data in their original formats until they need to be structured and analyzed
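The last point, keeping data in its original format until it needs to be structured and analyzed, is often described as schema-on-read. The sketch below is one hedged illustration in Python: two raw payloads (a CSV export and JSON lines, both invented sample data) are kept as text and only given types and common field names when an analysis reads them. The field names and the `read_orders_csv`, `read_orders_jsonl`, and `total_amount` helpers are hypothetical.

```python
import csv
import io
import json

# Raw payloads kept exactly as they arrived (invented sample data).
RAW_CSV = "order_id,amount\n1,19.99\n2,5.00\n"
RAW_JSONL = '{"id": 3, "total": "42.50"}\n{"id": 4, "total": "7.25"}\n'


def read_orders_csv(raw: str) -> list[dict]:
    """Apply structure at read time: parse the CSV and cast the fields."""
    return [
        {"order_id": int(row["order_id"]), "amount": float(row["amount"])}
        for row in csv.DictReader(io.StringIO(raw))
    ]


def read_orders_jsonl(raw: str) -> list[dict]:
    """Map a differently shaped source onto the same fields at read time."""
    orders = []
    for line in raw.splitlines():
        if line.strip():
            record = json.loads(line)
            orders.append(
                {"order_id": record["id"], "amount": float(record["total"])}
            )
    return orders


def total_amount(orders: list[dict]) -> float:
    """A simple analysis over the combined, now-structured records."""
    return round(sum(order["amount"] for order in orders), 2)


# Combine both sources only at analysis time.
combined = read_orders_csv(RAW_CSV) + read_orders_jsonl(RAW_JSONL)
print(total_amount(combined))
```

Because neither payload is rewritten at storage time, a different analysis could later read the same raw text with a different set of fields or types.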

