By John Felahi.
Multi-Data Lake Management is top of mind at numerous data conferences these days. And why not? As more and more companies find themselves with multiple Data Lake instances, the question of how best to coordinate and synchronize those platforms is paramount. While some of the emerging multi-lake conversation swirls speculatively around which of the big players offers the smartest unifying architecture vision, pragmatic thinkers are often focused more on the here-and-now.
In this first blog of a two-part series, we look to answer the question: “How can I make everything work together today?”
In recent years, multiple organizations within companies have independently built their own Data Lake, and few have paused to proactively plan how respective lakes fit together in a wider Big Data Management plan. Most teams have not asked where each enterprise data set should be hosted across lakes. Nor has there been a lot of attention to developing a coordinated plan to synchronize data across systems or establish a consistent data definition and mapping process that’s in line with traditional Data Governance models. As a result, in many multi-lake organizations, inefficiencies, redundancy and disorganized Data Management are prominent.
Most businesses are split into working units by division or geography and, quite often, these units move to a Data Lake at different paces, or in some cases not at all. This means critical business data is never going to live in one place, making the paradigm of managing data at the enterprise level one of increased complexity. In order to thrive, businesses need to plan and execute on optimizing “time to answers” by demanding more from their data – more insights, more agility, and more flexibility.
The technology exists to meet this need while supporting the ability to manage data throughout the enterprise ecosystem: a single, integrated environment with data producers and consumers throughout the business interacting together like never before. And just as the business must adapt to thrive, the technology must continuously evolve to enable businesses to go on the offensive.
Data Lakes continue to grow in number and are still evolving; the definition of a lake itself now encompasses multi-lake environments, which we feel are best managed according to these five tenets:
#1: Trusted Movement of Information and Objects From Development to Production
The fundamental tenet of operating with multi-lake environments is the ability to seamlessly and securely move data between development and production. Within most businesses, quickly operationalizing the data, metadata, and associated jobs – including transformations and workflows – is too often an unattainable goal. With the development team busy testing new process “objects” including obfuscation, PII detection, job schedules, and transformations, the ability to “promote” them to production for operational use must be easy.
The key to making movement from development to production simple is to ensure the solution or application can import and export objects and data (again, with its metadata). In addition, it must understand environment variables, updating logic within an object as needed, as well as object- and user-level security.
Finally, there must be a way to work across development to production environments that may not be in the same lake – but are built for different teams doing different work.
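To make the import/export idea concrete, here is a minimal sketch of promoting a transformation “object” from a development lake to production. Everything in it is hypothetical – the JSON object layout, the `${var}` placeholder convention, and the environment values are illustrative assumptions, not a description of any particular product:

```python
import json

# Hypothetical target-environment settings (illustrative values only)
PROD_ENV = {"namenode": "hdfs://prod-nn:8020", "db": "sales"}

def export_object(obj: dict) -> str:
    """Serialize an object together with its metadata for promotion."""
    return json.dumps(obj, indent=2)

def import_object(payload: str, env: dict) -> dict:
    """Deserialize and resolve ${var} placeholders for the target lake."""
    obj = json.loads(payload)
    for key, value in obj.get("config", {}).items():
        if isinstance(value, str):
            for var, target in env.items():
                # Substitute environment-specific values at import time
                value = value.replace("${" + var + "}", target)
            obj["config"][key] = value
    return obj

# A development-side transformation object, with metadata attached
transform = {
    "name": "mask_pii_customers",
    "type": "transformation",
    "metadata": {"owner": "data-eng", "pii_detection": True},
    "config": {
        "input_path": "${namenode}/landing/customers",
        "target_db": "${db}",
    },
}

payload = export_object(transform)           # exported from development
promoted = import_object(payload, PROD_ENV)  # imported into production
print(promoted["config"]["input_path"])      # hdfs://prod-nn:8020/landing/customers
```

The design point is that the object itself never hard-codes a cluster or database name; only the environment mapping differs between development and production, which is what lets the same object move across lakes built for different teams.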
In Part Two of this series, we’ll conclude with further exploration of the four remaining tenets of this subject, which include:
#2: Expand to Multi-Lake Environment for Production
Touching on the challenges of multiple production lakes, and dynamic forces such as business divisions, globally dispersed data and appropriately governed and secured information.
#3: Security Taken to the Next Level
Delivering self-service data is complex and understanding the security requirements and executing on them must be balanced with the ability to appropriately access and use resources to speed “time to answers.”
#4: Cutting Storage Costs in a Smart and Planned Way
The promise of inexpensive data storage in the distributed world of Hadoop is no doubt a reality compared to traditional methods. However, storage is not free, and with higher volumes of data, IT teams remain sensitive to its cost.
#5: Audit Logs
Traceability is another concept that increases in complexity as well as importance when part of a Multi-Data Lake enterprise.