In this second blog of a two-part series on Multi-Data Lake Management, we continue to answer the question: “How can I make everything work together today?”
In Part One, “Five Essentials to Multi-Data Lake Management: Synchronizing Everything Together,” we discussed coordinating all the elements of a Data Lake environment. In this part, we pick up where we left off and work through the remaining objectives for Multi-Data Lake environments, which we feel fall into the categories listed below. The first point was also discussed in Part One, but we include it here to complete the discussion:
#1: Trusted Movement of Information and Objects from Development to Production
The fundamental tenet of operating with multi-lake environments is the ability to seamlessly and securely move data between development and production. Within most businesses, quickly operationalizing the data, metadata, and associated jobs, including transformations and workflows, is too often an unattainable goal. With the development team busy testing new process “objects” such as obfuscation, PII detection, job schedules, and transformations, the ability to “promote” them to production for operational use must be easy.
The key to making movement from development to production simple is to ensure the solution or application can import and export objects and data (again, with their metadata). In addition, it must understand environment variables, updating details such as object logic and user security along the way. Finally, there must be a way to work across development and production environments that may not reside in the same lake but are built for different teams doing different work.
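As a rough illustration of what this export/import-with-substitution pattern can look like, here is a minimal Python sketch; the environment names, object fields, and the simple string substitution are hypothetical assumptions, not a description of any particular product’s promotion mechanism.

```python
# Hypothetical sketch: promoting "objects" (jobs, transformations, schedules)
# from a development lake to production with environment-aware substitution.
# All names and structures are illustrative only.
import json

DEV_ENV = {"namenode": "hdfs://dev-nn:8020", "db": "dev_sales"}
PROD_ENV = {"namenode": "hdfs://prod-nn:8020", "db": "prod_sales"}

def export_object(obj):
    """Serialize an object together with its metadata for promotion."""
    return json.dumps(obj, indent=2)

def promote(exported, source_env, target_env):
    """Rewrite environment-specific values before importing into production."""
    promoted = exported
    for key, dev_value in source_env.items():
        promoted = promoted.replace(dev_value, target_env[key])
    return json.loads(promoted)

masking_job = {
    "name": "mask_pii_customers",
    "type": "obfuscation",
    "input": "hdfs://dev-nn:8020/landing/dev_sales/customers",
    "schedule": "daily@02:00",
    "metadata": {"owner": "data_eng", "pii": True},
}

print(promote(export_object(masking_job), DEV_ENV, PROD_ENV))
```

The point of the sketch is simply that objects travel with their metadata and that environment-specific values are rewritten, rather than hand-edited, on the way into production.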
#2: Expanding to a Multi-Lake Production Environment
Most businesses, if not underway already, are planning for multiple production Data Lakes, typically segmented by division or geography. For example, a large global pharmaceutical company took on the challenge of answering, “How do we integrate our hugely successful Data Lake built by our US team with our development Data Lakes in Europe and Japan, accelerating them through testing to deployment in their own production environments?” The data had to be appropriately governed and secured to make this journey. Even when data that can benefit the business is prohibited from being “moved” across borders for legal reasons, rich metadata tagging makes it possible to share that data without moving it.
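A minimal sketch of what such metadata tagging might look like appears below; the catalog fields and the access rule are illustrative assumptions, not any specific catalog’s schema.

```python
# Illustrative sketch (hypothetical field names): a catalog entry that tags a
# dataset with residency and governance metadata so other lakes can discover
# and reference it without the data itself crossing a border.
catalog_entry = {
    "dataset": "eu_clinical_trials",
    "home_lake": "emea-prod",
    "residency": "EU",                      # data must stay in this region
    "shareable_across_borders": False,      # only metadata may travel
    "tags": ["clinical", "pii", "gdpr"],
    "schema": ["trial_id", "site", "outcome_score"],
    "profile": {"rows": 1_240_000, "null_pct": 0.3},
}

def visible_to(lake_region, entry):
    """Return what a remote lake may see: metadata always, data only if allowed."""
    if lake_region == entry["residency"] or entry["shareable_across_borders"]:
        return {"access": "full", **entry}
    return {"access": "metadata_only",
            "dataset": entry["dataset"],
            "schema": entry["schema"],
            "tags": entry["tags"]}

print(visible_to("US", catalog_entry))  # metadata only, no data movement
```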
The critical aspect of supporting multi-lake production environments is ensuring data is business-ready, serving multiple use cases for BI and analytics tools and delivering faster time to insight. The ability to move information quickly, and within the necessary constraints, is what expands insights beyond one Data Lake to the entire enterprise.
“451 Research believes that the need for data to be filtered, processed, treated and managed to make it suitable for multiple analytics use cases is critical to delivering value from the Data Lake,” said Matt Aslett, research director, data platforms and analytics, 451 Research. “Data governance and self-service data preparation are key elements of functional Data Lakes and associated data marketplaces, with machine learning-driven insights and recommendations an increasingly important aspect of accelerating the generation of value from enterprise data.” –Beyond The Data Lake: The Rise of Managed Self-Service and the Data Marketplace
#3: Security Taken to the Next Level
Understanding the security requirements, and executing on them, to share objects and data across Data Lakes is more complex in a multi-lake environment. Working effectively with Active Directory (AD) or LDAP and the Hadoop distribution security frameworks (Sentry, Ranger, etc.), while defining user roles within the application (Podium, for example), is the key to establishing appropriate processes. Once that is done, traditional IT management can easily define and control what moves and who can move it – or at least view it. In today’s world, where cyber security is top of mind, secure access has never been more critical and needs to be thoroughly planned and executed. Given how critical “time to answers” from data is, security must be balanced with the ability to appropriately access and use resources to achieve business goals.
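As a rough sketch of the kind of mapping this implies, the snippet below resolves AD/LDAP group membership to application roles and checks a requested action; the group names, roles, and permission sets are purely illustrative stand-ins for what Ranger, Sentry, or the application itself would actually enforce.

```python
# Minimal sketch, assuming AD/LDAP group names and role mappings are resolved
# elsewhere (e.g., by the distribution's security framework); all names are
# illustrative only.
ROLE_PERMISSIONS = {
    "lake_admin":   {"view", "promote", "move"},
    "data_steward": {"view", "promote"},
    "analyst":      {"view"},
}

AD_GROUP_TO_ROLE = {
    "CN=DataLakeAdmins,OU=Groups,DC=corp": "lake_admin",
    "CN=DataStewards,OU=Groups,DC=corp":   "data_steward",
    "CN=BIUsers,OU=Groups,DC=corp":        "analyst",
}

def can(user_groups, action):
    """Check whether any of the user's AD groups grants the requested action."""
    roles = {AD_GROUP_TO_ROLE[g] for g in user_groups if g in AD_GROUP_TO_ROLE}
    return any(action in ROLE_PERMISSIONS[r] for r in roles)

print(can(["CN=DataStewards,OU=Groups,DC=corp"], "move"))     # False
print(can(["CN=DataStewards,OU=Groups,DC=corp"], "promote"))  # True
```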
#4: Cutting Storage Costs in a Smart and Planned Way
The promise of inexpensive data storage in the distributed world of Hadoop is no doubt a reality compared to traditional methods. However, storage is not free, and as data volumes grow, IT teams remain sensitive to storage costs.
There are a number of approaches to cost efficiency within Data Management, but few are inherently designed as part of comprehensive Enterprise Data Management in the Data Lake. One approach leverages a Data Conductor capability to do this.
This capability establishes “levels” of data availability at any point in the data lifecycle. For example, data can be fully managed in the lake when it needs to be available for months at a time for frequent analysis by business users. When the data is still of interest but no longer requires active management, it can be registered with just its profiling statistics, metadata, and some sample data, while the data itself remains in the source system (ERP, SFA, weblogs, etc.). Finally, data can be represented by just its schema (the fields the source system has available) in the Data Lake without copying the data, so business data consumers can find data of potential use and ask for it. It then becomes simple for IT to plan and manage how many copies of the data reside within the Data Lake, optimizing storage capacity and costs.
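The three levels described above could be modeled along the lines of the sketch below; the tier names, the usage signals, and the threshold used to pick a tier are assumptions for illustration only, not a particular product’s logic.

```python
# Hedged sketch of the three "levels" of data availability described above.
from enum import Enum

class Tier(Enum):
    MANAGED = "full data copied into the lake for active analysis"
    REGISTERED = "profile stats, metadata, and a sample; data stays at source"
    SCHEMA_ONLY = "schema catalogued so consumers can discover and request it"

def choose_tier(days_since_last_query, actively_analyzed):
    """Pick a storage tier from simple usage signals (illustrative thresholds)."""
    if actively_analyzed:
        return Tier.MANAGED
    if days_since_last_query <= 180:
        return Tier.REGISTERED
    return Tier.SCHEMA_ONLY

print(choose_tier(days_since_last_query=30, actively_analyzed=True))    # Tier.MANAGED
print(choose_tier(days_since_last_query=400, actively_analyzed=False))  # Tier.SCHEMA_ONLY
```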
#5: Audit Logs
Traceability is another concept that increases in both complexity and importance as part of a Multi-Data Lake enterprise. The questions, however, remain the same: Who worked with the data? When? Where did they move it to, and from where? Tracking all of this in audit logs or as part of data lineage (depending on the actions) must be part of multi-lake strategy and planning to ensure fast answers to regulatory compliance questions.
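The sketch below shows the kind of audit record those questions imply; the field names and event structure are assumptions for illustration, not a standard audit or lineage schema.

```python
# Illustrative sketch of a who/what/when/where audit record.
from datetime import datetime, timezone
import json

def audit_event(user, action, dataset, source_lake, target_lake=None):
    """Build a single audit-log entry suitable for lineage and compliance queries."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "action": action,            # e.g. "read", "transform", "promote"
        "dataset": dataset,
        "source_lake": source_lake,
        "target_lake": target_lake,  # populated only when data moved
    }

log = [
    audit_event("jdoe", "transform", "sales.orders", "us-prod"),
    audit_event("asmith", "promote", "sales.orders", "us-dev", "us-prod"),
]
print(json.dumps(log, indent=2))
```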
In Summary
As the Data Lake concept matures and enterprises deploy distributed data repositories at scale, the conversation is shifting. Given the size and complexity of today’s enterprises, it is clear that managing a single Data Lake is not enough, as critical business data will never all live in one place. Thus, the planning and execution of Multi-Data Lake management has become a prevalent part of the Big Data conversation. Traditionally, specialized applications handled this work in the big iron / mainframe era, but now a new class of applications is needed to securely manage and process data moving into – or living in – Data Lakes.