Click to learn more about author Brian Lachance.
Enterprises migrating on-prem data environments to the cloud in pursuit of more robust, flexible, and integrated analytics and AI/ML capabilities are fueling a surge in cloud data lake implementations. The rationale is justified: Compared with legacy on-prem infrastructure, cloud data lakes – if implemented correctly – promise scalability and agility gains while lowering costs.
But a migration is only as strong as its underlying security mechanisms, and cloud data lakes require a security strategy that must be understood and continuously maintained. A cloud data lake combines a cloud platform, object storage, multiple data processing engines, and modern analytics tools in a single environment, and each of these components can expose data to attackers if integrated improperly.
This shouldn’t scare organizations off from cloud data lake migrations and modernization projects. Cloud data lake security practices have advanced considerably over the past couple of years. With the right expertise, end-to-end cloud data lake security and compliance can be managed well enough to run production workloads, whether by organizations fortunate enough to have the requisite knowledge in-house or by third parties responsible for execution and operation.
To that end, here are 10 cloud data lake security best practices that help manage risk and promote continuous visibility for deployment monitoring and protection:
1. Isolate security functions
As a foundational best practice, security functionality should be separated from nonsecurity functions and user access should be restricted to the absolute least privilege necessary. In the context of cloud data lake security, this means limiting roles on both the cloud and data lake platforms, and ensuring only experienced security personnel can alter cloud security controls. A recent report from DivvyCloud highlighted misconfiguration and user inexperience as particularly key breach risks. Security function isolation and cloud expertise are essential to mitigate risk.
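To make least privilege concrete, here is a minimal sketch that builds a read-only IAM policy document scoped to a single prefix of a data lake bucket. The bucket name, the prefix, and the helper function read_only_policy are assumptions made for illustration; the policy grammar and the s3:ListBucket and s3:GetObject actions are standard AWS IAM.

```python
import json


def read_only_policy(bucket: str, prefix: str) -> dict:
    """Build an IAM policy document granting read-only access to one prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is limited to keys under the given prefix.
                "Sid": "ListOnlyThePrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                # Objects can be read but not written or deleted.
                "Sid": "ReadObjectsUnderPrefix",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


# Hypothetical bucket and prefix, for illustration only.
policy = read_only_policy("example-data-lake", "curated/sales")
print(json.dumps(policy, indent=2))
```

An analyst role holding only this policy could query curated sales data without being able to alter it or touch any security configuration.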
2. Harden the cloud platform
Start from a unique cloud account to harden and isolate your cloud data lake deployment. For example, those on AWS can leverage the AWS Organizations service to easily create and manage a new account. With a unique cloud account running your data lake, implement hardening protections in line with CIS Benchmarks. These guidelines include carefully applied configuration settings that support account security. Using a unique hardened account increases security by providing logical data separation from your other cloud services.
3. Secure the network perimeter
The secure network perimeter you design for your cloud data lake deployment constitutes its first line of defense. The method you select must account for your specific circumstances. Key compliance or bandwidth requirements may very well mean that a private connection or cloud-based VPN is needed. If any sensitive data is stored in the cloud and non-private connections are allowed, a firewall becomes crucial for maintaining traffic control and visibility.
Leverage a third-party next-generation firewall available through your cloud platform marketplace. These firewalls offer advanced features like intrusion prevention, application awareness, and threat intelligence, and generally complement native cloud security tools.
You can effectively secure and ensure consistent compliance across all your cloud environments by utilizing these firewalls in a hub-and-spoke configuration. Throughout your cloud infrastructure environments, only firewalls should have public IP addresses. Limit unauthorized access and data exfiltration risks with strong ingress and egress policies featuring intrusion prevention profiles.
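The rule that only firewalls should be publicly reachable lends itself to an automated check. The sketch below assumes a simplified, hypothetical rule record (the Rule class and its fields are not any cloud provider's API) and flags ingress rules open to the internet on anything that is not a designated firewall.

```python
from dataclasses import dataclass

# CIDR blocks that mean "anywhere on the internet" for IPv4 and IPv6.
OPEN_CIDRS = {"0.0.0.0/0", "::/0"}


@dataclass(frozen=True)
class Rule:
    resource: str   # hypothetical resource name
    direction: str  # "ingress" or "egress"
    cidr: str       # source (ingress) or destination (egress) range
    port: int


def publicly_exposed(rules, firewalls):
    """Return ingress rules open to the internet on non-firewall resources."""
    return [
        r for r in rules
        if r.direction == "ingress"
        and r.cidr in OPEN_CIDRS
        and r.resource not in firewalls
    ]


rules = [
    Rule("hub-firewall", "ingress", "0.0.0.0/0", 443),
    Rule("spark-master", "ingress", "0.0.0.0/0", 8080),
    Rule("spark-master", "ingress", "10.0.0.0/16", 22),
]
for violation in publicly_exposed(rules, firewalls={"hub-firewall"}):
    print(f"exposed: {violation.resource} port {violation.port}")
```

In practice the rule records would be exported from the cloud platform's own inventory, and the check would run on a schedule so that drift is caught quickly.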
4. Implement host-based security
Too often overlooked in cloud platforms, host-based security defends the host itself and stands as the final layer of protection for data against attacks. Host security is a broad endeavor and must adapt to specific service and function use cases.
Host intrusion detection is a key component of host-based security. An agent running on the host detects suspicious activity, based on either known threat signatures or behavioral anomalies, and alerts administrators to the unusual event. Machine learning algorithms are also being introduced in hybrid host-based intrusion detection and, when combined with either signature- or anomaly-based systems, can offer even higher detection rates.
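The two detection styles can be sketched in a few lines. This is a toy illustration, not how any particular agent works: the signature strings, the three-standard-deviation threshold, and the idea of counting failed logins per hour are all assumptions chosen for the example.

```python
from statistics import mean, stdev

# Hypothetical signatures: substrings that should never appear in normal activity.
SIGNATURES = ("/etc/shadow", "nc -e", "chmod 777")


def signature_hits(events):
    """Signature-based detection: return events containing a known-bad pattern."""
    return [e for e in events if any(sig in e for sig in SIGNATURES)]


def is_anomalous(history, current, threshold=3.0):
    """Anomaly-based detection: flag a count far above the historical mean."""
    if len(history) < 2:
        return False  # not enough data to establish a baseline
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current > mu
    return (current - mu) / sigma > threshold


events = ["cat /var/log/app.log", "cat /etc/shadow", "ls /tmp"]
failed_logins_per_hour = [2, 3, 2, 4, 3]

print(signature_hits(events))
print(is_anomalous(failed_logins_per_hour, current=40))
```

Signature matching catches known attacks with few false positives, while the statistical check can surface activity that no signature describes yet; hybrid systems use both.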
File integrity monitoring (FIM) tracks any file changes in the cloud environment, effectively detecting and tracking the progress of attacks. Attackers use exploits that escalate their privileges within the cloud environment by corrupting a series of files or services. FIM solutions recognize those changes to thwart such attacks. Many can also restore corrupted files. FIM capabilities are often required in order to satisfy regulatory compliance.
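At its core, file integrity monitoring compares cryptographic hashes of files against a trusted baseline. The following is a minimal sketch using only the Python standard library; the function names and the monitored directory are assumptions, and a real FIM product would also protect the baseline itself, watch permissions and ownership, and raise alerts.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(root: Path) -> dict:
    """Map each file under root (by relative path) to its SHA-256 hash."""
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff(baseline: dict, current: dict) -> dict:
    """Report files added, removed, or modified since the baseline."""
    shared = baseline.keys() & current.keys()
    return {
        "added": sorted(current.keys() - baseline.keys()),
        "removed": sorted(baseline.keys() - current.keys()),
        "modified": sorted(k for k in shared if baseline[k] != current[k]),
    }
```

A typical use is to call snapshot on a directory such as a service's configuration folder right after a known-good deployment, store the result somewhere attackers cannot write, and periodically pass a fresh snapshot to diff.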
Log management is yet another vital security practice that will need attention. Analysis of logged events provides a key mechanism for investigating security incidents. For this reason, processes, procedures, and controls around log storage, retention, and deletion should be carefully designed to meet your security framework or regulatory compliance requirements. Many available log management tools are designed for integration with cloud-based solutions (such as Amazon CloudWatch, to continue with AWS as my cloud example) and offer data visualization and resource-usage alerts in addition to capable log collection. Commonly, secure log management policies will copy logs into separate storage in real time to help preserve their integrity.
5. Introduce strong identity management and authentication measures
Identity management is the backbone of robust access control. Secure your cloud data lake by integrating your identity provider and cloud provider; for example, by federating Active Directory with AWS using SAML 2.0. Managing third-party applications or data lakes with multiple services can require a more complex array of authentication services, possibly combining SAML clients and providers with Auth0, OpenLDAP, Kerberos, Apache Knox, or others.
6. Leverage authorization controls
Cloud providers offer configurable data and resource access controls as part of their platform-as-a-service solutions. These identity and access management (IAM) policies and role-based access controls (RBACs) allow granular access limitations, in some services down to the row and column level. Use these capabilities to enforce least privilege access policies. AWS, for example, offers fine-grained access controls through its Lake Formation service, which centralizes much of the work of securing your data lake. Options for sharing data across services and accounts are available as well.
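To illustrate what row-and-column-level access means, the sketch below applies a per-role grant to a small table in plain Python. It models the concept only and is not the Lake Formation API; the role names, columns, and GRANTS structure are hypothetical.

```python
# Hypothetical grants: which columns a role may see, and which rows.
GRANTS = {
    "emea_analyst": {
        "columns": ["order_id", "region", "amount"],
        "row_filter": lambda row: row["region"] == "EMEA",
    },
    "auditor": {
        "columns": ["order_id", "region", "amount", "customer_email"],
        "row_filter": lambda row: True,
    },
}


def query_as(role, rows):
    """Return only the rows and columns the role's grant allows."""
    grant = GRANTS.get(role)
    if grant is None:
        raise PermissionError(f"no grant for role {role!r}")  # default deny
    return [
        {col: row[col] for col in grant["columns"]}
        for row in rows
        if grant["row_filter"](row)
    ]


orders = [
    {"order_id": 1, "region": "EMEA", "amount": 120, "customer_email": "a@example.com"},
    {"order_id": 2, "region": "APAC", "amount": 75, "customer_email": "b@example.com"},
]
print(query_as("emea_analyst", orders))
```

The important design choice is default deny: a role without an explicit grant sees nothing, and sensitive columns such as customer_email are visible only where a grant names them.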
7. Enforce encryption
Cloud providers offer guidance in encryption best practices, which should be followed. Ensuring the effectiveness of this fundamental security function takes a strong grasp of IAM, encryption key rotation policies, and application configuration. AWS users should learn AWS KMS best practices. Encryption must protect both data at rest and data in motion and may require self-provided certificates and an associated rotation regimen if using integrated third-party services.
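A rotation regimen is easier to keep when something checks it automatically. The sketch below assumes you can obtain the last rotation date of each key or certificate from your own inventory; the key names and the 365-day maximum age are illustrative assumptions, not a recommendation from any provider.

```python
from datetime import date, timedelta


def keys_due_for_rotation(last_rotated, today, max_age_days=365):
    """Return names of keys whose last rotation is at least max_age_days old."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(name for name, rotated in last_rotated.items() if rotated <= cutoff)


# Hypothetical inventory of keys and certificates with their last rotation dates.
inventory = {
    "lake-storage-key": date(2023, 1, 15),
    "etl-tls-cert": date(2024, 3, 1),
}
print(keys_due_for_rotation(inventory, today=date(2024, 6, 1)))
```

Self-provided certificates for third-party services are the usual gap, since the cloud provider's managed rotation does not cover them.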
8. Maintain vigilant vulnerability and patch management
Implement a comprehensive vulnerability and security patching strategy that combines automated detection, assessment of risk and severity, testing, and patch deployment. Use alternative mitigation techniques to bridge the time frame between detection, testing, and patch deployment. Turning off unnecessary services and utilizing firewall controls can both be effective ways to reduce the time your environment is vulnerable.
Visibility is the key factor in your vulnerability management program. Understanding every risk within your environment and prioritizing patching will shorten the opportunity for exploitation and data loss.
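Prioritization can be as simple as ranking findings by severity and exposure. In the sketch below, the Finding record, the placeholder identifiers, and the 2.0 bonus for internet-facing hosts are all assumptions for illustration; a real program would tune the weighting to its own risk model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    host: str
    vuln_id: str           # placeholder identifier, not a real advisory
    cvss: float            # base severity score, 0.0 to 10.0
    internet_facing: bool  # whether the host is reachable from outside


def priority(finding):
    """Severity plus an arbitrary illustrative bonus for exposed hosts."""
    return finding.cvss + (2.0 if finding.internet_facing else 0.0)


def patch_order(findings):
    """Return findings sorted so the highest-priority patch comes first."""
    return sorted(findings, key=priority, reverse=True)


findings = [
    Finding("batch-worker", "EXAMPLE-1", 9.1, False),
    Finding("api-gateway", "EXAMPLE-2", 7.5, True),
    Finding("notebook-host", "EXAMPLE-3", 5.0, False),
]
for f in patch_order(findings):
    print(f.host, f.vuln_id, priority(f))
```

Here the exposed gateway outranks a nominally more severe finding on an internal worker, which is the kind of trade-off that visibility into the environment makes possible.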
9. Practice compliance monitoring and incident response
Cloud security functions, including early threat detection, investigation, and response, call for an effective compliance monitoring and incident response plan. Consider integrating existing security information and event management (SIEM) infrastructure to perform cloud monitoring. Cloud deployments have unique threats that require training and experience to properly identify and resolve. Adopt incident response runbooks as a strategy to quickly and effectively react to security incidents.
10. Implement data loss prevention
Cloud data lake implementations leverage persistent data in cloud object storage in order to optimize and maintain availability and integrity. For example, Amazon S3 offers secure storage and high availability, as well as reliable performance.
In case of unintentional object replacement or deletion, object versioning and retention capabilities provide crucial redundancy. Evaluate and address data loss risks across all services that store or manage data. Robust authorization protections limiting access to delete and update functions will effectively reduce the risks of data loss due to user activity.
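The way versioning protects against accidental replacement or deletion can be shown with a toy in-memory model. This is not the Amazon S3 API; the VersionedStore class and its methods are hypothetical. It does mirror one real behavior of versioned buckets: a delete adds a marker on top of the history rather than erasing earlier versions.

```python
class VersionedStore:
    """Toy object store that keeps every version of every key."""

    def __init__(self):
        self._versions = {}  # key -> list of payloads; None is a delete marker

    def put(self, key, data):
        """Store a new version without discarding older ones."""
        self._versions.setdefault(key, []).append(data)

    def delete(self, key):
        """Hide the object by appending a delete marker."""
        self._versions.setdefault(key, []).append(None)

    def get(self, key):
        """Return the current version, or raise KeyError if absent or deleted."""
        versions = self._versions.get(key, [])
        if not versions or versions[-1] is None:
            raise KeyError(key)
        return versions[-1]

    def restore_previous(self, key):
        """Drop the newest version or marker so the prior one becomes current."""
        versions = self._versions.get(key, [])
        if len(versions) < 2:
            raise KeyError(f"no earlier version of {key}")
        versions.pop()
        return self.get(key)


store = VersionedStore()
store.put("sales/2024.parquet", b"good data")
store.put("sales/2024.parquet", b"accidental overwrite")
print(store.restore_previous("sales/2024.parquet"))
```

Because restoring works by removing versions, the permission to remove versions is exactly the one that authorization controls should reserve for a small set of administrators.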
Wrapping Up
In the rush to pursue cloud data lake migration and modernization, security cannot be an afterthought – comprehensive and continual safeguards are imperative.
By following these best practices, or choosing solutions with end-to-end security built in, organizations can more confidently leverage the tremendous analytical benefits of cloud data lakes, while ensuring that their data remains protected.