
Taking the Chill Out of Selecting the Appropriate Iceberg Data Catalog

Read more about author Alex Merced.

Over the past few years, the industry has increasingly recognized the benefits of adopting a data lakehouse architecture. This approach lowers data infrastructure costs and reduces time-to-insight by consolidating more data workloads into a single source of truth on the organization’s data lake. This is made possible by data lakehouse table formats like Apache Iceberg, Apache Hudi, and Delta Lake, which enable database-like tables on a data lake. These formats support ACID (atomicity, consistency, isolation, durability) transactions, schema evolution, and other features that replicate the functionality of data warehouses without the restrictions of a walled garden. Even traditional data warehouse platforms have adapted their data processing tools to work with these tables on data lakes.

Among the three table formats, Apache Iceberg has experienced a surge in popularity, as many companies have fully embraced Apache Iceberg as the standard format for the data lakehouse. This momentum has been so significant that Databricks – the creator of Apache Iceberg’s main competitor, Delta Lake – acquired Tabular, a company founded by the initial creators of Apache Iceberg at Netflix, which offers an enterprise catalog service for Iceberg tables. This development, along with the announcement of the open-source catalog, Polaris, signals that Apache Iceberg has achieved overwhelming support, encouraging many companies to confidently build data lakehouses on Apache Iceberg.

This leads us to the next challenge in replicating the data warehouse experience on the data lakehouse: cataloging tables so they are governable and discoverable across different tools. This is where a new race is under way to establish the open standard at the catalog level for Apache Iceberg.

Iceberg Catalogs in Brief

Apache Iceberg catalogs differ from enterprise catalogs like Collibra. The former enables tools to discover tables for table portability, while the latter allows individuals to discover datasets, find context, and request access. If an organization uses Apache Spark or Upsolver for ingestion and then runs analytics on those Iceberg tables with another platform, the Iceberg catalog ensures all these tools can work consistently with the tables.
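To make the role of the catalog concrete, here is a minimal in-memory sketch, not any real catalog's implementation. It assumes the common Iceberg pattern in which the catalog maps a table identifier to the location of the table's current metadata file and only accepts a commit if the writer started from that current location. The class name, method names, and storage paths are all hypothetical.

```python
import threading


class CommitConflict(Exception):
    """Raised when another writer has already moved the table pointer."""


class ToyCatalog:
    """Illustrative stand-in for an Iceberg catalog (hypothetical, in-memory)."""

    def __init__(self):
        # table identifier -> location of the current metadata file
        self._pointers = {}
        self._lock = threading.Lock()

    def register(self, identifier, metadata_location):
        """Make a table discoverable under the given identifier."""
        with self._lock:
            if identifier in self._pointers:
                raise ValueError(f"{identifier} already exists")
            self._pointers[identifier] = metadata_location

    def load(self, identifier):
        """Return the metadata location every engine should read from."""
        with self._lock:
            return self._pointers[identifier]

    def commit(self, identifier, expected, new):
        """Swap the pointer only if the writer saw the latest state."""
        with self._lock:
            if self._pointers[identifier] != expected:
                raise CommitConflict(identifier)
            self._pointers[identifier] = new


if __name__ == "__main__":
    catalog = ToyCatalog()
    # Hypothetical paths: an ingestion engine creates and then updates the table.
    catalog.register("sales.orders", "s3://lake/orders/metadata/v1.json")
    catalog.commit(
        "sales.orders",
        expected="s3://lake/orders/metadata/v1.json",
        new="s3://lake/orders/metadata/v2.json",
    )
    # A separate analytics engine resolves the same name to the same state.
    print(catalog.load("sales.orders"))
```

Because every tool resolves the table name through the same pointer, an ingestion engine and an analytics engine see the same table state, and a writer working from stale state is rejected rather than silently overwriting newer data.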

In the past, support for each catalog had to be developed separately in every language with an Iceberg implementation (Java, Python, Rust, Go), resulting in inconsistent catalog support and posing a barrier to the “use the tools you want” paradigm of the data lakehouse. To address this, the Apache Iceberg project developed the “REST Catalog specification.” This OpenAPI specification establishes a standard for catalog services by outlining the necessary server endpoints. It allows catalog services to be written in any language and used by clients in any language, ensuring that any tool supporting the specification can work with all catalogs implementing it, thereby significantly reducing catalog interoperability concerns.
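As a rough illustration of why a shared endpoint layout matters, the sketch below builds the URLs a client would call for a few read operations. The paths follow my reading of the REST Catalog OpenAPI document (a configuration endpoint plus namespace and table routes under /v1, with an optional prefix) and should be checked against the version of the specification a given catalog implements. The function names, the example host, and the "prod" prefix are hypothetical, and joining multi-level namespaces with the 0x1F unit separator is assumed as the default behavior.

```python
from urllib.parse import quote

# Assumption: multi-level namespaces are joined with the unit separator (0x1F)
# before URL encoding, which is the default described in the specification.
NAMESPACE_SEPARATOR = "\x1f"


def encode_namespace(parts):
    """Encode namespace levels such as ["analytics", "sales"] for a URL path."""
    return quote(NAMESPACE_SEPARATOR.join(parts), safe="")


def rest_catalog_urls(base_url, namespace, table, prefix=""):
    """Return URLs for a few read-only REST Catalog operations."""
    root = base_url.rstrip("/") + "/v1"
    # Some servers scope routes under a prefix returned by the config endpoint.
    scoped = f"{root}/{prefix}" if prefix else root
    ns = encode_namespace(namespace)
    return {
        "config": f"{root}/config",
        "list_namespaces": f"{scoped}/namespaces",
        "list_tables": f"{scoped}/namespaces/{ns}/tables",
        "load_table": f"{scoped}/namespaces/{ns}/tables/{quote(table, safe='')}",
    }


if __name__ == "__main__":
    urls = rest_catalog_urls(
        "https://catalog.example.com/api",  # hypothetical host
        namespace=["analytics", "sales"],
        table="orders",
        prefix="prod",                      # hypothetical prefix
    )
    for name, url in urls.items():
        print(f"{name}: {url}")
```

Since the same routes apply regardless of which catalog sits behind the base URL, a client written once in any language can, in principle, be pointed at any conforming service by changing only the base URL and credentials.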

While the legacy data lake catalog standard, Hive, can still be used, there are four modern open-source catalog solutions designed with the data lakehouse in mind. These solutions are Nessie, Gravitino, Polaris, and Unity Catalog OSS (not to be confused with the proprietary Databricks Unity Catalog, which is a different code base). All of these catalogs aim to not only make the discoverability of tables portable but to ensure that governance rules and other metadata are portable, providing consistent experiences with tables across the spectrum of Iceberg-supporting tools.

Tips for Selecting a Catalog

Among the four modern open-source options, Nessie is the oldest and most mature as an open-source project. However, it is still early in terms of seeing what these other options will offer to the Apache Iceberg lakehouse. When it comes to evaluating which option to choose, a few key considerations will help determine the most appropriate choice:

  • Deployment: Assess what it takes to deploy and self-manage the catalog. Does the project offer Docker images or Kubernetes Helm charts to make deployments easy and manageable?
  • Documentation: Check if the documentation provides detailed instructions on using the catalog. Are there blogs, tutorials, and examples available to help the organization get started?
  • Security and Governance: Evaluate the features for securing tables. How granular are the rules to secure the catalog across different tools and users?
  • Scalability: Examine the distributed functionality of the catalog to handle large-scale, multi-region data lakehouse environments.
  • Unique Features: Identify any unique offerings. For example, Nessie offers git-like catalog versioning, enabling branching, merging, tagging, rolling back commits, and other git-like semantics within the tables.
  • REST Catalog Support: While all these catalogs support the REST Catalog specification, they may not support all operations, such as writing or registering a table via the specification.
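One way to keep an evaluation against the considerations above honest is a simple weighted scoring sheet. The sketch below is only an illustration: the weights are arbitrary placeholders to be replaced with the organization's own priorities, the scores are assumed to be on a 1-to-5 scale filled in after hands-on testing, and the candidate names are deliberately generic rather than assessments of any real catalog.

```python
# Placeholder weights (assumption): adjust to reflect internal priorities.
CRITERIA_WEIGHTS = {
    "deployment": 0.20,
    "documentation": 0.15,
    "security_governance": 0.25,
    "scalability": 0.20,
    "unique_features": 0.10,
    "rest_support": 0.10,
}


def weighted_score(scores, weights=CRITERIA_WEIGHTS):
    """Combine per-criterion scores (1-5) into one weighted average."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(weights[c] * scores[c] for c in weights) / total_weight


def rank(candidates, weights=CRITERIA_WEIGHTS):
    """Return (name, score) pairs sorted from best to worst."""
    scored = [
        (name, round(weighted_score(scores, weights), 2))
        for name, scores in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical scores for two unnamed candidates, not real evaluations.
    candidates = {
        "catalog_a": {
            "deployment": 4, "documentation": 3, "security_governance": 5,
            "scalability": 4, "unique_features": 2, "rest_support": 4,
        },
        "catalog_b": {
            "deployment": 3, "documentation": 5, "security_governance": 3,
            "scalability": 4, "unique_features": 5, "rest_support": 3,
        },
    }
    for name, score in rank(candidates):
        print(name, score)
```

The number that comes out matters less than the exercise of agreeing on the weights up front, which forces a team to decide, for example, whether governance granularity outweighs ease of deployment.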

With these aspects in mind, it’s best to get hands-on and try various catalogs to see which is best for your organization. Many of these catalogs also offer enterprise cloud-managed versions for those who prefer not to manage them independently. Additionally, thanks to the catalog migration tool available from the Nessie project, migrating between any of these catalogs is relatively straightforward.

Sizing Up Iceberg Options to Ensure Lakehouse Success 

Selecting the right Apache Iceberg catalog is crucial for maximizing the benefits of a data lakehouse architecture. The industry’s shift towards data lakehouses has underscored the importance of efficient, scalable, and flexible table management solutions. Apache Iceberg has emerged as the dominant table format, gaining widespread support from major companies and driving innovation in data cataloging.

When choosing an Iceberg catalog, consider factors such as deployment ease, comprehensive documentation, robust security and governance features, scalability, unique functionalities, and support for the REST Catalog specification. Nessie, Gravitino, Polaris, and Unity Catalog OSS each offer distinct advantages and capabilities, making it vital to evaluate them based on internal needs and infrastructure.

Take the time to thoroughly assess these criteria and experiment with different catalogs to determine the best fit for the organization. Whether managing the catalog in-house or opting for a cloud-managed service, the right catalog will enhance the portability, discoverability, and governance of the Iceberg tables. As the landscape evolves, staying informed and adaptable will ensure the data lakehouse remains efficient, secure, and future-proof.

Making an informed choice that aligns with the business’s long-term data strategy requires careful consideration and hands-on experimentation, but it is worth the effort. After all, taking the time to navigate the modern catalog paradigm with confidence is a key step toward ensuring the success of a data lakehouse effort.