A transactional catalog for data lakes with Git-like semantics, enabling version control and branching for data assets.
Project Nessie is a transactional catalog for data lakes that provides Git-like version control semantics for data assets. It enables data teams to branch, merge, and commit changes to tables and views, bringing software engineering workflows to data management. Nessie supports collaborative data development by letting multiple users work on isolated branches at the same time, so in-progress changes do not affect anyone else until they are merged.
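The branch, commit, and merge workflow described above can be illustrated with a small in-memory model. This is a conceptual sketch only, not Nessie's API: `ToyCatalog`, `Commit`, and `MergeConflict` are hypothetical names, and the conflict rule (reject a merge when both branches changed the same table since they diverged) is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Dict, Optional


class MergeConflict(Exception):
    """Raised when both branches changed the same table since they diverged."""


# eq=False keeps identity-based equality and hashing, so commits can live in a set.
@dataclass(frozen=True, eq=False)
class Commit:
    """An immutable catalog snapshot: table name -> metadata pointer (hypothetical)."""
    tables: Dict[str, str]
    message: str
    parent: Optional["Commit"] = None


class ToyCatalog:
    """Toy model of a versioned catalog; not related to Nessie's real client."""

    def __init__(self) -> None:
        self.branches: Dict[str, Commit] = {"main": Commit({}, "root")}

    def create_branch(self, name: str, source: str = "main") -> None:
        # A new branch is just another name pointing at an existing commit.
        self.branches[name] = self.branches[source]

    def commit(self, branch: str, changes: Dict[str, str], message: str) -> None:
        # Each commit records a full snapshot and moves only this branch's head.
        head = self.branches[branch]
        self.branches[branch] = Commit({**head.tables, **changes}, message, head)

    def _common_ancestor(self, a: str, b: str) -> Commit:
        seen = set()
        node = self.branches[a]
        while node is not None:
            seen.add(node)
            node = node.parent
        node = self.branches[b]
        while node not in seen:  # every branch shares the root commit
            node = node.parent
        return node

    def merge(self, source: str, target: str) -> None:
        base = self._common_ancestor(source, target)
        src, dst = self.branches[source], self.branches[target]
        src_changed = {t for t, v in src.tables.items() if base.tables.get(t) != v}
        dst_changed = {t for t, v in dst.tables.items() if base.tables.get(t) != v}
        clashes = src_changed & dst_changed
        if clashes:
            raise MergeConflict(f"changed on both branches: {sorted(clashes)}")
        merged = {**dst.tables, **{t: src.tables[t] for t in src_changed}}
        self.branches[target] = Commit(merged, f"merge {source} into {target}", dst)


catalog = ToyCatalog()
catalog.commit("main", {"sales": "metadata-v1"}, "add sales table")
catalog.create_branch("etl")
catalog.commit("etl", {"sales": "metadata-v2"}, "rewrite sales table")
# Until the merge, readers of "main" still see metadata-v1.
catalog.merge("etl", "main")
```

The model leaves out table deletion and merge commits with two parents. Its purpose is to show that a branch is only a movable pointer to a snapshot, which is why work on one branch stays invisible to readers of another until a merge.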
Nessie is aimed at data engineers and data platform teams working with data lakes who need version control, branching, and collaboration capabilities for their data assets. It's particularly useful for organizations using Apache Iceberg and tools like Spark, Hive, Flink, Presto, or Trino.
Developers choose Nessie because it brings familiar Git workflows to data management, enabling collaborative data development with proper version control. Its integration with Apache Iceberg and wide tool support make it a practical solution for modern data lakes without requiring proprietary platforms.
Nessie: Transactional Catalog for Data Lakes with Git-like semantics
Enables branching, merging, and committing for data tables, similar to software development, so teams can work on data assets in isolation and combine their changes through explicit merges.
Supports a wide range of data tools including Spark, Hive, Flink, Presto, and Trino, and publishes a compatibility table listing the supported versions of each.
Can be self-hosted via Docker or as a standalone service, providing control over deployment and configuration without vendor lock-in, as shown in the deployment instructions.
Ensures ACID transactions for data operations in distributed data lakes, offering reliability and data integrity for collaborative environments.
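As an illustration of the Spark integration mentioned above, the sketch below assembles the catalog properties typically used to point a Spark session at a Nessie server through Iceberg. The property names and class names follow the Nessie and Iceberg Spark documentation as best recalled and may differ between versions, so check them against the compatibility table. The helper `nessie_spark_conf`, the server URI, and the warehouse path are assumptions made for the example.

```python
from typing import Dict


def nessie_spark_conf(uri: str, ref: str, warehouse: str,
                      catalog: str = "nessie") -> Dict[str, str]:
    """Build Spark properties for an Iceberg catalog backed by Nessie.

    Hypothetical helper: it only assembles a dict and does not start Spark.
    """
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        # SQL extensions for Iceberg and for Nessie's branch/merge statements.
        "spark.sql.extensions": ",".join([
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
            "org.projectnessie.spark.extensions.NessieSparkSessionExtensions",
        ]),
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.nessie.NessieCatalog",
        f"{prefix}.uri": uri,              # Nessie REST endpoint
        f"{prefix}.ref": ref,              # branch or tag the session starts on
        f"{prefix}.warehouse": warehouse,  # where table data and metadata are written
    }


# Assumed local setup: adjust the URI and path to your own deployment.
conf = nessie_spark_conf(
    uri="http://localhost:19120/api/v1",
    ref="main",
    warehouse="/tmp/nessie-warehouse",
)
```

Each key and value can then be passed to `SparkSession.builder.config(key, value)`. The matching Iceberg and Nessie runtime jars must also be on the classpath, and their versions are deliberately not filled in here.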
Enabling authentication requires configuring OpenID Connect and Quarkus properties, which can be intricate and error-prone for teams unfamiliar with these technologies, as noted in the README.
Primarily focused on Apache Iceberg tables; if your data lake uses other formats, integration is limited, potentially requiring additional work or compatibility layers.
The compatibility table shows specific versions for tools like Spark and Iceberg, necessitating careful version management and updates to avoid breaking changes or conflicts.