An open-source tool that transforms object storage into a Git-like repository for versioned, atomic, and repeatable data lake operations.
lakeFS is an open-source data version control system that applies Git-like operations to data lakes. It turns object storage into a versioned repository, so teams can branch, commit, merge, and roll back data changes. This addresses common data management problems: reproducing past results, testing changes safely, and maintaining data quality in production pipelines.
lakeFS is aimed at data engineers, data scientists, and platform teams who build and maintain data lakes on cloud object storage (AWS S3, Azure Blob Storage, GCS). It is particularly valuable for organizations that need reproducible data pipelines, isolated testing environments, and robust data governance.
Developers choose lakeFS because it brings the proven workflows of Git to data management, allowing atomic and versioned operations without copying data. Because it exposes an S3-compatible API and integrates with existing data frameworks, teams can adopt it without overhauling their stack while improving data reliability and collaboration.
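To illustrate why S3 compatibility keeps adoption cheap, here is a minimal sketch of how an object path might be addressed through an S3-compatible endpoint. It assumes the convention that the repository takes the place of the bucket and the branch (or other ref) is the first component of the key. The `LakeFSLocation` type and `on_branch` helper are hypothetical names made up for this example and are not part of any lakeFS SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LakeFSLocation:
    """Hypothetical helper type for illustration; not a lakeFS SDK class."""

    repository: str  # plays the role of the S3 bucket (assumed convention)
    ref: str         # branch name, tag, or commit ID
    path: str        # object path inside the repository

    def s3_bucket(self) -> str:
        return self.repository

    def s3_key(self) -> str:
        # The ref becomes the first component of the object key.
        return f"{self.ref}/{self.path.lstrip('/')}"

    def s3_uri(self, scheme: str = "s3") -> str:
        return f"{scheme}://{self.s3_bucket()}/{self.s3_key()}"


def on_branch(location: LakeFSLocation, branch: str) -> LakeFSLocation:
    """Return the same object path as seen from a different branch."""
    return LakeFSLocation(location.repository, branch, location.path)


# Example names below are placeholders, not real repositories.
prod = LakeFSLocation("example-repo", "main", "events/2024/part-0.parquet")
test = on_branch(prod, "etl-test")

print(prod.s3_uri())
print(test.s3_uri("s3a"))
```

Under this assumption, pointing an existing job at a test branch means changing only the path prefix, not the client library or the job logic.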
Enables branching, committing, merging, and rolling back data changes directly on object storage, much like version control for code and without duplicating data. The README lists these among its core features.
Works with AWS S3, Azure Blob Storage, and Google Cloud Storage, allowing deployment across major cloud providers, making it versatile for hybrid or multi-cloud data lakes.
API compatible with S3 and integrates with data frameworks like Spark, Hive, AWS Athena, and Presto, minimizing integration effort and fitting into existing data stacks.
Allows creation of development and testing branches that expose a complete view of production data without physically copying it, enabling safe ETL testing, as emphasized in the README's 'Why Do I Need lakeFS?' section.
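The idea behind branching without duplication can be sketched with a toy in-memory model. This is not lakeFS code or its API: `ToyRepo` and all of its methods are hypothetical, and the merge shown is a naive "source wins" overwrite rather than a real conflict-aware merge. The sketch only shows that a branch can be a pointer to a commit, and a commit a mapping from paths to stored objects, so creating a branch copies no data.

```python
import hashlib


class ToyRepo:
    """Toy model (hypothetical): branches point to commits, commits map paths to object IDs."""

    def __init__(self):
        self.objects = {}           # object_id -> bytes, each payload stored once
        self.commits = {0: {}}      # commit_id -> {path: object_id}
        self.branches = {"main": 0}  # branch name -> commit_id
        self.staged = {"main": {}}   # branch name -> uncommitted {path: object_id}

    def put(self, branch, path, data):
        object_id = hashlib.sha256(data).hexdigest()
        self.objects.setdefault(object_id, data)
        self.staged[branch][path] = object_id

    def _new_commit(self, branch, snapshot):
        commit_id = len(self.commits)
        self.commits[commit_id] = snapshot  # metadata only, no payload copy
        self.branches[branch] = commit_id
        self.staged[branch] = {}
        return commit_id

    def commit(self, branch):
        base = self.commits[self.branches[branch]]
        return self._new_commit(branch, {**base, **self.staged[branch]})

    def create_branch(self, name, source):
        self.branches[name] = self.branches[source]  # pointer copy only
        self.staged[name] = {}

    def merge(self, source, dest):
        # Naive merge: paths from the source branch overwrite the destination.
        src = self.commits[self.branches[source]]
        dst = self.commits[self.branches[dest]]
        return self._new_commit(dest, {**dst, **src})

    def rollback(self, branch, commit_id):
        self.branches[branch] = commit_id
        self.staged[branch] = {}

    def read(self, branch, path):
        view = {**self.commits[self.branches[branch]], **self.staged[branch]}
        return self.objects[view[path]]


repo = ToyRepo()
repo.put("main", "users.csv", b"id\n1\n")
first = repo.commit("main")

repo.create_branch("etl-test", "main")
objects_after_branch = len(repo.objects)  # still one stored object

repo.put("etl-test", "users.csv", b"id\n1\n2\n")
repo.commit("etl-test")
main_during_test = repo.read("main", "users.csv")  # main is unaffected

repo.merge("etl-test", "main")
main_after_merge = repo.read("main", "users.csv")

repo.rollback("main", first)
main_after_rollback = repo.read("main", "users.csv")
```

In this model the test branch sees all of production's data the moment it is created, production stays isolated until the merge, and a rollback is just moving the branch pointer back to an earlier commit.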
Requires running and maintaining a lakeFS server, which adds complexity compared to native object storage management, especially for production deployments beyond the quickstart.
Primarily designed for batch data processing and versioning; not optimized for real-time streaming or transactional data updates, which may limit use cases requiring immediate consistency.
Tied to the supported cloud storage services (S3, Azure Blob Storage, GCS), so it may not suit organizations that rely on legacy on-premises file systems or incompatible storage backends.