A CLI tool that applies Git-like version control to cloud storage, enabling distributed, decentralized, and deduplicated data repositories.
s3git is a command-line tool that applies Git-like version control to cloud storage, enabling distributed and decentralized data repositories. It solves the problem of managing huge datasets (up to petabytes) with versioning, deduplication, and the ability to work locally on SSDs while storing data in S3-compatible backends. Unlike Git, it's optimized for large binary files and scales to hundreds of millions of objects.
Developers, DevOps engineers, and data engineers who need to version, manage, and collaborate on large binary datasets, such as build artifacts, multimedia files, analytics data, or release binaries, using cloud storage backends.
Developers choose s3git for its seamless integration with cloud storage, infinite scalability, and familiar Git workflow, making it the go-to open-source solution for versioning massive datasets without server overhead. Its deduplication and directory snapshotting provide efficient storage and reliable rollbacks.
s3git: git for Cloud Storage. Distributed Version Control for Data. Create decentralized and versioned repos that scale infinitely to 100s of millions of files. Clone huge PB-scale repos on your local SSD to make changes, commit and push back. Oh yeah, it dedupes too and offers directory versioning.
Uses CLI commands that mirror Git's (init, add, commit, etc.), so developers can adopt it without retraining; as the README puts it, 'If you know git, you will know how to use s3git!'
Supports repositories with hundreds of millions of files and petabytes of data, leveraging cloud storage backends like S3 for infinite growth, as highlighted in the key features.
Enables cloning and working on huge repositories locally with minimal disk space, as demonstrated by cloning an 11.5 TB dataset using only 7 GB on an SSD.
Automatically avoids storing duplicate data, optimizing storage usage across versions and commits, which is a core feature for managing large datasets.
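To make the deduplication idea concrete, here is a minimal sketch of a content-addressed store in Python. It is an illustration of the general technique, not s3git's implementation: the `ContentStore` class, its methods, and the choice of `hashlib.blake2b` as the hash are all assumptions made for this example.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store (hypothetical, for illustration only).

    Each blob is keyed by the hash of its bytes, so identical content
    committed under different names or versions is stored only once.
    """

    def __init__(self):
        self.blobs = {}    # digest -> raw bytes
        self.commits = []  # each commit maps file name -> digest

    def add(self, data: bytes) -> str:
        # Hash choice is an assumption for this sketch.
        digest = hashlib.blake2b(data).hexdigest()
        # setdefault keeps the existing blob if the digest is already known.
        self.blobs.setdefault(digest, data)
        return digest

    def commit(self, files: dict) -> dict:
        # A commit is just a snapshot of name -> digest references.
        snapshot = {name: self.add(data) for name, data in files.items()}
        self.commits.append(snapshot)
        return snapshot

    def stored_bytes(self) -> int:
        return sum(len(blob) for blob in self.blobs.values())


store = ContentStore()
v1 = store.commit({"a.bin": b"x" * 1000, "b.bin": b"y" * 1000})
v2 = store.commit({"a.bin": b"x" * 1000, "b.bin": b"z" * 1000})
print(len(store.blobs), store.stored_bytes())
```

Two commits reference four files in total, but the unchanged `a.bin` is stored once, so only three blobs are kept. Both versions remain fully recoverable because each commit holds its own name-to-digest mapping.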
As admitted in the FAQ, encryption must be handled manually by piping data through external tools like openssl, adding complexity and potential security gaps for sensitive workflows.
Binaries are labeled as pre-release with a disclaimer to 'use at your own peril,' making it risky for production environments and indicating potential instability.
Lacks FUSE support, so repositories cannot be mounted as a POSIX-compliant file system or used directly by standard file-based applications; the FAQ describes this as a deliberate choice to avoid complexity.