A Terraform plugin for managing machine learning compute resources across AWS, GCP, Azure, and Kubernetes with spot instance recovery and auto-termination.
Terraform Provider Iterative (TPI) is a Terraform plugin for managing machine learning compute resources across multiple cloud providers. It automates the provisioning, recovery, and termination of instances, including spot instances and GPU machines, and provides a unified abstraction layer to avoid vendor lock-in. Instead of managing each cloud's infrastructure by hand, users describe an ML workload once in a declarative Terraform configuration.
TPI is aimed at data scientists, machine learning engineers, and DevOps teams who need to run reproducible ML experiments and training jobs in the cloud without deep cloud expertise. It also suits teams looking to reduce costs and simplify multi-cloud deployments.
Developers choose TPI because it eliminates the need for custom scripts or external orchestrators, reduces infrastructure costs through spot instance recovery and auto-termination, and provides a consistent CLI experience across AWS, GCP, Azure, and Kubernetes. Its design as a lightweight CLI tool with no control plane overhead sets it apart from traditional cloud orchestrators.
Automatically recovers interrupted spot/preemptible instances with transparent data checkpointing and respawning, significantly reducing costs for long-running ML tasks.
Switches between AWS, GCP, Azure, and Kubernetes with a single line change in configuration, preventing vendor lock-in and simplifying multi-cloud strategies.
Auto-terminates instances upon task completion and removes storage after downloading results, ensuring users only pay for actual resource usage without manual cleanup.
Offers one-command data sync and code execution with no external server, making cloud resources feel like a local laptop and reducing operational overhead.
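The features above can be pictured with a minimal configuration sketch. It follows the `iterative_task` resource schema as shown in the project's README, but the attribute names, accepted values, and defaults are assumptions here and should be checked against the current TPI documentation. The script body is a placeholder, not a real training job.

```hcl
terraform {
  required_providers {
    # Provider source as published on the Terraform Registry (assumed).
    iterative = { source = "iterative/iterative" }
  }
}

provider "iterative" {}

resource "iterative_task" "example" {
  # Changing this one value (assumed options: "aws", "gcp", "az", "k8s")
  # is the "single line change" that switches the backend.
  cloud   = "aws"

  # Generic machine size (assumed: "m" for medium), mapped per cloud by TPI.
  machine = "m"

  # Assumed semantics: 0 requests a spot instance at automatic pricing,
  # a negative value disables spot, a positive value is an hourly price cap.
  spot    = 0

  storage {
    workdir = "."       # local directory uploaded to the instance
    output  = "results" # directory (relative to workdir) downloaded afterwards
  }

  # Placeholder workload; the instance is expected to terminate when it exits.
  script = <<-END
    #!/bin/bash
    mkdir -p results
    echo "training placeholder" > results/log.txt
  END
}
```

With a file like this in place, the standard Terraform workflow (`terraform init`, then `terraform apply`, and `terraform destroy` to clean up) would drive the task. How results are synced back at each step depends on the TPI version in use.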
The README notes future plans for more native multi-instance training support, indicating current limitations for complex, distributed ML workloads.
Requires Terraform to be installed and some familiarity with it, adding setup steps and a learning curve for teams not already using infrastructure-as-code tools.
Lacks tight integration with tools like DVC and advanced monitoring features, both of which the README lists as future plans, so full ML pipelines may require additional tooling.