How to set up TPI with AWS credentials?

Expose your AWS credentials through environment variables as described in the docs, set cloud = "aws" in the iterative_task resource of your Terraform configuration, and run terraform init. Terraform must be installed, and the credentials must grant access to the target AWS account.
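A quick preflight check can catch a missing variable before terraform init fails. The sketch below is only an illustration: it assumes static access keys exposed through the standard AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY variables, and the helper names are hypothetical. Other authentication methods (profiles, session tokens) are not covered, and the TPI docs remain the authority on which variables the provider reads.

```python
import os
import shutil

# Standard AWS credential variables (assumption: static access keys are used).
REQUIRED_AWS_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_aws_vars(env=None):
    """Return the required AWS variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_AWS_VARS if not env.get(name)]


def terraform_on_path():
    """Return True if a `terraform` executable is found on PATH."""
    return shutil.which("terraform") is not None


if __name__ == "__main__":
    missing = missing_aws_vars()
    if missing:
        print("Missing AWS variables:", ", ".join(missing))
    if not terraform_on_path():
        print("terraform was not found on PATH")
    if not missing and terraform_on_path():
        print("Environment looks ready for `terraform init`")
```

Passing a plain dictionary as env makes the check easy to exercise without touching the real environment.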

TPI vs. Kubernetes for machine learning workloads?

TPI abstracts Kubernetes and other clouds for simplicity and cost control, ideal for quick setups. Use native Kubernetes for more fine-grained control and complex orchestrations, but it requires deeper expertise.

Does TPI support distributed training across multiple instances?

Currently, TPI has limited native support for distributed training; the README mentions it as a future plan. For now, you might need to handle distribution within your script or use additional tools.

How to monitor running ML tasks with TPI?

Use terraform refresh and terraform show to query status and view logs synced to persistent storage. However, real-time monitoring and interactive debugging are limited compared to dedicated cloud services.
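Since there is no live dashboard, one workaround is to poll those two commands on a timer. The following is a minimal sketch, assuming it runs from the directory that holds the Terraform configuration and state. The function names, the polling interval, and the injectable run and sleep parameters are all hypothetical conveniences, not part of TPI.

```python
import subprocess
import time


def query_task(run=subprocess.run):
    """Run `terraform refresh`, then `terraform show`; return the show output."""
    run(["terraform", "refresh"], check=True, capture_output=True, text=True)
    shown = run(["terraform", "show"], check=True, capture_output=True, text=True)
    return shown.stdout


def watch(interval_seconds=30, rounds=3, run=subprocess.run, sleep=time.sleep):
    """Poll a fixed number of times and return each snapshot of the output."""
    snapshots = []
    for i in range(rounds):
        snapshots.append(query_task(run=run))
        if i < rounds - 1:
            sleep(interval_seconds)  # wait before the next poll
    return snapshots


if __name__ == "__main__":
    for snapshot in watch():
        print(snapshot)
```

Polling like this only reflects what the last refresh synced, so it complements rather than replaces real-time monitoring.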

Can TPI run Jupyter notebooks in the cloud?

Yes, example projects show how to run Jupyter with TPI by managing the infrastructure. It sets up cloud instances, but you'll need to configure Jupyter within your script for access.

How does TPI handle spot instance interruptions?

TPI uses cloud-native auto-scaling groups to respawn interrupted instances and persistent storage to preserve synced data, so a task can resume from its last saved state without manual intervention.
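Resuming still depends on the task's own code reading back whatever state it saved. The sketch below shows one way a training script could do that, assuming the script records its progress in a directory that is synced to persistent storage. The results directory, the epoch.txt marker file, and the stop_after parameter (used here only to simulate an interruption) are hypothetical names chosen for illustration.

```python
from pathlib import Path


def train(total_epochs, output_dir="results", stop_after=None):
    """Run epochs, resuming after the last epoch recorded in output_dir.

    stop_after simulates an interruption after that many epochs in this run.
    Returns the epoch this run started from.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    marker = out / "epoch.txt"
    # Resume from the epoch after the last completed one, if recorded.
    start = int(marker.read_text().strip()) + 1 if marker.exists() else 1
    done_this_run = 0
    for epoch in range(start, total_epochs + 1):
        if stop_after is not None and done_this_run >= stop_after:
            break  # simulated interruption
        # ... one epoch of real training would go here ...
        marker.write_text(f"{epoch}\n")  # record progress after each epoch
        done_this_run += 1
    return start


if __name__ == "__main__":
    first_epoch = train(total_epochs=10)
    print(f"(re)started from epoch {first_epoch}")
```

Writing the marker after every epoch keeps the amount of repeated work after an interruption to at most one epoch.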

TPI or custom scripts for ML infrastructure?

TPI reduces boilerplate with codified configurations and multi-cloud support, saving time. Custom scripts offer more flexibility but require ongoing maintenance and cloud expertise.

terraform-provider-iterative

Apache-2.0 · Go · v0.11.20

A Terraform plugin for managing machine learning compute resources across AWS, GCP, Azure, and Kubernetes with spot instance recovery and auto-termination.

295 stars · 30 forks · 0 contributors

What is terraform-provider-iterative?

Terraform Provider Iterative (TPI) is a Terraform plugin built for managing machine learning compute resources across multiple cloud providers. It automates the provisioning, recovery, and termination of instances, including spot instances and GPUs, while providing a unified abstraction layer to avoid vendor lock-in. TPI solves the problem of complex cloud infrastructure management for ML workloads by offering a simple, codified approach.

Target Audience

Data scientists, machine learning engineers, and DevOps teams who need to run reproducible ML experiments and training jobs in the cloud without deep cloud expertise. It is also suitable for teams looking to reduce costs and simplify multi-cloud deployments.

Value Proposition

Developers choose TPI because it eliminates the need for custom scripts or external orchestrators, reduces infrastructure costs through spot instance recovery and auto-termination, and provides a consistent CLI experience across AWS, GCP, Azure, and Kubernetes. Its design as a lightweight CLI tool with no control plane overhead sets it apart from traditional cloud orchestrators.

Overview

☁️ Terraform plugin for machine learning workloads: spot instance recovery & auto-termination | AWS, GCP, Azure, Kubernetes

Use Cases

Best For

Running machine learning training jobs on spot instances with automatic recovery
Managing GPU resources across multiple cloud providers with a single configuration
Automating infrastructure cleanup to avoid unused resource costs
Simplifying cloud deployments for data science teams without DevOps expertise
Creating reproducible ML environments codified in Terraform configuration files
Integrating cloud compute into CI/CD pipelines for ML model delivery

Not Ideal For

Projects requiring real-time, interactive cloud instance management and debugging
Teams deeply invested in a single cloud provider's native ML services and tooling
Simple ad-hoc scripts where infrastructure as code adds unnecessary complexity
Environments where Terraform is not already adopted or is considered overkill

Pros & Cons

Pros

Spot Instance Recovery

Automatically recovers interrupted spot/preemptible instances with transparent data checkpointing and respawning, significantly reducing costs for long-running ML tasks.

Multi-Cloud Abstraction

Switches between AWS, GCP, Azure, and Kubernetes with a single line change in configuration, preventing vendor lock-in and simplifying multi-cloud strategies.

Cost Optimization

Auto-terminates instances upon task completion and removes storage after downloading results, ensuring users only pay for actual resource usage without manual cleanup.

Developer-First CLI

Offers one-command data sync and code execution with no external server, making cloud resources feel like a local laptop and reducing operational overhead.

Cons

Limited Distributed Training

The README notes future plans for more native multi-instance training support, indicating current limitations for complex, distributed ML workloads.

Terraform Dependency Complexity

Requires Terraform installation and familiarity, adding setup steps and learning curve for teams not already using infrastructure as code tools.

Ecosystem Integration Gaps

Lacks tight integration with tools like DVC and has no advanced monitoring features; the README lists both as future plans, so full ML pipelines may need additional tooling.

Frequently Asked Questions
