Infrastructure as Code for ML: Leveraging HashiCorp Terraform in Kled.io
Alex Chen
March 5, 2025
<h2>Introduction</h2>
<p>Machine learning infrastructure is complex: compute, storage, and serving resources often have to be orchestrated across multiple environments. The rise of ML operations (MLOps) highlights the need for consistent, reproducible, and scalable infrastructure. In the Kled.io platform, we've integrated HashiCorp Terraform to address these challenges, giving ML engineers the tools to manage infrastructure as code.</p>
<p>This article explores how Terraform integration in Kled.io enables teams to automate provisioning of ML environments, ensuring consistency between development, testing, and production deployments.</p>
<h2>The Infrastructure Challenge in ML</h2>
<p>Machine learning workflows require specialized infrastructure components:</p>
<ul>
<li>GPU-enabled compute instances for training</li>
<li>Large-scale storage systems for datasets</li>
<li>Database systems for feature stores</li>
<li>Model serving infrastructure for inference</li>
<li>Monitoring and logging systems</li>
</ul>
<p>Managing these resources manually creates several problems:</p>
<ol>
<li><strong>Inconsistency</strong>: Environment differences between development and production</li>
<li><strong>Knowledge silos</strong>: Infrastructure configuration known only to specific team members</li>
<li><strong>Scaling challenges</strong>: Difficulty replicating infrastructure across regions or for larger workloads</li>
<li><strong>Slow iteration</strong>: Time-consuming provisioning processes hampering experimentation</li>
</ol>
<h2>Infrastructure as Code with Terraform</h2>
<p>HashiCorp Terraform addresses these challenges by enabling infrastructure as code (IaC), allowing teams to define infrastructure in configuration files that can be versioned, shared, and reused.</p>
<h3>Key Benefits</h3>
<ul>
<li><strong>Declarative configuration</strong>: Define the desired state rather than the steps to get there</li>
<li><strong>Version-controlled infrastructure</strong>: Track changes, understand history, and roll back when needed</li>
<li><strong>Multi-provider support</strong>: Manage resources across AWS, Azure, GCP, and on-premises environments</li>
<li><strong>Reusable modules</strong>: Create and share templates for common infrastructure patterns</li>
</ul>
<h2>Terraform Integration in Kled.io</h2>
<p>Kled.io's Terraform integration provides ML teams with:</p>
<h3>1. Pre-configured Templates</h3>
<p>Kled.io offers ready-to-use Terraform templates for common ML infrastructure patterns:</p>
<figure data-rehype-pretty-code-figure=""><pre tabindex="0" data-language="hcl" data-theme="min-light min-dark"><code data-language="hcl" data-theme="min-light min-dark" style="display: grid;"><span data-line=""><span style="--shiki-light:#6F42C1;--shiki-dark:#B392F0">module</span><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0"> "ml_training_cluster" {</span></span>
<span data-line=""><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0"> source </span><span style="--shiki-light:#D32F2F;--shiki-dark:#F97583">=</span><span style="--shiki-light:#22863A;--shiki-dark:#FFAB70"> "kled/ml-training-cluster/aws"</span></span>
<span data-line=""><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0"> </span></span>
<span data-line=""><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0"> instance_type </span><span style="--shiki-light:#D32F2F;--shiki-dark:#F97583">=</span><span style="--shiki-light:#22863A;--shiki-dark:#FFAB70"> "p3.8xlarge"</span></span>
<span data-line=""><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0"> node_count </span><span style="--shiki-light:#D32F2F;--shiki-dark:#F97583">=</span><span style="--shiki-light:#1976D2;--shiki-dark:#F8F8F8"> 4</span></span>
<span data-line=""><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0"> vpc_id </span><span style="--shiki-light:#D32F2F;--shiki-dark:#F97583">=</span><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0"> var.vpc_id</span></span>
<span data-line=""><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0"> subnet_ids </span><span style="--shiki-light:#D32F2F;--shiki-dark:#F97583">=</span><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0"> var.subnet_ids</span></span>
<span data-line=""><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0"> </span></span>
<span data-line=""><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0"> storage_config </span><span style="--shiki-light:#D32F2F;--shiki-dark:#F97583">=</span><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0"> {</span></span>
<span data-line=""><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0"> dataset_volume_size </span><span style="--shiki-light:#D32F2F;--shiki-dark:#F97583">=</span><span style="--shiki-light:#1976D2;--shiki-dark:#F8F8F8"> 500</span></span>
<span data-line=""><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0"> model_volume_size </span><span style="--shiki-light:#D32F2F;--shiki-dark:#F97583">=</span><span style="--shiki-light:#1976D2;--shiki-dark:#F8F8F8"> 200</span></span>
<span data-line=""><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0"> }</span></span>
<span data-line=""><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0">}</span></span></code></pre></figure>
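<p>The module call above reads <code>var.vpc_id</code> and <code>var.subnet_ids</code>, so the calling configuration has to declare them. A minimal sketch of those declarations follows; the descriptions and the validation rule are illustrative assumptions, not part of the Kled.io template:</p>
<figure data-rehype-pretty-code-figure=""><pre tabindex="0" data-language="hcl" data-theme="min-light min-dark"><code data-language="hcl" data-theme="min-light min-dark" style="display: grid;"><span data-line=""># Input variables referenced by the ml_training_cluster module call.</span>
<span data-line="">variable "vpc_id" {</span>
<span data-line="">  description = "ID of the VPC that hosts the training cluster"</span>
<span data-line="">  type        = string</span>
<span data-line="">}</span>
<span data-line=""> </span>
<span data-line="">variable "subnet_ids" {</span>
<span data-line="">  description = "Subnets the training nodes are launched into"</span>
<span data-line="">  type        = list(string)</span>
<span data-line=""> </span>
<span data-line="">  # Assumed rule: fail at plan time if no subnet is supplied.</span>
<span data-line="">  validation {</span>
<span data-line="">    condition     = length(var.subnet_ids) &gt; 0</span>
<span data-line="">    error_message = "Provide at least one subnet ID."</span>
<span data-line="">  }</span>
<span data-line="">}</span></code></pre></figure>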
<h3>2. Environment Parity</h3>
<p>The same Terraform configurations can be used across development, staging, and production environments, with environment-specific values managed through Terraform workspaces or variable files (<code>.tfvars</code>):</p>
<figure data-rehype-pretty-code-figure=""><pre tabindex="0" data-language="hcl" data-theme="min-light min-dark"><code data-language="hcl" data-theme="min-light min-dark" style="display: grid;"><span data-line=""><span style="--shiki-light:#C2C3C5;--shiki-dark:#6B737C"># dev.tfvars</span></span>
<span data-line=""><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0">instance_type </span><span style="--shiki-light:#D32F2F;--shiki-dark:#F97583">=</span><span style="--shiki-light:#22863A;--shiki-dark:#FFAB70"> "p3.2xlarge"</span></span>
<span data-line=""><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0">node_count </span><span style="--shiki-light:#D32F2F;--shiki-dark:#F97583">=</span><span style="--shiki-light:#1976D2;--shiki-dark:#F8F8F8"> 1</span></span>
<span data-line=""> </span>
<span data-line=""><span style="--shiki-light:#C2C3C5;--shiki-dark:#6B737C"># prod.tfvars</span></span>
<span data-line=""><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0">instance_type </span><span style="--shiki-light:#D32F2F;--shiki-dark:#F97583">=</span><span style="--shiki-light:#22863A;--shiki-dark:#FFAB70"> "p3.8xlarge"</span></span>
<span data-line=""><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0">node_count </span><span style="--shiki-light:#D32F2F;--shiki-dark:#F97583">=</span><span style="--shiki-light:#1976D2;--shiki-dark:#F8F8F8"> 4</span></span></code></pre></figure>
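<p>For these files to take effect, the shared configuration needs matching variable declarations. The sketch below shows one way to declare them; the allowed instance types and the node limits are assumptions chosen to match the values in this article, not Kled.io defaults:</p>
<figure data-rehype-pretty-code-figure=""><pre tabindex="0" data-language="hcl" data-theme="min-light min-dark"><code data-language="hcl" data-theme="min-light min-dark" style="display: grid;"><span data-line=""># variables.tf (shared by every environment)</span>
<span data-line="">variable "instance_type" {</span>
<span data-line="">  description = "EC2 instance type for training nodes"</span>
<span data-line="">  type        = string</span>
<span data-line=""> </span>
<span data-line="">  # Assumed allow-list: only the two types used in dev.tfvars and prod.tfvars.</span>
<span data-line="">  validation {</span>
<span data-line="">    condition     = contains(["p3.2xlarge", "p3.8xlarge"], var.instance_type)</span>
<span data-line="">    error_message = "The instance_type must be p3.2xlarge or p3.8xlarge."</span>
<span data-line="">  }</span>
<span data-line="">}</span>
<span data-line=""> </span>
<span data-line="">variable "node_count" {</span>
<span data-line="">  description = "Number of training nodes"</span>
<span data-line="">  type        = number</span>
<span data-line=""> </span>
<span data-line="">  # Assumed bounds: an illustrative guard against accidental over-provisioning.</span>
<span data-line="">  validation {</span>
<span data-line="">    condition     = var.node_count &gt;= 1 &amp;&amp; var.node_count &lt;= 8</span>
<span data-line="">    error_message = "The node_count must be between 1 and 8."</span>
<span data-line="">  }</span>
<span data-line="">}</span></code></pre></figure>
<p>An environment is then selected at plan time, for example with <code>terraform plan -var-file=dev.tfvars</code>.</p>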
<h3>3. Remote State Management</h3>
<p>Kled.io stores Terraform state remotely and securely, so the whole team works from a single shared state and concurrent runs do not overwrite each other's changes:</p>
<figure data-rehype-pretty-code-figure=""><pre tabindex="0" data-language="hcl" data-theme="min-light min-dark"><code data-language="hcl" data-theme="min-light min-dark" style="display: grid;"><span data-line=""><span style="--shiki-light:#6F42C1;--shiki-dark:#B392F0">terraform</span><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0"> {</span></span>
<span data-line=""><span style="--shiki-light:#6F42C1;--shiki-dark:#B392F0"> backend</span><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0"> "remote" {</span></span>
<span data-line=""><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0"> organization </span><span style="--shiki-light:#D32F2F;--shiki-dark:#F97583">=</span><span style="--shiki-light:#22863A;--shiki-dark:#FFAB70"> "kled-io"</span></span>
<span data-line=""><span style="--shiki-light:#6F42C1;--shiki-dark:#B392F0"> workspaces</span><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0"> {</span></span>
<span data-line=""><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0"> name </span><span style="--shiki-light:#D32F2F;--shiki-dark:#F97583">=</span><span style="--shiki-light:#22863A;--shiki-dark:#FFAB70"> "ml-project-prod"</span></span>
<span data-line=""><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0"> }</span></span>
<span data-line=""><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0"> }</span></span>
<span data-line=""><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0">}</span></span></code></pre></figure>
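<p>Remote state also lets one configuration read the outputs of another. The following sketch uses Terraform's built-in <code>terraform_remote_state</code> data source against the same organization and workspace; the output name <code>cluster_endpoint</code> is hypothetical and assumes the <code>ml-project-prod</code> workspace exports it:</p>
<figure data-rehype-pretty-code-figure=""><pre tabindex="0" data-language="hcl" data-theme="min-light min-dark"><code data-language="hcl" data-theme="min-light min-dark" style="display: grid;"><span data-line=""># Read-only view of the state stored in the ml-project-prod workspace.</span>
<span data-line="">data "terraform_remote_state" "ml_project" {</span>
<span data-line="">  backend = "remote"</span>
<span data-line=""> </span>
<span data-line="">  config = {</span>
<span data-line="">    organization = "kled-io"</span>
<span data-line="">    workspaces = {</span>
<span data-line="">      name = "ml-project-prod"</span>
<span data-line="">    }</span>
<span data-line="">  }</span>
<span data-line="">}</span>
<span data-line=""> </span>
<span data-line=""># Hypothetical: "cluster_endpoint" must be declared as an output in that workspace.</span>
<span data-line="">output "training_cluster_endpoint" {</span>
<span data-line="">  value = data.terraform_remote_state.ml_project.outputs.cluster_endpoint</span>
<span data-line="">}</span></code></pre></figure>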
<h3>4. Workflow Integration</h3>
<p>Terraform operations integrate with Kled.io's CI/CD pipelines, enabling infrastructure updates as part of your ML workflow:</p>
<ol>
<li>Push code changes to trigger the pipeline</li>
<li>Review the <code>terraform plan</code> output</li>
<li>Approve the planned changes</li>
<li>Let the pipeline apply the plan, provisioning or updating resources automatically</li>
</ol>
<h2>Real-World Example: ML Training Pipeline</h2>
<p>Let's examine how a real ML team uses Terraform within Kled.io to manage their training infrastructure:</p>
<figure data-rehype-pretty-code-figure=""><pre tabindex="0" data-language="hcl" data-theme="min-light min-dark"><code data-language="hcl" data-theme="min-light min-dark" style="display: grid;"><span data-line=""><span style="--shiki-light:#C2C3C5;--shiki-dark:#6B737C"># Configure GPU-accelerated training cluster</span></span>
<span data-line=""><span style="--shiki-light:#6F42C1;--shiki-dark:#B392F0">module</span><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0"> "training_cluster" {</span></span>
<span data-line=""><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0"> source </span><span style="--shiki-light:#D32F2F;--shiki-dark:#F97583">=</span><span style="--shiki-light:#22863A;--shiki-dark:#FFAB70"> "kled/ml-training-cluster/aws"</span></span>
<span data-line=""><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0"> </span></span>
<span data-line=""><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0"> instance_type </span><span style="--shiki-light:#D32F2F;--shiki-dark:#F97583">=</span><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0"> var.training_instance_type</span></span>
<span data-line=""><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0"> node_count </span><span style="--shiki-light:#D32F2F;--shiki-dark:#F97583">=</span><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0"> var.training_node_count</span></span>
<span data-line=""><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0"> subnet_ids </span><span style="--shiki-light:#D32F2F;--shiki-dark:#F97583">=</span><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0"> module.network.private_subnet_ids</span></span>
<span data-line=""><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0"> </span></span>
<span data-line=""><span style="--shiki-light:#C2C3C5;--shiki-dark:#6B737C"> # Enable auto-scaling for batch training jobs</span></span>
<span data-line=""><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0"> auto_scaling </span><span style="--shiki-light:#D32F2F;--shiki-dark:#F97583">=</span><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0"> {</span></span>
<span data-line=""><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0"> enabled </span><span style="--shiki-light:#D32F2F;--shiki-dark:#F97583">=</span><span style="--shiki-light:#1976D2;--shiki-dark:#79B8FF"> true</span></span>
<span data-line=""><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0"> min_nodes </span><span style="--shiki-light:#D32F2F;--shiki-dark:#F97583">=</span><span style="--shiki-light:#1976D2;--shiki-dark:#F8F8F8"> 1</span></span>
<span data-line=""><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0"> max_nodes </span><span style="--shiki-light:#D32F2F;--shiki-dark:#F97583">=</span><span style="--shiki-light:#1976D2;--shiki-dark:#F8F8F8"> 8</span></span>
<span data-line=""><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0"> scale_factor </span><span style="--shiki-light:#D32F2F;--shiki-dark:#F97583">=</span><span style="--shiki-light:#22863A;--shiki-dark:#FFAB70"> "gpu_utilization"</span></span>
<span data-line=""><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0"> }</span></span>
<span data-line=""><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0">}</span></span>
<span data-line=""> </span>
<span data-line=""><span style="--shiki-light:#C2C3C5;--shiki-dark:#6B737C"># Configure dataset storage with versioning</span></span>
<span data-line=""><span style="--shiki-light:#6F42C1;--shiki-dark:#B392F0">module</span><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0"> "dataset_storage" {</span></span>
<span data-line=""><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0"> source </span><span style="--shiki-light:#D32F2F;--shiki-dark:#F97583">=</span><span style="--shiki-light:#22863A;--shiki-dark:#FFAB70"> "kled/ml-dataset-storage/aws"</span></span>
<span data-line=""><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0"> </span></span>
<span data-line=""><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0"> bucket_name </span><span style="--shiki-light:#D32F2F;--shiki-dark:#F97583">=</span><span style="--shiki-light:#22863A;--shiki-dark:#FFAB70"> "</span><span style="--shiki-light:#D32F2F;--shiki-dark:#F97583">${</span><span style="--shiki-light:#22863A;--shiki-dark:#FFAB70">var</span><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0">.</span><span style="--shiki-light:#22863A;--shiki-dark:#FFAB70">project_name</span><span style="--shiki-light:#D32F2F;--shiki-dark:#F97583">}</span><span style="--shiki-light:#22863A;--shiki-dark:#FFAB70">-datasets"</span></span>
<span data-line=""><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0"> versioning </span><span style="--shiki-light:#D32F2F;--shiki-dark:#F97583">=</span><span style="--shiki-light:#1976D2;--shiki-dark:#79B8FF"> true</span></span>
<span data-line=""><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0"> lifecycle_rules </span><span style="--shiki-light:#D32F2F;--shiki-dark:#F97583">=</span><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0"> [{</span></span>
<span data-line=""><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0"> prefix </span><span style="--shiki-light:#D32F2F;--shiki-dark:#F97583">=</span><span style="--shiki-light:#22863A;--shiki-dark:#FFAB70"> "raw/"</span></span>
<span data-line=""><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0"> enabled </span><span style="--shiki-light:#D32F2F;--shiki-dark:#F97583">=</span><span style="--shiki-light:#1976D2;--shiki-dark:#79B8FF"> true</span></span>
<span data-line=""><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0"> expiration </span><span style="--shiki-light:#D32F2F;--shiki-dark:#F97583">=</span><span style="--shiki-light:#1976D2;--shiki-dark:#F8F8F8"> 90</span></span>
<span data-line=""><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0"> }]</span></span>
<span data-line=""><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0">}</span></span>
<span data-line=""> </span>
<span data-line=""><span style="--shiki-light:#C2C3C5;--shiki-dark:#6B737C"># Set up model registry</span></span>
<span data-line=""><span style="--shiki-light:#6F42C1;--shiki-dark:#B392F0">module</span><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0"> "model_registry" {</span></span>
<span data-line=""><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0"> source </span><span style="--shiki-light:#D32F2F;--shiki-dark:#F97583">=</span><span style="--shiki-light:#22863A;--shiki-dark:#FFAB70"> "kled/ml-model-registry/aws"</span></span>
<span data-line=""><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0"> </span></span>
<span data-line=""><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0"> registry_name </span><span style="--shiki-light:#D32F2F;--shiki-dark:#F97583">=</span><span style="--shiki-light:#22863A;--shiki-dark:#FFAB70"> "</span><span style="--shiki-light:#D32F2F;--shiki-dark:#F97583">${</span><span style="--shiki-light:#22863A;--shiki-dark:#FFAB70">var</span><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0">.</span><span style="--shiki-light:#22863A;--shiki-dark:#FFAB70">project_name</span><span style="--shiki-light:#D32F2F;--shiki-dark:#F97583">}</span><span style="--shiki-light:#22863A;--shiki-dark:#FFAB70">-models"</span></span>
<span data-line=""><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0"> retention_days </span><span style="--shiki-light:#D32F2F;--shiki-dark:#F97583">=</span><span style="--shiki-light:#1976D2;--shiki-dark:#F8F8F8"> 180</span></span>
<span data-line=""><span style="--shiki-light:#24292EFF;--shiki-dark:#B392F0">}</span></span></code></pre></figure>
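<p>The configuration above references three input variables without declaring them. A possible set of declarations is sketched below. The defaults and the naming rule are assumptions: the rule presumes the storage module creates an S3 bucket, whose name may only contain lowercase letters, digits, dots, and hyphens.</p>
<figure data-rehype-pretty-code-figure=""><pre tabindex="0" data-language="hcl" data-theme="min-light min-dark"><code data-language="hcl" data-theme="min-light min-dark" style="display: grid;"><span data-line="">variable "project_name" {</span>
<span data-line="">  description = "Prefix for the dataset bucket and model registry names"</span>
<span data-line="">  type        = string</span>
<span data-line=""> </span>
<span data-line="">  # Assumed rule: reject names that would produce an invalid bucket name</span>
<span data-line="">  # once the "-datasets" suffix is appended.</span>
<span data-line="">  validation {</span>
<span data-line="">    condition     = can(regex("^[a-z0-9][a-z0-9-]{1,40}$", var.project_name))</span>
<span data-line="">    error_message = "The project_name must be 2-41 lowercase letters, digits, or hyphens."</span>
<span data-line="">  }</span>
<span data-line="">}</span>
<span data-line=""> </span>
<span data-line=""># Illustrative defaults; override them per environment with a .tfvars file.</span>
<span data-line="">variable "training_instance_type" {</span>
<span data-line="">  type    = string</span>
<span data-line="">  default = "p3.8xlarge"</span>
<span data-line="">}</span>
<span data-line=""> </span>
<span data-line="">variable "training_node_count" {</span>
<span data-line="">  type    = number</span>
<span data-line="">  default = 4</span>
<span data-line="">}</span></code></pre></figure>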
<p>With this configuration in version control, the team can:</p>
<ul>
<li>Replicate the exact environment across regions</li>
<li>Scale resources up or down based on workload</li>
<li>Maintain a history of infrastructure changes</li>
<li>Onboard new team members with clear infrastructure documentation</li>
</ul>
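<p>Replicating an environment in a second region can be sketched with Terraform provider aliases. This example assumes the <code>kled/ml-training-cluster/aws</code> module uses the default <code>aws</code> provider configuration; the regions and the <code>eu_private_subnet_ids</code> variable are hypothetical:</p>
<figure data-rehype-pretty-code-figure=""><pre tabindex="0" data-language="hcl" data-theme="min-light min-dark"><code data-language="hcl" data-theme="min-light min-dark" style="display: grid;"><span data-line=""># Default provider configuration for the primary region.</span>
<span data-line="">provider "aws" {</span>
<span data-line="">  region = "us-east-1"</span>
<span data-line="">}</span>
<span data-line=""> </span>
<span data-line=""># Aliased provider configuration for the second region.</span>
<span data-line="">provider "aws" {</span>
<span data-line="">  alias  = "eu"</span>
<span data-line="">  region = "eu-west-1"</span>
<span data-line="">}</span>
<span data-line=""> </span>
<span data-line=""># Same module, same inputs, different region.</span>
<span data-line="">module "training_cluster_eu" {</span>
<span data-line="">  source = "kled/ml-training-cluster/aws"</span>
<span data-line=""> </span>
<span data-line="">  providers = {</span>
<span data-line="">    aws = aws.eu</span>
<span data-line="">  }</span>
<span data-line=""> </span>
<span data-line="">  instance_type = var.training_instance_type</span>
<span data-line="">  node_count    = var.training_node_count</span>
<span data-line="">  subnet_ids    = var.eu_private_subnet_ids # hypothetical variable</span>
<span data-line="">}</span></code></pre></figure>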
<h2>Ethical Considerations</h2>
<p>When automating infrastructure provisioning, it's essential to consider:</p>
<ul>
<li><strong>Resource efficiency</strong>: Avoid over-provisioning by using auto-scaling and right-sizing resources</li>
<li><strong>Cost management</strong>: Set up budgeting and alerting to prevent unexpected expenses</li>
<li><strong>Environmental impact</strong>: Consider the carbon footprint of large-scale ML infrastructure</li>
</ul>
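<p>Budget alerts can live in the same configuration as the infrastructure they guard. As a rough sketch for AWS, the <code>aws_budgets_budget</code> resource below sends an email when forecasted monthly spend passes 80% of the limit; the budget name, amount, and address are placeholders:</p>
<figure data-rehype-pretty-code-figure=""><pre tabindex="0" data-language="hcl" data-theme="min-light min-dark"><code data-language="hcl" data-theme="min-light min-dark" style="display: grid;"><span data-line=""># Monthly cost budget with a forecast-based alert (placeholder values).</span>
<span data-line="">resource "aws_budgets_budget" "ml_training" {</span>
<span data-line="">  name         = "ml-training-monthly"</span>
<span data-line="">  budget_type  = "COST"</span>
<span data-line="">  limit_amount = "5000"</span>
<span data-line="">  limit_unit   = "USD"</span>
<span data-line="">  time_unit    = "MONTHLY"</span>
<span data-line=""> </span>
<span data-line="">  notification {</span>
<span data-line="">    comparison_operator        = "GREATER_THAN"</span>
<span data-line="">    threshold                  = 80</span>
<span data-line="">    threshold_type             = "PERCENTAGE"</span>
<span data-line="">    notification_type          = "FORECASTED"</span>
<span data-line="">    subscriber_email_addresses = ["ml-platform@example.com"]</span>
<span data-line="">  }</span>
<span data-line="">}</span></code></pre></figure>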
<h2>Conclusion</h2>
<p>HashiCorp Terraform integration in Kled.io transforms how ML teams provision and manage infrastructure. By codifying infrastructure requirements, teams can ensure consistency, improve collaboration, and accelerate their ML development cycle.</p>
<p>Future enhancements to Kled.io's Terraform integration will include:</p>
<ul>
<li>Expanded library of ML-specific modules</li>
<li>Cost optimization recommendations</li>
<li>Enhanced visualization of infrastructure</li>
<li>Integration with infrastructure policy frameworks</li>
</ul>
<p>As ML workflows become more complex, treating infrastructure as code becomes increasingly important for maintaining scalability, reproducibility, and reliability in ML operations.</p>
<blockquote>
<p>"Infrastructure as code isn't just about automation—it's about creating a shared understanding of the environments where our models live."</p>
</blockquote>
<p><img src="https://images.unsplash.com/photo-1612673507976-3fcf170a1114?q=80&w=2070&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" alt="Terraform Architecture Diagram"></p>
<p>For more information on Terraform integration in Kled.io, visit our <a href="https://kled.io/docs/terraform">documentation</a>.</p>