AWS Deployment

This guide covers deploying ML pipelines to AWS using ephemeral EC2 instances.

Overview

The AWS deployment uses:

  • Terraform: Infrastructure as Code
  • EC2: Ephemeral compute instances
  • S3: Data and artifact storage
  • IAM: Access management

Architecture

┌─────────────────────────────────────────────────────────────┐
│                        AWS Account                          │
│                      (064592191516)                         │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────────────────────────────────────────────┐   │
│  │                   VPC (10.0.0.0/16)                 │   │
│  │  ┌───────────────────────────────────────────────┐  │   │
│  │  │           Public Subnet (10.0.1.0/24)         │  │   │
│  │  │  ┌─────────────────────────────────────────┐  │  │   │
│  │  │  │         Ephemeral EC2 Instance          │  │  │   │
│  │  │  │  ┌─────────────┐  ┌─────────────────┐   │  │  │   │
│  │  │  │  │ mlflow-tf   │  │ mlflow-sklearn  │   │  │  │   │
│  │  │  │  └─────────────┘  └─────────────────┘   │  │  │   │
│  │  │  │  ┌─────────────────────────────────┐    │  │  │   │
│  │  │  │  │       MLflow Server (:5000)     │    │  │  │   │
│  │  │  │  └─────────────────────────────────┘    │  │  │   │
│  │  │  └─────────────────────────────────────────┘  │  │   │
│  │  └───────────────────────────────────────────────┘  │   │
│  └─────────────────────────────────────────────────────┘   │
│                           │                                 │
│                           ▼                                 │
│  ┌─────────────────────────────────────────────────────┐   │
│  │              S3 Bucket (064592191516-mlflow)        │   │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────┐  │   │
│  │  │ Training Data│  │   MLruns     │  │   Logs   │  │   │
│  │  └──────────────┘  └──────────────┘  └──────────┘  │   │
│  └─────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘

Directory Structure

deploy/aws/064592191516/us-east-1/
├── 01-infrastructure/           # Terraform
│   ├── main.tf
│   └── userdata.sh
├── 02-configuration/            # Scripts
│   ├── launch_pipeline.sh
│   ├── terminate_instance.sh
│   └── check_status.sh
├── 03-application-mlflow-tf/    # TF config
│   └── config.json
├── 04-application-mlflow-sklearn/ # Sklearn config
│   └── config.json
├── 05-datasources/
└── 06-identities/

Prerequisites

  1. AWS CLI configured with credentials
  2. Terraform >= 1.0.0
  3. SSH key pair (~/.ssh/id_rsa.pub)
  4. S3 bucket for Terraform state
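The items above can be sanity-checked before touching Terraform. A minimal preflight sketch (tool names from the list above; the key path matches item 3 — the checks only report, they do not abort):

```shell
# Preflight check (a sketch): confirm the tools and key this guide assumes.
missing=""
for cmd in aws terraform; do
    command -v "$cmd" >/dev/null 2>&1 || missing="$missing $cmd"
done
[ -f "$HOME/.ssh/id_rsa.pub" ] || missing="$missing id_rsa.pub"
if [ -n "$missing" ]; then
    echo "missing prerequisites:$missing"
else
    echo "all prerequisites found"
fi
```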

Initial Setup

1. Initialize Terraform

cd deploy/aws/064592191516/us-east-1/01-infrastructure

terraform init

2. Review and Apply

# Review changes
terraform plan

# Apply infrastructure
terraform apply

This creates:

  • VPC with public subnet
  • Security group (SSH, MLflow)
  • IAM role with S3/ECR permissions
  • EC2 launch template
  • CloudWatch log group

3. Verify Outputs

terraform output

Running Pipelines

Launch Ephemeral Instance

cd deploy/aws/064592191516/us-east-1/02-configuration

# Run both pipelines
./launch_pipeline.sh all

# Run specific pipeline
./launch_pipeline.sh sklearn
./launch_pipeline.sh tf

# Use GPU instance
./launch_pipeline.sh tf --gpu

# Auto-terminate after completion
./launch_pipeline.sh all --terminate

Check Status

./check_status.sh

Example output (IDs and IPs will differ):

=== Instance Status ===
Instance ID: i-0abc123def456
State: running
Public IP: 54.123.45.67

SSH: ssh -i ~/.ssh/mlops-pipeline-key.pem ubuntu@54.123.45.67
MLflow UI: http://54.123.45.67:5000

Terminate Instance

./terminate_instance.sh

Instance Types

Type         vCPUs  Memory  GPU  Cost/hr  Use Case
t3.xlarge    4      16GB    -    $0.17    sklearn, small models
g4dn.xlarge  4      16GB    T4   $0.53    TensorFlow training
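The hourly rates above are approximate on-demand prices for us-east-1 and drift over time; a back-of-envelope job cost is simply hours × rate, e.g.:

```shell
# Estimate job cost from the rates above (a sketch; check current pricing).
hours=3
awk -v h="$hours" 'BEGIN {
    printf "t3.xlarge   for %sh: $%.2f\n", h, h * 0.17
    printf "g4dn.xlarge for %sh: $%.2f\n", h, h * 0.53
}'
```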

Configuration

Terraform Variables

Edit 01-infrastructure/main.tf:

variable "instance_type" {
  default = "t3.xlarge"
}

variable "gpu_instance_type" {
  default = "g4dn.xlarge"
}

variable "allowed_ssh_cidr" {
  default = "YOUR_IP/32"  # Restrict SSH access
}
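The defaults can also be overridden without editing main.tf by placing a terraform.tfvars file next to it, which Terraform picks up automatically (standard Terraform behavior; variable names as defined above, values illustrative):

```hcl
# terraform.tfvars (sketch) -- read automatically by terraform plan/apply
instance_type     = "t3.xlarge"
gpu_instance_type = "g4dn.xlarge"
allowed_ssh_cidr  = "203.0.113.7/32"   # replace with your own IP
```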

Pipeline Configuration

Edit config files in 03-application-*:

{
    "mlflow": {
        "tracking_uri": "http://localhost:5000",
        "artifact_location": "s3://064592191516-mlflow/mlruns"
    },
    "aws": {
        "region": "us-east-1",
        "s3_bucket": "064592191516-mlflow"
    }
}
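A malformed config.json otherwise fails only at run time on the instance. A quick local check catches missing keys earlier — a sketch, writing a copy of the config above to /tmp purely for illustration:

```shell
# Validate a pipeline config before launch (sketch): write a sample, assert required keys.
cat > /tmp/config.json <<'JSON'
{
    "mlflow": {
        "tracking_uri": "http://localhost:5000",
        "artifact_location": "s3://064592191516-mlflow/mlruns"
    },
    "aws": {
        "region": "us-east-1",
        "s3_bucket": "064592191516-mlflow"
    }
}
JSON
python3 - /tmp/config.json <<'EOF'
import json, sys

cfg = json.load(open(sys.argv[1]))
# Fail loudly if any key the pipelines rely on is absent or empty
for section, key in (("mlflow", "tracking_uri"), ("aws", "region"), ("aws", "s3_bucket")):
    assert cfg.get(section, {}).get(key), f"missing {section}.{key}"
print("config OK")
EOF
```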

Security

SSH Access

Restrict SSH to your IP:

variable "allowed_ssh_cidr" {
  default = "203.0.113.0/24"  # Your IP range
}

IAM Permissions

The instance role has:

  • S3: Read/write to mlflow bucket
  • ECR: Pull/push images
  • CloudWatch: Write logs
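In policy terms that amounts to roughly the following — a sketch for orientation only; the authoritative statements live in 01-infrastructure/main.tf, and the resource ARNs here are inferred from the bucket and log group named elsewhere in this guide:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "MlflowBucket",
            "Effect": "Allow",
            "Action": ["s3:ListBucket", "s3:GetObject", "s3:PutObject"],
            "Resource": [
                "arn:aws:s3:::064592191516-mlflow",
                "arn:aws:s3:::064592191516-mlflow/*"
            ]
        },
        {
            "Sid": "EcrPullPush",
            "Effect": "Allow",
            "Action": [
                "ecr:GetAuthorizationToken",
                "ecr:BatchGetImage",
                "ecr:GetDownloadUrlForLayer",
                "ecr:InitiateLayerUpload",
                "ecr:UploadLayerPart",
                "ecr:CompleteLayerUpload",
                "ecr:PutImage"
            ],
            "Resource": "*"
        },
        {
            "Sid": "PipelineLogs",
            "Effect": "Allow",
            "Action": ["logs:CreateLogStream", "logs:PutLogEvents"],
            "Resource": "arn:aws:logs:us-east-1:064592191516:log-group:/mlops/pipeline:*"
        }
    ]
}
```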

Encryption

  • EBS volumes are encrypted
  • S3 bucket versioning enabled

Cost Management

Spot Instances (Optional)

For interruption-tolerant workloads, add spot options to the launch template:

instance_market_options {
  market_type = "spot"
  spot_options {
    max_price = "0.10"  # USD/hr cap; the instance is interrupted if the spot price exceeds this
  }
}

Auto-Termination

Always use the --terminate flag for batch jobs so the instance shuts down when the pipeline finishes:

./launch_pipeline.sh all --terminate

Cleanup

Terminate running instances:

# List all pipeline instances
aws ec2 describe-instances \
    --filters "Name=tag:Project,Values=mlops-with-mlflow" \
    --query 'Reservations[].Instances[?State.Name==`running`].[InstanceId,PublicIpAddress]'

# Terminate all
./terminate_instance.sh

Monitoring

CloudWatch Logs

View logs in the AWS Console or via the CLI:

aws logs tail /mlops/pipeline --follow

Instance Logs

SSH into instance:

ssh -i ~/.ssh/mlops-pipeline-key.pem ubuntu@<IP>

# Setup logs
tail -f /var/log/userdata.log

# Pipeline logs
tail -f /opt/mlops/logs/pipeline.log

# MLflow server logs
sudo journalctl -u mlflow-server -f

Teardown

Remove All Infrastructure

cd deploy/aws/064592191516/us-east-1/01-infrastructure
terraform destroy

⚠️ This deletes all Terraform-managed resources, including the VPC, security groups, and IAM roles.