# AWS Deployment

This guide covers deploying ML pipelines to AWS using ephemeral EC2 instances.
## Overview
The AWS deployment uses:
- Terraform: Infrastructure as Code
- EC2: Ephemeral compute instances
- S3: Data and artifact storage
- IAM: Access management
## Architecture

```text
┌─────────────────────────────────────────────────────────────┐
│ AWS Account │
│ (064592191516) │
├─────────────────────────────────────────────────────────────┤
│ ┌─────────────────────────────────────────────────────┐ │
│ │ VPC (10.0.0.0/16) │ │
│ │ ┌───────────────────────────────────────────────┐ │ │
│ │ │ Public Subnet (10.0.1.0/24) │ │ │
│ │ │ ┌─────────────────────────────────────────┐ │ │ │
│ │ │ │ Ephemeral EC2 Instance │ │ │ │
│ │ │ │ ┌─────────────┐ ┌─────────────────┐ │ │ │ │
│ │ │ │ │ mlflow-tf │ │ mlflow-sklearn │ │ │ │ │
│ │ │ │ └─────────────┘ └─────────────────┘ │ │ │ │
│ │ │ │ ┌─────────────────────────────────┐ │ │ │ │
│ │ │ │ │ MLflow Server (:5000) │ │ │ │ │
│ │ │ │ └─────────────────────────────────┘ │ │ │ │
│ │ │ └─────────────────────────────────────────┘ │ │ │
│ │ └───────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ S3 Bucket (064592191516-mlflow) │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────┐ │ │
│ │ │ Training Data│ │ MLruns │ │ Logs │ │ │
│ │ └──────────────┘ └──────────────┘ └──────────┘ │ │
│ └─────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
```
## Directory Structure

```text
deploy/aws/064592191516/us-east-1/
├── 01-infrastructure/ # Terraform
│ ├── main.tf
│ └── userdata.sh
├── 02-configuration/ # Scripts
│ ├── launch_pipeline.sh
│ ├── terminate_instance.sh
│ └── check_status.sh
├── 03-application-mlflow-tf/ # TF config
│ └── config.json
├── 04-application-mlflow-sklearn/ # Sklearn config
│ └── config.json
├── 05-datasources/
└── 06-identities/
```
## Prerequisites
- AWS CLI configured with credentials
- Terraform >= 1.0.0
- SSH key pair (~/.ssh/id_rsa.pub)
- S3 bucket for Terraform state
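
A quick sanity check that these prerequisites are in place (all standard commands; `aws sts get-caller-identity` fails fast if credentials are missing or expired):

```bash
# Verify tool versions and credentials before running Terraform
aws --version
terraform version                # should report >= 1.0.0
test -f ~/.ssh/id_rsa.pub && echo "SSH public key found"
aws sts get-caller-identity      # confirms credentials resolve to an account
```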
## Initial Setup

### 1. Initialize Terraform

```bash
cd deploy/aws/064592191516/us-east-1/01-infrastructure
terraform init
```
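
If the S3 state bucket is not hard-coded in the backend block, it can be passed at init time with Terraform's standard `-backend-config` flags. A sketch, assuming a hypothetical bucket and key; substitute your own state bucket:

```bash
# Point Terraform state at S3 (bucket/key values are placeholders)
terraform init \
  -backend-config="bucket=my-terraform-state-bucket" \
  -backend-config="key=mlops/us-east-1/infrastructure.tfstate" \
  -backend-config="region=us-east-1"
```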
### 2. Review and Apply

```bash
# Review changes
terraform plan

# Apply infrastructure
terraform apply
```
This creates:
- VPC with public subnet
- Security group (SSH, MLflow)
- IAM role with S3/ECR permissions
- EC2 launch template
- CloudWatch log group
### 3. Verify Outputs

```bash
terraform output
```
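
Individual values can also be extracted for scripting. The output names below are illustrative, not necessarily what `main.tf` defines:

```bash
# Dump all outputs as JSON
terraform output -json

# Print a single output without quotes (hypothetical output name)
terraform output -raw launch_template_id
```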
## Running Pipelines

### Launch Ephemeral Instance

```bash
cd deploy/aws/064592191516/us-east-1/02-configuration

# Run both pipelines
./launch_pipeline.sh all

# Run specific pipeline
./launch_pipeline.sh sklearn
./launch_pipeline.sh tf

# Use GPU instance
./launch_pipeline.sh tf --gpu

# Auto-terminate after completion
./launch_pipeline.sh all --terminate
```
### Check Status

```bash
./check_status.sh
```

Output:

```text
=== Instance Status ===
Instance ID: i-0abc123def456
State: running
Public IP: 54.123.45.67
SSH: ssh -i ~/.ssh/mlops-pipeline-key.pem ubuntu@54.123.45.67
MLflow UI: http://54.123.45.67:5000
```
### Terminate Instance

```bash
./terminate_instance.sh
```
## Instance Types

| Type | vCPUs | Memory | GPU | On-Demand Cost/hr (us-east-1) | Use Case |
|------|-------|--------|-----|-------------------------------|----------|
| t3.xlarge | 4 | 16 GB | - | ~$0.17 | sklearn, small models |
| g4dn.xlarge | 4 | 16 GB | T4 | ~$0.53 | TensorFlow training |
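
Costs are approximate on-demand rates and change over time; GPU capacity also varies by Availability Zone. A standard `aws ec2` query confirms where the GPU type is offered before launching:

```bash
# List the AZs in us-east-1 that offer g4dn.xlarge
aws ec2 describe-instance-type-offerings \
  --location-type availability-zone \
  --filters Name=instance-type,Values=g4dn.xlarge \
  --region us-east-1 \
  --query 'InstanceTypeOfferings[].Location' \
  --output text
```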
## Configuration

### Terraform Variables

Edit `01-infrastructure/main.tf`:

```hcl
variable "instance_type" {
  default = "t3.xlarge"
}

variable "gpu_instance_type" {
  default = "g4dn.xlarge"
}

variable "allowed_ssh_cidr" {
  default = "YOUR_IP/32" # Restrict SSH access
}
```
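
Variables can also be overridden at apply time instead of editing the file, using standard Terraform mechanisms (the values below are examples):

```bash
# One-off override on the command line
terraform apply -var="instance_type=t3.2xlarge"

# Or persist overrides in terraform.tfvars, which Terraform loads automatically
cat > terraform.tfvars <<'EOF'
instance_type    = "t3.2xlarge"
allowed_ssh_cidr = "203.0.113.10/32"
EOF
terraform apply
```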
### Pipeline Configuration

Edit the `config.json` files under `03-application-mlflow-tf/` and `04-application-mlflow-sklearn/`:

```json
{
  "mlflow": {
    "tracking_uri": "http://localhost:5000",
    "artifact_location": "s3://064592191516-mlflow/mlruns"
  },
  "aws": {
    "region": "us-east-1",
    "s3_bucket": "064592191516-mlflow"
  }
}
```
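
A malformed config typically only surfaces after the instance is already running, so it is worth validating the JSON before launching (either check exits non-zero on a syntax error):

```bash
jq empty 03-application-mlflow-tf/config.json
python3 -m json.tool 04-application-mlflow-sklearn/config.json > /dev/null
```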
## Security

### SSH Access

Restrict SSH to your IP:

```hcl
variable "allowed_ssh_cidr" {
  default = "203.0.113.0/24" # Your IP range
}
```
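
Your current public IP can be looked up and passed straight to Terraform; checkip.amazonaws.com is an AWS-operated endpoint that returns the caller's address:

```bash
# Restrict SSH ingress to the machine running this command
MY_IP=$(curl -s https://checkip.amazonaws.com)
terraform apply -var="allowed_ssh_cidr=${MY_IP}/32"
```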
### IAM Permissions
The instance role has:
- S3: Read/write to mlflow bucket
- ECR: Pull/push images
- CloudWatch: Write logs
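
The attached policies can be inspected with the CLI. The role name below is a placeholder; the actual name is defined in `01-infrastructure/main.tf`:

```bash
# List managed and inline policies on the instance role (placeholder name)
aws iam list-attached-role-policies --role-name mlops-pipeline-role
aws iam list-role-policies --role-name mlops-pipeline-role
```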
### Encryption
- EBS volumes are encrypted
- S3 bucket versioning is enabled
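
Both settings can be verified from the CLI (the bucket name comes from the architecture above; the volume filter assumes the Project tag propagates to instance volumes):

```bash
# Confirm versioning on the MLflow bucket
aws s3api get-bucket-versioning --bucket 064592191516-mlflow

# Confirm attached volumes report Encrypted: true
aws ec2 describe-volumes \
  --filters "Name=tag:Project,Values=mlops-with-mlflow" \
  --query 'Volumes[].{Id:VolumeId,Encrypted:Encrypted}'
```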
## Cost Management

### Spot Instances (Optional)

For non-critical workloads, modify the launch template:

```hcl
instance_market_options {
  market_type = "spot"

  spot_options {
    max_price = "0.10"
  }
}
```
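
Recent spot prices help pick a sensible `max_price`; this is a standard `aws ec2` query, and results vary by AZ and time:

```bash
# Show recent g4dn.xlarge Linux spot prices in us-east-1
aws ec2 describe-spot-price-history \
  --instance-types g4dn.xlarge \
  --product-descriptions "Linux/UNIX" \
  --region us-east-1 \
  --max-items 5
```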
### Auto-Termination

Always use the `--terminate` flag for batch jobs:

```bash
./launch_pipeline.sh all --terminate
```
## Cleanup

Terminate running instances:

```bash
# List all pipeline instances
aws ec2 describe-instances \
  --filters "Name=tag:Project,Values=mlops-with-mlflow" \
  --query 'Reservations[].Instances[?State.Name==`running`].[InstanceId,PublicIpAddress]'

# Terminate all
./terminate_instance.sh
```
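
If the helper script is unavailable, the same instances can be terminated directly by combining the query above with `terminate-instances` (a sketch; it errors if no instances match):

```bash
# Terminate every running instance tagged with the project
aws ec2 terminate-instances --instance-ids $(
  aws ec2 describe-instances \
    --filters "Name=tag:Project,Values=mlops-with-mlflow" \
              "Name=instance-state-name,Values=running" \
    --query 'Reservations[].Instances[].InstanceId' \
    --output text
)
```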
## Monitoring

### CloudWatch Logs

View logs in the AWS Console or via the CLI:

```bash
aws logs tail /mlops/pipeline --follow
```
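
Past output can also be searched without following; the ERROR pattern below is just an example:

```bash
# Last hour of output
aws logs tail /mlops/pipeline --since 1h

# Search the log group for error lines
aws logs filter-log-events \
  --log-group-name /mlops/pipeline \
  --filter-pattern "ERROR"
```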
### Instance Logs

SSH into the instance:

```bash
ssh -i ~/.ssh/mlops-pipeline-key.pem ubuntu@<IP>

# Setup logs
tail -f /var/log/userdata.log

# Pipeline logs
tail -f /opt/mlops/logs/pipeline.log

# MLflow server logs
sudo journalctl -u mlflow-server -f
```
## Teardown

### Remove All Infrastructure

```bash
cd deploy/aws/064592191516/us-east-1/01-infrastructure
terraform destroy
```

⚠️ This deletes all resources, including the VPC, security groups, and IAM roles.
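
If the S3 bucket is managed by this stack and is not configured with `force_destroy`, `terraform destroy` will fail while the bucket still holds objects. Emptying it first avoids that, but permanently removes training data and MLflow artifacts:

```bash
# Empty the MLflow bucket (irreversible), then destroy
aws s3 rm s3://064592191516-mlflow --recursive
terraform destroy
```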