Scikit-learn Pipeline

The mlflow-sklearn pipeline trains a scikit-learn logistic regression model for credit card fraud detection and records the results in MLflow.

Pipeline Overview

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│ s01: Create     │────▶│ s02: CSV to     │────▶│ s03: Preprocess │
│ Training Data   │     │ Parquet         │     │                 │
└─────────────────┘     └─────────────────┘     └────────┬────────┘
                                                         │
┌─────────────────┐     ┌─────────────────┐     ┌────────▼────────┐
│ s06: Evaluate   │◀────│ s05: Score      │◀────│ s04: Train      │
│                 │     │                 │     │                 │
└────────┬────────┘     └─────────────────┘     └─────────────────┘
         │
         ▼
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│ s07: Validate   │────▶│ s08: DOE Coarse │────▶│ s09: DOE Fine   │
└─────────────────┘     └─────────────────┘     └─────────────────┘

Running the Pipeline

Full Pipeline

cd src/mlflow-sklearn
make all

Individual Steps

Run individual steps as needed from the src/mlflow-sklearn directory:

# Step 1: Create training dataset
python mlflow_sklearn/s01_create_training_dataset.py

# Step 2: Convert CSV to Parquet
python mlflow_sklearn/s02_csv2parquet.py

# Step 3: Preprocessing
python mlflow_sklearn/s03_preprocessing.py

# Step 4: Train model
python mlflow_sklearn/s04_train.py

# Step 5: Score predictions
python mlflow_sklearn/s05_score.py

# Step 6: Evaluate model
python mlflow_sklearn/s06_evaluate.py

# Step 7: Validate model
python mlflow_sklearn/s07_validate.py
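
The commands above can also be driven from a small script when only a range of steps needs to be rerun. The following is a minimal sketch and not part of the repository: `STEPS`, `build_commands`, and `run_steps` are hypothetical names, and it assumes it is launched from the same directory as the commands above.

```python
import subprocess
import sys
from pathlib import Path

# Script names copied from the commands above, in pipeline order.
STEPS = [
    "s01_create_training_dataset.py",
    "s02_csv2parquet.py",
    "s03_preprocessing.py",
    "s04_train.py",
    "s05_score.py",
    "s06_evaluate.py",
    "s07_validate.py",
]


def build_commands(first=1, last=7, package_dir="mlflow_sklearn"):
    """Return the command lines for steps first..last (1-based, inclusive)."""
    if not 1 <= first <= last <= len(STEPS):
        raise ValueError(f"step range {first}..{last} is outside 1..{len(STEPS)}")
    return [
        [sys.executable, str(Path(package_dir) / name)]
        for name in STEPS[first - 1:last]
    ]


def run_steps(first=1, last=7):
    """Run the selected steps in order, stopping at the first failure."""
    for cmd in build_commands(first, last):
        print("Running:", " ".join(cmd))
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
```

For example, `run_steps(3, 5)` would rerun preprocessing, training, and scoring without regenerating the dataset.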

Hyperparameter Optimization

# Coarse search
python mlflow_sklearn/s08_doe_coarse.py

# Fine search
python mlflow_sklearn/s09_doe_fine.py

# Full DOE
python mlflow_sklearn/s10_doe_full.py
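
The idea behind a coarse search followed by a fine search can be illustrated as below. This is a sketch under stated assumptions, not the logic of the s08/s09 scripts: it assumes the tuned hyperparameter is the regularization strength `C` of logistic regression, and `toy_score` is a made-up objective standing in for a cross-validated metric. The helper names `log_grid` and `refine_grid` are hypothetical.

```python
import math


def log_grid(low, high, num):
    """Return `num` values spaced evenly on a log10 scale from low to high."""
    if num < 2:
        raise ValueError("num must be at least 2")
    lo, hi = math.log10(low), math.log10(high)
    step = (hi - lo) / (num - 1)
    return [10 ** (lo + i * step) for i in range(num)]


def refine_grid(coarse, best, num):
    """Return a finer log grid spanning the coarse neighbours of `best`."""
    i = coarse.index(best)
    low = coarse[max(i - 1, 0)]
    high = coarse[min(i + 1, len(coarse) - 1)]
    return log_grid(low, high, num)


def toy_score(c):
    # Placeholder objective that peaks near C = 0.3 (assumption for the demo);
    # in practice this would be a cross-validated score for the model.
    return -(math.log10(c) - math.log10(0.3)) ** 2


# Coarse pass: one candidate per decade between 1e-3 and 1e3.
coarse = log_grid(1e-3, 1e3, 7)
best_coarse = max(coarse, key=toy_score)

# Fine pass: denser grid between the neighbours of the coarse winner.
fine = refine_grid(coarse, best_coarse, 9)
best_fine = max(fine, key=toy_score)

print(f"coarse best C = {best_coarse:.4g}, fine best C = {best_fine:.4g}")
```

The coarse pass keeps the number of fits small while locating the right order of magnitude; the fine pass spends its budget only in that neighbourhood.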

Data Requirements

The pipeline expects the credit card fraud dataset in S3:

  • Source: s3://064592191516-mlflow/creditcardfraud/creditcard.csv.zip
  • Format: CSV with fraud labels
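
To fetch the archive manually, the S3 URI above has to be split into a bucket and an object key. The sketch below shows one way to do that. `split_s3_uri` and `download_dataset` are hypothetical helpers, not functions from the pipeline, and the download step assumes that `boto3` is installed and that AWS credentials with read access to the bucket are configured.

```python
from urllib.parse import urlparse

# Source location as stated above.
DATASET_URI = "s3://064592191516-mlflow/creditcardfraud/creditcard.csv.zip"


def split_s3_uri(uri):
    """Split an s3://bucket/key URI into (bucket, key)."""
    parsed = urlparse(uri)
    key = parsed.path.lstrip("/")
    if parsed.scheme != "s3" or not parsed.netloc or not key:
        raise ValueError(f"not a valid S3 object URI: {uri!r}")
    return parsed.netloc, key


def download_dataset(uri=DATASET_URI, dest="creditcard.csv.zip"):
    """Download the dataset archive to `dest` (assumes boto3 + credentials)."""
    import boto3  # imported lazily so split_s3_uri works without boto3

    bucket, key = split_s3_uri(uri)
    boto3.client("s3").download_file(bucket, key, dest)
    return dest
```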

Model Configuration

Training parameters are set in mlflow_sklearn/s04_train.py. Edit them there to change the defaults:

Parameter   Default   Description
n_splits    5         Number of cross-validation folds
solver      lbfgs     Optimization algorithm
max_iter    1000      Maximum iterations
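
How these three parameters typically fit together in scikit-learn is sketched below. This is an illustration rather than the contents of s04_train.py: it assumes stratified k-fold cross-validation, uses a synthetic imbalanced dataset from `make_classification` in place of the real fraud data, and scores with F1 as one of the metrics the pipeline reports.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Defaults taken from the table above.
N_SPLITS = 5
SOLVER = "lbfgs"
MAX_ITER = 1000

# Synthetic, imbalanced stand-in for the fraud dataset (assumption).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)

model = LogisticRegression(solver=SOLVER, max_iter=MAX_ITER)

# Stratified folds keep the rare positive class present in every split.
cv = StratifiedKFold(n_splits=N_SPLITS, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="f1")

print(f"F1 per fold: {scores.round(3)}")
print(f"Mean F1: {scores.mean():.3f}")
```

Raising `max_iter` mainly helps when the solver stops before converging, which can happen with unscaled features.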

Outputs

After running the pipeline:

  • MLflow Experiment: scikit_learn_experiment
  • Model Artifact: log_reg_model
  • Metrics: Accuracy, Precision, Recall, F1-score
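
For reference, the four metrics listed above can be derived from the confusion counts of a binary classifier. The function below is a plain-Python sketch for illustration, with the hypothetical name `classification_metrics`. The pipeline itself may compute them differently, for example through scikit-learn.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for 0/1 labels (1 = fraud)."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # fraud caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # fraud missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correctly cleared

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Small made-up example: 4 fraud cases among 10 transactions.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

On heavily imbalanced fraud data, accuracy alone can look high even when most fraud is missed, which is why precision, recall, and F1 are reported alongside it.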

Viewing Results

# Start MLflow UI
make mlflow-ui

# Or start it directly on port 1234
mlflow ui -p 1234

Open http://localhost:1234 to view experiments.