Building MLOps Infrastructure: MLFlow, Airflow, and Ray on Kubernetes
Production MLOps stack on AWS EKS—MLFlow experiment tracking, Airflow orchestration, Ray distributed training, and Evidently AI drift monitoring.
Krishna C · August 15, 2022 · 3 min read · Updated March 20, 2023
I built MLOps infrastructure on AWS EKS that data scientists can use directly without waiting on platform engineers. The stack: Airflow for orchestration, MLFlow for experiment tracking, Ray for distributed training, and Evidently AI for drift monitoring—all packaged in Helm charts for consistent environments.
Automated Workflows with Airflow
Data scientists write Airflow DAGs defining their ML pipelines. The KubernetesExecutor spawns an isolated pod for each task.
```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

with DAG(
    "customer_churn_prediction",
    schedule_interval="@daily",
    start_date=datetime(2022, 1, 1),
    catchup=False,
) as dag:

    load_data = KubernetesPodOperator(
        task_id="load_data",
        image="ml-pipeline:latest",
        cmds=["python", "load_data.py"],
        namespace="ml-workflows",
        resources={"request_memory": "4Gi", "request_cpu": "2"},
    )

    train_model = KubernetesPodOperator(
        task_id="train_model",
        image="ml-pipeline:latest",
        cmds=["python", "train_model.py"],
        namespace="ml-workflows",
        resources={"request_memory": "16Gi", "request_cpu": "8"},
    )

    evaluate_drift = KubernetesPodOperator(
        task_id="evaluate_drift",
        image="evidently-monitor:latest",
        cmds=["python", "check_drift.py"],
        namespace="ml-workflows",
    )

    load_data >> train_model >> evaluate_drift
```
Airflow provides a visual DAG interface, automatic retries, cron- or event-driven scheduling, and per-task resource isolation. MLFlow tracking happens inside the task code itself: parameters, metrics, and artifacts are logged automatically.
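For illustration, here is a hedged sketch of what a task script like train_model.py might contain. The tracking URI matches the in-cluster server used later in this post, but the parameter names and the toy compute_metrics helper are assumptions, not code from the actual pipeline:

```python
# Hypothetical sketch of train_model.py, the script the train_model task runs.
# Parameter names and compute_metrics are illustrative stand-ins.

def compute_metrics(y_true, y_pred):
    """Toy accuracy/precision/recall over binary labels (stand-in for sklearn)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return {
        "accuracy": correct / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

def main():
    # mlflow is imported lazily here; it ships in the ml-pipeline image
    import mlflow

    mlflow.set_tracking_uri("http://mlflow-server:5000")
    with mlflow.start_run(run_name="customer_churn_train"):
        params = {"max_depth": 6, "learning_rate": 0.1}  # hypothetical values
        mlflow.log_params(params)
        # ... train the model, then log metrics and artifacts:
        metrics = compute_metrics([1, 0, 1, 1], [1, 0, 0, 1])
        mlflow.log_metrics(metrics)
        mlflow.log_artifact("feature_importance.png")  # plot written earlier
```

The Airflow task invokes this with `python train_model.py`; everything logged inside the run shows up in the MLFlow UI automatically.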
Experiment Tracking with MLFlow
MLFlow tracking server runs in the cluster, backed by S3 for artifacts and PostgreSQL for metadata. Open source, no vendor lock-in.
Every experiment run logs:
- Parameters: Hyperparameters, feature configs, algorithm choices
- Metrics: Accuracy, precision, recall, loss curves
- Artifacts: Model files, plots, feature importance charts
- Environment: Python version, dependencies, system config
Data scientists compare experiments through the MLFlow UI. Promote good models to the registry with a single click. S3 storage means no disk space worries for multi-GB model artifacts.
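Promotion can also be scripted rather than clicked. A minimal sketch, assuming the in-cluster tracking URI used elsewhere in this post and a hypothetical model name; `mlflow.register_model` and `transition_model_version_stage` are the standard MLflow registry calls:

```python
def model_uri(run_id: str, artifact_path: str = "model") -> str:
    """URI format MLflow uses to reference a model logged under a run."""
    return f"runs:/{run_id}/{artifact_path}"

def promote_model(run_id: str, model_name: str = "customer-churn"):
    """Register a run's model and move it to Staging (names are placeholders)."""
    import mlflow
    from mlflow.tracking import MlflowClient

    mlflow.set_tracking_uri("http://mlflow-server:5000")
    version = mlflow.register_model(model_uri(run_id), model_name)
    MlflowClient().transition_model_version_stage(
        name=model_name, version=version.version, stage="Staging"
    )
```

Scripted promotion fits the GitOps deployment flow mentioned later: a CI job can register and stage the winning run instead of a human clicking through the UI.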
Drift Monitoring with Evidently AI
Models degrade as data distributions shift. Evidently AI provides continuous monitoring integrated with MLFlow.
```python
import mlflow
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, TargetDriftPreset

def monitor_model_drift(model_name, production_data, reference_data):
    report = Report(metrics=[DataDriftPreset(), TargetDriftPreset()])
    report.run(reference_data=reference_data, current_data=production_data)

    # save_html writes to disk and returns None, so save first, then log the file
    report.save_html("drift_report.html")
    mlflow.log_artifact("drift_report.html")

    # result keys follow the report's as_dict() layout and vary by Evidently version
    results = report.as_dict()
    mlflow.log_metrics({
        "data_drift_score": results["metrics"][0]["result"]["drift_score"],
        "target_drift_detected": float(results["metrics"][1]["result"]["drift_detected"]),
    })
```
An Airflow DAG runs Evidently reports nightly. Statistical tests (Kolmogorov-Smirnov, PSI) identify feature drift, and Slack notifications fire when thresholds are exceeded. Severe drift triggers retraining DAGs, with human approval required before production deployment.
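The threshold logic itself is simple. Here is a sketch of a hypothetical should_trigger_retraining helper that a script like check_drift.py could call before kicking off the retraining DAG (the 0.3 cutoff is an assumed value, tuned per model):

```python
def should_trigger_retraining(drift_share: float, threshold: float = 0.3) -> bool:
    """Decide whether drift is severe enough to trigger the retraining DAG.

    drift_share: fraction of features flagged as drifted by Evidently's
    statistical tests (KS for numeric features, PSI for categorical ones).
    threshold: hypothetical cutoff; in practice tuned per model.
    """
    return drift_share >= threshold
```

When this returns True, the monitoring DAG can use Airflow's TriggerDagRunOperator to start the retraining pipeline, which then waits on a manual-approval step before anything reaches production.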
Distributed Training with Ray
Large models require distributed training. Ray Cluster on Kubernetes handles coordination and scaling.
```python
import mlflow
import ray
from ray.train import ScalingConfig
from ray.train.xgboost import XGBoostTrainer

def train_distributed_model(train_path):
    mlflow.set_tracking_uri("http://mlflow-server:5000")
    params = {"objective": "binary:logistic", "max_depth": 6}

    trainer = XGBoostTrainer(
        scaling_config=ScalingConfig(
            num_workers=4,
            use_gpu=True,
            resources_per_worker={"CPU": 4, "GPU": 1},
        ),
        label_column="target",
        params=params,
        # the trainer needs a training dataset; read it as a Ray Dataset
        datasets={"train": ray.data.read_parquet(train_path)},
    )

    with mlflow.start_run():
        result = trainer.fit()
        mlflow.log_params(params)
        mlflow.log_metrics(result.metrics)
```
Data scientists spin up their own Ray clusters via a Jenkins job that deploys the cluster and provides a dashboard endpoint. Workers scale up during training, down when idle. Works with XGBoost, LightGBM, PyTorch, TensorFlow—same API regardless of framework.
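Once a cluster is up, attaching to it from a notebook is one call. A sketch assuming Ray's client server on its default port 10001; the head-service name is a placeholder for whatever the Jenkins job reports back:

```python
def ray_client_address(head_service: str, port: int = 10001) -> str:
    """Build the ray:// address for Ray's client server (default port 10001)."""
    return f"ray://{head_service}:{port}"

def attach(head_service: str):
    """Connect a local session to a remote Ray cluster (service name assumed)."""
    import ray
    ray.init(ray_client_address(head_service))  # e.g. "ray://churn-cluster-head:10001"
```

After `attach(...)`, the same XGBoostTrainer code above runs against the remote workers instead of the local machine.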
What We Achieved
The platform removed the bottleneck between data scientists and infrastructure. Teams run experiments without tickets, track everything in MLFlow, and deploy models through GitOps. Drift monitoring catches model degradation before it affects production.
Most importantly: data scientists focus on models, not infrastructure.
Why Kubernetes
Kubernetes made this possible. Every component—Airflow, MLFlow, Ray, Evidently—runs as containers with consistent deployment patterns. Benefits we saw:
- Resource efficiency: Autoscaling spins up nodes for training jobs, terminates them when idle. No paying for unused compute.
- Isolation: Each experiment runs in its own pod. No dependency conflicts, no "works on my machine."
- Self-service: Jenkins jobs and Helm charts let data scientists provision their own Ray clusters and environments.
- Portability: Same Helm charts deploy to dev laptops, staging, and production. No environment drift.
The initial Kubernetes learning curve paid off quickly. Once the platform existed, adding new tools was just another Helm chart.