System Design: MLOps Pipeline with MLflow & Kubeflow
Design an MLOps pipeline with MLflow and Kubeflow. Covers experiment tracking, a model registry, an 8-step automated pipeline, Kubernetes serving, Redis caching, and Prometheus monitoring.
Question
Difficulty: Senior · Estimated Time: 120 minutes · Tags: Kubernetes, Kubeflow, MLflow, MLOps, System Design, Production ML, CI/CD
Problem Statement
Business Context
You're a senior ML engineer at a company building an AI-powered sentiment analysis chatbot. The product team wants to:
- Deploy a sentiment classification model to production
- Continuously improve the model based on user feedback
- Ensure high availability (99.9% uptime)
- Handle traffic spikes during peak hours
Technical Requirements
- Model Training Pipeline: Automated, reproducible training
- Experiment Tracking: Compare model versions, hyperparameters
- Model Registry: Version control for models with staging/production stages
- Production Serving: Low-latency inference (<500ms p99)
- Monitoring: Track model performance, data drift, system health
- Auto-scaling: Handle 100-10,000 requests/minute
Constraints
- Team of 3 ML engineers
- Must use existing Kubernetes infrastructure
- Budget: $5,000/month for ML infrastructure
- Model must be updated weekly with new training data
High-Level Architecture
Architecture Diagram
Production ML System Architecture

1. DEVELOPMENT & EXPERIMENTATION
┌───────────────────────────────────────────────────────┐
│ MLflow Tracking Server                                │
│  • Experiment tracking                                │
│  • Model Registry (Dev → Staging → Production)        │
│  • PostgreSQL backend + S3 artifacts                  │
└───────────────────────────────────────────────────────┘
                            │
                            ▼
2. AUTOMATED ML PIPELINE (Kubeflow)
┌───────────────────────────────────────────────────────┐
│ Step 1: Data Validation   → Step 2: Preprocessing     │
│ Step 3: Training (GPU)    → Step 4: Evaluation        │
│ Step 5: Registration      → Step 6: Staging Deploy    │
│ Step 7: Integration Tests → Step 8: Production Deploy │
└───────────────────────────────────────────────────────┘
                            │
                            ▼
3. PRODUCTION SERVING (Kubernetes)
┌───────────────────────────────────────────────────────┐
│ Ingress → LoadBalancer → Pods (2-5 replicas)          │
│ HPA: scale on CPU 70% target                          │
│ Each pod: FastAPI + DistilBERT model                  │
└───────────────────────────────────────────────────────┘
                            │
                            ▼
4. SUPPORTING INFRASTRUCTURE
┌───────────────┐   ┌───────────────┐   ┌───────────────┐
│  Redis Cache  │   │  Prometheus   │   │    Grafana    │
│  30% hit rate │   │    Metrics    │   │  Dashboards   │
└───────────────┘   └───────────────┘   └───────────────┘
5-Layer Architecture Overview
| Layer | Component | Purpose |
|---|---|---|
| 1 | MLflow | Experiment tracking, model versioning, registry |
| 2 | Kubeflow | Automated ML pipelines (8-step process) |
| 3 | Kubernetes | Production serving with auto-scaling |
| 4 | Redis | Response caching for repeated queries |
| 5 | Prometheus + Grafana | Monitoring and observability |
Data Flow
- Training Data → S3 bucket (raw data)
- Kubeflow Pipeline → Validates, preprocesses, trains model
- MLflow → Logs experiments, registers model versions
- Model Registry → Promotes model through stages (Dev → Staging → Prod)
- Kubernetes → Serves model via FastAPI endpoints
- Redis → Caches frequent predictions
- Prometheus → Collects metrics from all components
Component Deep-Dives
Layer 1: MLflow Tracking Server
Purpose: Central hub for experiment tracking and model versioning
┌──────────────┐      ┌──────────────┐
│   Tracking   │      │    Model     │
│    Server    │      │   Registry   │
└──────┬───────┘      └──────┬───────┘
       │                     │
       └──────────┬──────────┘
                  ▼
┌─────────────────────────────────────┐
│            PostgreSQL DB            │
│  • Experiment metadata              │
│  • Run parameters/metrics           │
│  • Model versions                   │
└──────────────────┬──────────────────┘
                   │
                   ▼
┌─────────────────────────────────────┐
│          S3 Artifact Store          │
│  • Model binaries                   │
│  • Training artifacts               │
│  • Evaluation reports               │
└─────────────────────────────────────┘
Model Registry Stages:
- None → Initial upload
- Staging → Passed automated tests
- Production → Approved for live traffic
- Archived → Deprecated versions
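As a concrete illustration, here is a minimal sketch of logging a training run and promoting the resulting model through these stages with the MLflow Python client. The tracking URI, experiment name, and metric values are placeholders, and the sketch assumes the training code logged a model under the run's `model` artifact path.

```python
import mlflow
from mlflow.tracking import MlflowClient

# Hypothetical tracking-server URL; point this at your own deployment.
mlflow.set_tracking_uri("http://mlflow.internal:5000")
mlflow.set_experiment("sentiment-classifier")

with mlflow.start_run() as run:
    # Log hyperparameters and evaluation metrics for this training run.
    mlflow.log_param("base_model", "distilbert-base-uncased")
    mlflow.log_param("learning_rate", 2e-5)
    mlflow.log_metric("accuracy", 0.94)
    mlflow.log_metric("f1", 0.93)
    # Assumes a model was logged under the "model" artifact path;
    # registering it creates a new version in stage "None".
    version = mlflow.register_model(
        f"runs:/{run.info.run_id}/model", "sentiment-classifier"
    )

# Promote None -> Staging; Step 8 of the pipeline would later move
# Staging -> Production after integration tests pass.
MlflowClient().transition_model_version_stage(
    name="sentiment-classifier", version=version.version, stage="Staging"
)
```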
Layer 2: Kubeflow 8-Step Pipeline
┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐
│ Step 1  │──▶│ Step 2  │──▶│ Step 3  │──▶│ Step 4  │
│  Data   │   │  Pre-   │   │Training │   │  Eval   │
│  Valid  │   │ process │   │  (GPU)  │   │         │
└─────────┘   └─────────┘   └─────────┘   └────┬────┘
                                               │
                                               ▼
┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐
│ Step 8  │◀──│ Step 7  │◀──│ Step 6  │◀──│ Step 5  │
│  Prod   │   │ Integ   │   │ Staging │   │Register │
│ Deploy  │   │  Tests  │   │ Deploy  │   │         │
└─────────┘   └─────────┘   └─────────┘   └─────────┘
| Step | Name | Purpose | Duration |
|---|---|---|---|
| 1 | Data Validation | Schema checks, null detection | 2 min |
| 2 | Preprocessing | Tokenization, train/test split | 5 min |
| 3 | Training | Fine-tune DistilBERT (GPU) | 45 min |
| 4 | Evaluation | Calculate accuracy, F1, confusion matrix | 3 min |
| 5 | Registration | Register model in MLflow if metrics pass | 1 min |
| 6 | Staging Deploy | Deploy to staging environment | 5 min |
| 7 | Integration Tests | Run E2E tests against staging | 10 min |
| 8 | Production Deploy | Blue-green deployment to prod | 5 min |
Total Pipeline Duration: ~76 minutes
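A condensed sketch of how the first steps of this chain might be expressed with the Kubeflow Pipelines (kfp v2) DSL. Component bodies are stubs, image names and S3 paths are illustrative, and the GPU accelerator label depends on how your cluster exposes GPUs.

```python
from kfp import compiler, dsl

@dsl.component(base_image="python:3.11")
def validate_data(data_path: str) -> str:
    # Step 1: schema checks and null detection (stubbed here).
    return data_path

@dsl.component(base_image="python:3.11")
def preprocess(data_path: str) -> str:
    # Step 2: tokenization and train/test split (stubbed here).
    return data_path

@dsl.component(base_image="python:3.11")
def train(data_path: str) -> str:
    # Step 3: fine-tune DistilBERT and return a model URI (stubbed here).
    return "s3://ml-artifacts/model"

@dsl.pipeline(name="sentiment-training-pipeline")
def training_pipeline(raw_data: str = "s3://ml-data/raw"):
    validated = validate_data(data_path=raw_data)
    prepped = preprocess(data_path=validated.output)
    trained = train(data_path=prepped.output)
    # Run the training step on a GPU node; the accelerator label
    # ("nvidia.com/gpu") is cluster-dependent.
    trained.set_accelerator_type("nvidia.com/gpu").set_accelerator_limit(1)

# Compile to a package that CI can submit (see the CI/CD section below).
compiler.Compiler().compile(training_pipeline, "pipeline.yaml")
```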
Layer 3: Kubernetes Production Serving
Internet → Ingress → LoadBalancer
                          │
                          ▼
                   ┌──────────────┐
                   │ Service (L4) │
                   └──────┬───────┘
                          │
             ┌────────────┼────────────┐
             ▼            ▼            ▼
        ┌──────────┐ ┌──────────┐ ┌──────────┐
        │  Pod 1   │ │  Pod 2   │ │  Pod 3   │
        │ FastAPI  │ │ FastAPI  │ │ FastAPI  │
        │DistilBERT│ │DistilBERT│ │DistilBERT│
        └──────────┘ └──────────┘ └──────────┘
             │            │            │
             └────────────┼────────────┘
                          ▼
                 ┌──────────────────┐
                 │ HorizontalPod    │
                 │ Autoscaler (HPA) │
                 │ Min: 2, Max: 5   │
                 │ Target CPU: 70%  │
                 └──────────────────┘
Pod Specifications:
- CPU: 2 cores (request), 4 cores (limit)
- Memory: 4GB (request), 8GB (limit)
- Model: DistilBERT (~250MB)
- Framework: FastAPI + Uvicorn
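A minimal sketch of the FastAPI process running in each pod. The public `distilbert-base-uncased-finetuned-sst-2-english` checkpoint stands in for the weekly-retrained model that would normally be pulled from the MLflow registry.

```python
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()

# Loaded once at startup so each pod pays the ~250MB load cost one time.
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

class PredictRequest(BaseModel):
    text: str

@app.post("/predict")
def predict(req: PredictRequest):
    result = classifier(req.text)[0]
    return {"label": result["label"], "score": result["score"]}

@app.get("/healthz")
def healthz():
    # Liveness/readiness probe target for Kubernetes.
    return {"status": "ok"}
```

Saved as `main.py`, this would be served by Uvicorn (`uvicorn main:app --host 0.0.0.0 --port 8000`) inside the container image.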
Layer 4: Redis Response Caching
Request: "This product is amazing!"
               │
               ▼
        ┌──────────────┐
        │  Hash Query  │
        │  (SHA-256)   │
        └──────┬───────┘
               │
               ▼
        ┌──────────────┐
        │ Redis Lookup │
        └──────┬───────┘
               │
       ┌───────┴────────┐
       │                │
   Cache Hit       Cache Miss
     (30%)           (70%)
       │                │
       ▼                ▼
    Return          Run Model
    Cached          Inference
     (5ms)           (400ms)
                        │
                        ▼
                 Store in Cache
                 (TTL: 1 hour)
Cache Statistics:
- Hit Rate: ~30%
- Latency (cached): 5ms
- Latency (uncached): 400ms
- Average Latency: ~280ms (0.30 × 5 ms + 0.70 × 400 ms ≈ 281 ms)
- TTL: 1 hour
- Max Cache Size: 10,000 entries
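A cache-aside sketch of this flow. Hashing the raw request text gives exact-match caching: only identical strings share a key, so true semantic caching remains a future improvement (see the last section). The Redis hostname and the `run_model_inference` helper are placeholders.

```python
import hashlib
import json

import redis

# Assumed in-cluster Redis service name.
r = redis.Redis(host="redis", port=6379, decode_responses=True)
CACHE_TTL_SECONDS = 3600  # 1 hour, matching the strategy above

def cached_predict(text: str) -> dict:
    # Exact-match lookup: identical strings hash to the same key.
    key = "pred:" + hashlib.sha256(text.encode("utf-8")).hexdigest()

    cached = r.get(key)
    if cached is not None:  # cache hit (~5ms)
        return json.loads(cached)

    # Cache miss (~400ms): run inference, then store with a TTL.
    # run_model_inference is a hypothetical call into the serving model.
    result = run_model_inference(text)
    r.setex(key, CACHE_TTL_SECONDS, json.dumps(result))
    return result
```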
Integration Patterns & Code Examples
CI/CD Flow
Developer Push
     │
     ▼
┌──────────┐
│  GitHub  │
│ Actions  │
└────┬─────┘
     │
     ▼
┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐
│  Lint   │──▶│  Test   │──▶│  Build  │──▶│ Trigger │
│  Code   │   │  Unit   │   │ Docker  │   │Kubeflow │
└─────────┘   └─────────┘   └─────────┘   └────┬────┘
                                               │
                                               ▼
┌────────────────────────────────────────────────────┐
│                 Kubeflow Pipeline                  │
│  Data Val → Preprocess → Train → Eval → Register   │
│  → Staging → Integration Tests → Production        │
└────────────────────────────────────────────────────┘
Key Integration Points
- GitHub Actions → Kubeflow: Trigger pipeline on code push (see the sketch after this list)
- Kubeflow → MLflow: Log experiments, register models
- MLflow → Kubernetes: Deploy registered models
- Kubernetes → Prometheus: Export metrics
- Prometheus → Grafana: Visualize dashboards
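The first integration point, GitHub Actions triggering Kubeflow, can reduce to a short script run as the final CI step, sketched here with the kfp SDK. The Pipelines endpoint host and the pipeline argument names are illustrative.

```python
import kfp

# Assumed in-cluster Kubeflow Pipelines endpoint.
client = kfp.Client(host="http://ml-pipeline.kubeflow:8888")

# Submit the package compiled earlier by kfp's Compiler.
run = client.create_run_from_pipeline_package(
    pipeline_file="pipeline.yaml",
    arguments={"raw_data": "s3://ml-data/raw"},
    run_name="weekly-retrain",
)
print(f"Started pipeline run: {run.run_id}")
```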
Environment Configuration
| Environment | Purpose | Traffic | Model Version |
|---|---|---|---|
| Development | Testing | 0% | Latest |
| Staging | Pre-production validation | 0% | Staging |
| Production | Live users | 100% | Production |
Scaling & Production Considerations
Blue-Green Deployment Strategy
Before Deployment:
          LoadBalancer
               │
               ▼
┌─────────────┐   ┌─────────────┐
│ BLUE (v1.0) │   │ GREEN (v1.0)│
│   Active    │   │   Standby   │
│    100%     │   │     0%      │
└─────────────┘   └─────────────┘

After Deployment:
          LoadBalancer
               │
               ▼
┌─────────────┐   ┌─────────────┐
│ BLUE (v1.0) │   │ GREEN (v1.1)│
│   Standby   │   │   Active    │
│     0%      │   │    100%     │
└─────────────┘   └─────────────┘

Rollback: switch traffic back to BLUE if issues are detected
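One common way to implement the switch is to repoint a Kubernetes Service selector from the blue Deployment's pods to the green one's, sketched below with the official Python client. The service, namespace, and label names are assumptions.

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

def switch_traffic(color: str) -> None:
    # Patching the selector atomically shifts 100% of traffic to the
    # pods labeled with the given color.
    body = {"spec": {"selector": {"app": "sentiment", "color": color}}}
    v1.patch_namespaced_service(
        name="sentiment-svc", namespace="ml-serving", body=body
    )

switch_traffic("green")   # cut over to v1.1
# switch_traffic("blue")  # instant rollback if issues are detected
```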
Auto-Scaling Configuration
| Metric | Target | Min Pods | Max Pods |
|---|---|---|---|
| CPU Utilization | 70% | 2 | 5 |
| Memory Utilization | 80% | 2 | 5 |
| Request Latency | <500ms p99 | 2 | 5 |
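The CPU and memory rows map directly onto an `autoscaling/v2` HorizontalPodAutoscaler; latency-based scaling is not supported natively and would require custom metrics (e.g., via Prometheus Adapter). A sketch of the HPA using the Kubernetes Python client, with illustrative resource names:

```python
from kubernetes import client, config

config.load_kube_config()

hpa = {
    "apiVersion": "autoscaling/v2",
    "kind": "HorizontalPodAutoscaler",
    "metadata": {"name": "sentiment-hpa", "namespace": "ml-serving"},
    "spec": {
        "scaleTargetRef": {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "name": "sentiment-green",  # hypothetical Deployment name
        },
        "minReplicas": 2,
        "maxReplicas": 5,
        "metrics": [
            {
                "type": "Resource",
                "resource": {
                    "name": "cpu",
                    "target": {"type": "Utilization", "averageUtilization": 70},
                },
            },
            {
                "type": "Resource",
                "resource": {
                    "name": "memory",
                    "target": {"type": "Utilization", "averageUtilization": 80},
                },
            },
        ],
    },
}

client.AutoscalingV2Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="ml-serving", body=hpa
)
```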
Performance Metrics
| Metric | Value |
|---|---|
| Model Latency (p50) | 350ms |
| Model Latency (p99) | 450ms |
| Cache Hit Rate | 30% |
| Cached Response Time | 5ms |
| Throughput | 100-500 RPS |
| Model Accuracy | 94% |
| Uptime SLA | 99.9% |
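To make these numbers observable, each pod can export request counts and latency histograms with `prometheus_client`, which Prometheus then scrapes from a `/metrics` endpoint. Metric names and bucket boundaries below are illustrative; the helper reuses the cache-aside sketch from the Redis section.

```python
import time

from prometheus_client import Counter, Histogram, make_asgi_app

REQUESTS = Counter("inference_requests_total", "Total inference requests")
LATENCY = Histogram(
    "inference_latency_seconds",
    "End-to-end inference latency in seconds",
    # Buckets bracket the 5ms cached path and the ~450ms p99 target.
    buckets=(0.005, 0.05, 0.1, 0.25, 0.35, 0.45, 0.5, 1.0),
)

def instrumented_predict(text: str) -> dict:
    REQUESTS.inc()
    start = time.perf_counter()
    try:
        return cached_predict(text)  # cache-aside helper sketched earlier
    finally:
        LATENCY.observe(time.perf_counter() - start)

# Expose /metrics next to the FastAPI routes, e.g.:
# app.mount("/metrics", make_asgi_app())
```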
Cost Breakdown (Monthly)
| Component | Cost |
|---|---|
| Kubernetes (2-5 pods) | $1,500 |
| GPU Training (spot) | $500 |
| S3 Storage | $100 |
| Redis Cache | $200 |
| Monitoring Stack | $100 |
| MLflow Server | $200 |
| Total | ~$2,600 |
Trade-offs & Alternatives
Why These Technology Choices?
Kubeflow vs Alternatives
| Feature | Kubeflow | Airflow | Jenkins |
|---|---|---|---|
| ML-Native | Yes | No | No |
| Kubernetes Integration | Native | Add-on | Add-on |
| Experiment Tracking | Built-in | External | External |
| GPU Support | Native | Manual | Manual |
| Learning Curve | Medium | Low | Low |
| Best For | ML Pipelines | Data Pipelines | CI/CD |
Decision: Kubeflow chosen for native ML support and Kubernetes integration.
MLflow vs Alternatives
| Feature | MLflow | Weights & Biases | Neptune |
|---|---|---|---|
| Open Source | Yes | No | No |
| Self-Hosted | Yes | No | No |
| Cost | Free | $$$$ | $$$ |
| Model Registry | Yes | Yes | Yes |
| UI Quality | Good | Excellent | Good |
| Best For | Cost-Sensitive | Enterprise | Research |
Decision: MLflow chosen for cost ($0) and self-hosting capability.
DistilBERT vs Alternatives
| Model | Latency | Accuracy | Size |
|---|---|---|---|
| BERT-base | 800ms | 96% | 440MB |
| DistilBERT | 400ms | 94% | 250MB |
| BERT-tiny | 100ms | 88% | 17MB |
Decision: DistilBERT balances latency and accuracy for production.
Architecture Trade-offs
| Decision | Pros | Cons |
|---|---|---|
| Blue-Green (vs Canary) | Simple rollback, no traffic splitting | 2x resources during deploy |
| Redis Cache (vs in-memory) | Shared across pods, persistent | Network hop latency |
| PostgreSQL for MLflow | Reliability, ACID | Operational overhead |
| Kubernetes HPA (vs custom) | Built-in, well-tested | Natively limited to CPU/memory; custom metrics need an adapter |
Future Improvements
- Canary Deployments: Gradual rollout (10% → 50% → 100%)
- Feature Store: Feast for feature management
- A/B Testing: Model comparison in production
- Multi-Model Serving: Triton Inference Server
- Advanced Caching: Semantic similarity-based cache lookup