System Design: MLOps Pipeline with MLflow & Kubeflow
Design an MLOps pipeline with MLflow and Kubeflow. Covers experiment tracking, a model registry, an 8-step automated pipeline, Kubernetes serving, Redis caching, and Prometheus monitoring.
Question
Difficulty: Senior · Estimated Time: 120 minutes · Tags: Kubernetes, Kubeflow, MLflow, MLOps, System Design, Production ML, CI/CD
Problem Statement
Business Context
You're a senior ML engineer at a company building an AI-powered sentiment analysis chatbot. The product team wants to:
- Deploy a sentiment classification model to production
- Continuously improve the model based on user feedback
- Ensure high availability (99.9% uptime)
- Handle traffic spikes during peak hours
Technical Requirements
- Model Training Pipeline: Automated, reproducible training
- Experiment Tracking: Compare model versions, hyperparameters
- Model Registry: Version control for models with staging/production stages
- Production Serving: Low-latency inference (<500ms p99)
- Monitoring: Track model performance, data drift, system health
- Auto-scaling: Handle 100-10,000 requests/minute
Constraints
- Team of 3 ML engineers
- Must use existing Kubernetes infrastructure
- Budget: $5,000/month for ML infrastructure
- Model must be updated weekly with new training data
High-Level Architecture
Architecture Diagram
Production ML System Architecture

1. DEVELOPMENT & EXPERIMENTATION
┌───────────────────────────────────────────────────────┐
│ MLflow Tracking Server                                │
│  • Experiment tracking                                │
│  • Model Registry (Dev → Staging → Production)        │
│  • PostgreSQL backend + S3 artifacts                  │
└───────────────────────────────────────────────────────┘
                            │
                            ▼
2. AUTOMATED ML PIPELINE (Kubeflow)
┌───────────────────────────────────────────────────────┐
│ Step 1: Data Validation   → Step 2: Preprocessing     │
│ Step 3: Training (GPU)    → Step 4: Evaluation        │
│ Step 5: Registration      → Step 6: Staging Deploy    │
│ Step 7: Integration Tests → Step 8: Production Deploy │
└───────────────────────────────────────────────────────┘
                            │
                            ▼
3. PRODUCTION SERVING (Kubernetes)
┌───────────────────────────────────────────────────────┐
│ Ingress → LoadBalancer → Pods (2-5 replicas)          │
│ HPA: scale on CPU 70% target                          │
│ Each pod: FastAPI + DistilBERT model                  │
└───────────────────────────────────────────────────────┘
                            │
                            ▼
4. SUPPORTING INFRASTRUCTURE
┌───────────────┐   ┌───────────────┐   ┌───────────────┐
│  Redis Cache  │   │  Prometheus   │   │    Grafana    │
│  30% hit rate │   │    Metrics    │   │  Dashboards   │
└───────────────┘   └───────────────┘   └───────────────┘
5-Layer Architecture Overview
| Layer | Component | Purpose |
|---|---|---|
| 1 | MLflow | Experiment tracking, model versioning, registry |
| 2 | Kubeflow | Automated ML pipelines (8-step process) |
| 3 | Kubernetes | Production serving with auto-scaling |
| 4 | Redis | Response caching for repeated queries |
| 5 | Prometheus + Grafana | Monitoring and observability |
Data Flow
- Training Data → S3 bucket (raw data)
- Kubeflow Pipeline → Validates, preprocesses, trains model
- MLflow → Logs experiments, registers model versions
- Model Registry → Promotes model through stages (Dev → Staging → Prod)
- Kubernetes → Serves model via FastAPI endpoints
- Redis → Caches frequent predictions
- Prometheus → Collects metrics from all components
Component Deep-Dives
Layer 1: MLflow Tracking Server
Purpose: Central hub for experiment tracking and model versioning
┌──────────────┐      ┌──────────────┐
│   Tracking   │      │    Model     │
│    Server    │      │   Registry   │
└──────┬───────┘      └──────┬───────┘
       │                     │
       └──────────┬──────────┘
                  ▼
┌─────────────────────────────────────┐
│            PostgreSQL DB            │
│  • Experiment metadata              │
│  • Run parameters/metrics           │
│  • Model versions                   │
└──────────────────┬──────────────────┘
                   │
                   ▼
┌─────────────────────────────────────┐
│          S3 Artifact Store          │
│  • Model binaries                   │
│  • Training artifacts               │
│  • Evaluation reports               │
└─────────────────────────────────────┘
Model Registry Stages:
- None → Initial upload
- Staging → Passed automated tests
- Production → Approved for live traffic
- Archived → Deprecated versions
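As a concrete illustration, here is a minimal sketch of logging a training run and promoting the resulting model through these stages with the MLflow Python client. The tracking URI, experiment name, and metric values are placeholders, and the sketch assumes the training code logged a model under the run's `model` artifact path.

```python
import mlflow
from mlflow.tracking import MlflowClient

# Hypothetical tracking-server URL; point this at your own deployment.
mlflow.set_tracking_uri("http://mlflow.internal:5000")
mlflow.set_experiment("sentiment-classifier")

with mlflow.start_run() as run:
    # Log hyperparameters and evaluation metrics for this training run.
    mlflow.log_param("base_model", "distilbert-base-uncased")
    mlflow.log_param("learning_rate", 2e-5)
    mlflow.log_metric("accuracy", 0.94)
    mlflow.log_metric("f1", 0.93)
    # Assumes a model was logged under the "model" artifact path;
    # registering it creates a new version in stage "None".
    version = mlflow.register_model(
        f"runs:/{run.info.run_id}/model", "sentiment-classifier"
    )

# Promote None -> Staging; Step 8 of the pipeline would later move
# Staging -> Production after integration tests pass.
MlflowClient().transition_model_version_stage(
    name="sentiment-classifier", version=version.version, stage="Staging"
)
```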
Layer 2: Kubeflow 8-Step Pipeline
┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐
│ Step 1  │──▶│ Step 2  │──▶│ Step 3  │──▶│ Step 4  │
│  Data   │   │  Pre-   │   │Training │   │  Eval   │
│  Valid  │   │ process │   │  (GPU)  │   │         │
└─────────┘   └─────────┘   └─────────┘   └────┬────┘
                                               │
                                               ▼
┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐
│ Step 8  │◀──│ Step 7  │◀──│ Step 6  │◀──│ Step 5  │
│  Prod   │   │ Integ   │   │ Staging │   │Register │
│ Deploy  │   │  Tests  │   │ Deploy  │   │         │
└─────────┘   └─────────┘   └─────────┘   └─────────┘
| Step | Name | Purpose | Duration |
|---|---|---|---|
| 1 | Data Validation | Schema checks, null detection | 2 min |
| 2 | Preprocessing | Tokenization, train/test split | 5 min |
| 3 | Training | Fine-tune DistilBERT (GPU) | 45 min |
| 4 | Evaluation | Calculate accuracy, F1, confusion matrix | 3 min |
| 5 | Registration | Register model in MLflow if metrics pass | 1 min |
| 6 | Staging Deploy | Deploy to staging environment | 5 min |
| 7 | Integration Tests | Run E2E tests against staging | 10 min |
| 8 | Production Deploy | Blue-green deployment to prod | 5 min |
Total Pipeline Duration: ~76 minutes
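A condensed sketch of how the first steps of this chain might be expressed with the Kubeflow Pipelines (kfp v2) DSL. Component bodies are stubs, image names and S3 paths are illustrative, and the GPU accelerator label depends on how your cluster exposes GPUs.

```python
from kfp import compiler, dsl

@dsl.component(base_image="python:3.11")
def validate_data(data_path: str) -> str:
    # Step 1: schema checks and null detection (stubbed here).
    return data_path

@dsl.component(base_image="python:3.11")
def preprocess(data_path: str) -> str:
    # Step 2: tokenization and train/test split (stubbed here).
    return data_path

@dsl.component(base_image="python:3.11")
def train(data_path: str) -> str:
    # Step 3: fine-tune DistilBERT and return a model URI (stubbed here).
    return "s3://ml-artifacts/model"

@dsl.pipeline(name="sentiment-training-pipeline")
def training_pipeline(raw_data: str = "s3://ml-data/raw"):
    validated = validate_data(data_path=raw_data)
    prepped = preprocess(data_path=validated.output)
    trained = train(data_path=prepped.output)
    # Run the training step on a GPU node; the accelerator label
    # ("nvidia.com/gpu") is cluster-dependent.
    trained.set_accelerator_type("nvidia.com/gpu").set_accelerator_limit(1)

# Compile to a package that CI can submit (see the CI/CD section below).
compiler.Compiler().compile(training_pipeline, "pipeline.yaml")
```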
Layer 3: Kubernetes Production Serving
Internet → Ingress → LoadBalancer
                          │
                          ▼
                   ┌──────────────┐
                   │ Service (L4) │
                   └──────┬───────┘
                          │
             ┌────────────┼────────────┐
             ▼            ▼            ▼
        ┌──────────┐ ┌──────────┐ ┌──────────┐
        │  Pod 1   │ │  Pod 2   │ │  Pod 3   │
        │ FastAPI  │ │ FastAPI  │ │ FastAPI  │
        │DistilBERT│ │DistilBERT│ │DistilBERT│
        └──────────┘ └──────────┘ └──────────┘
             │            │            │
             └────────────┼────────────┘
                          ▼
                 ┌──────────────────┐
                 │ HorizontalPod    │
                 │ Autoscaler (HPA) │
                 │ Min: 2, Max: 5   │
                 │ Target CPU: 70%  │
                 └──────────────────┘
Pod Specifications:
- CPU: 2 cores (request), 4 cores (limit)
- Memory: 4GB (request), 8GB (limit)
- Model: DistilBERT (~250MB)
- Framework: FastAPI + Uvicorn
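A minimal sketch of the FastAPI process running in each pod. The public `distilbert-base-uncased-finetuned-sst-2-english` checkpoint stands in for the weekly-retrained model that would normally be pulled from the MLflow registry.

```python
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()

# Loaded once at startup so each pod pays the ~250MB load cost one time.
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

class PredictRequest(BaseModel):
    text: str

@app.post("/predict")
def predict(req: PredictRequest):
    result = classifier(req.text)[0]
    return {"label": result["label"], "score": result["score"]}

@app.get("/healthz")
def healthz():
    # Liveness/readiness probe target for Kubernetes.
    return {"status": "ok"}
```

Saved as `main.py`, this would be served by Uvicorn (`uvicorn main:app --host 0.0.0.0 --port 8000`) inside the container image.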
Layer 4: Redis Response Caching
Request: "This product is amazing!"
               │
               ▼
        ┌──────────────┐
        │  Hash Query  │
        │  (SHA-256)   │
        └──────┬───────┘
               │
               ▼
        ┌──────────────┐
        │ Redis Lookup │
        └──────┬───────┘
               │
       ┌───────┴────────┐
       │                │
   Cache Hit       Cache Miss
     (30%)           (70%)
       │                │
       ▼                ▼
    Return          Run Model
    Cached          Inference
     (5ms)           (400ms)
                        │
                        ▼
                 Store in Cache
                 (TTL: 1 hour)
Cache Statistics:
- Hit Rate: ~30%
- Latency (cached): 5ms
- Latency (uncached): 400ms
- Average Latency: ~280ms (0.30 × 5 ms + 0.70 × 400 ms ≈ 281 ms)
- TTL: 1 hour
- Max Cache Size: 10,000 entries
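A cache-aside sketch of this flow. Hashing the raw request text gives exact-match caching: only identical strings share a key, so true semantic caching remains a future improvement (see the last section). The Redis hostname and the `run_model_inference` helper are placeholders.

```python
import hashlib
import json

import redis

# Assumed in-cluster Redis service name.
r = redis.Redis(host="redis", port=6379, decode_responses=True)
CACHE_TTL_SECONDS = 3600  # 1 hour, matching the strategy above

def cached_predict(text: str) -> dict:
    # Exact-match lookup: identical strings hash to the same key.
    key = "pred:" + hashlib.sha256(text.encode("utf-8")).hexdigest()

    cached = r.get(key)
    if cached is not None:  # cache hit (~5ms)
        return json.loads(cached)

    # Cache miss (~400ms): run inference, then store with a TTL.
    # run_model_inference is a hypothetical call into the serving model.
    result = run_model_inference(text)
    r.setex(key, CACHE_TTL_SECONDS, json.dumps(result))
    return result
```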
Integration Patterns & Code Examples
CI/CD Flow
Developer Push
     │
     ▼
┌──────────┐
│  GitHub  │
│ Actions  │
└────┬─────┘
     │
     ▼
┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐
│  Lint   │──▶│  Test   │──▶│  Build  │──▶│ Trigger │
│  Code   │   │  Unit   │   │ Docker  │   │Kubeflow │
└─────────┘   └─────────┘   └─────────┘   └────┬────┘
                                               │
                                               ▼
┌────────────────────────────────────────────────────┐
│                 Kubeflow Pipeline                  │
│  Data Val → Preprocess → Train → Eval → Register   │
│  → Staging → Integration Tests → Production        │
└────────────────────────────────────────────────────┘
Key Integration Points
- GitHub Actions → Kubeflow: Trigger pipeline on code push (see the sketch after this list)
- Kubeflow → MLflow: Log experiments, register models
- MLflow → Kubernetes: Deploy registered models
- Kubernetes → Prometheus: Export metrics
- Prometheus → Grafana: Visualize dashboards
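The first integration point, GitHub Actions triggering Kubeflow, can reduce to a short script run as the final CI step, sketched here with the kfp SDK. The Pipelines endpoint host and the pipeline argument names are illustrative.

```python
import kfp

# Assumed in-cluster Kubeflow Pipelines endpoint.
client = kfp.Client(host="http://ml-pipeline.kubeflow:8888")

# Submit the package compiled earlier by kfp's Compiler.
run = client.create_run_from_pipeline_package(
    pipeline_file="pipeline.yaml",
    arguments={"raw_data": "s3://ml-data/raw"},
    run_name="weekly-retrain",
)
print(f"Started pipeline run: {run.run_id}")
```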
Environment Configuration
| Environment | Purpose | Traffic | Model Version |
|---|---|---|---|
| Development | Testing | 0% | Latest |
| Staging | Pre-production validation | 0% | Staging |
| Production | Live users | 100% | Production |
Scaling & Production Considerations
Blue-Green Deployment Strategy
Before Deployment:
          LoadBalancer
               │
               ▼
┌─────────────┐   ┌─────────────┐
│ BLUE (v1.0) │   │ GREEN (v1.0)│
│   Active    │   │   Standby   │
│    100%     │   │     0%      │
└─────────────┘   └─────────────┘

After Deployment:
          LoadBalancer
               │
               ▼
┌─────────────┐   ┌─────────────┐
│ BLUE (v1.0) │   │ GREEN (v1.1)│
│   Standby   │   │   Active    │
│     0%      │   │    100%     │
└─────────────┘   └─────────────┘

Rollback: switch traffic back to BLUE if issues are detected
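One common way to implement the switch is to repoint a Kubernetes Service selector from the blue Deployment's pods to the green one's, sketched below with the official Python client. The service, namespace, and label names are assumptions.

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

def switch_traffic(color: str) -> None:
    # Patching the selector atomically shifts 100% of traffic to the
    # pods labeled with the given color.
    body = {"spec": {"selector": {"app": "sentiment", "color": color}}}
    v1.patch_namespaced_service(
        name="sentiment-svc", namespace="ml-serving", body=body
    )

switch_traffic("green")   # cut over to v1.1
# switch_traffic("blue")  # instant rollback if issues are detected
```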
Auto-Scaling Configuration
| Metric | Target | Min Pods | Max Pods |
|---|---|---|---|
| CPU Utilization | 70% | 2 | 5 |
| Memory Utilization | 80% | 2 | 5 |
| Request Latency | <500ms p99 | 2 | 5 |
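The CPU and memory rows map directly onto an `autoscaling/v2` HorizontalPodAutoscaler; latency-based scaling is not supported natively and would require custom metrics (e.g., via Prometheus Adapter). A sketch of the HPA using the Kubernetes Python client, with illustrative resource names:

```python
from kubernetes import client, config

config.load_kube_config()

hpa = {
    "apiVersion": "autoscaling/v2",
    "kind": "HorizontalPodAutoscaler",
    "metadata": {"name": "sentiment-hpa", "namespace": "ml-serving"},
    "spec": {
        "scaleTargetRef": {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "name": "sentiment-green",  # hypothetical Deployment name
        },
        "minReplicas": 2,
        "maxReplicas": 5,
        "metrics": [
            {
                "type": "Resource",
                "resource": {
                    "name": "cpu",
                    "target": {"type": "Utilization", "averageUtilization": 70},
                },
            },
            {
                "type": "Resource",
                "resource": {
                    "name": "memory",
                    "target": {"type": "Utilization", "averageUtilization": 80},
                },
            },
        ],
    },
}

client.AutoscalingV2Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="ml-serving", body=hpa
)
```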
Performance Metrics
| Metric | Value |
|---|---|
| Model Latency (p50) | 350ms |
| Model Latency (p99) | 450ms |
| Cache Hit Rate | 30% |
| Cached Response Time | 5ms |
| Throughput | 100-500 RPS |
| Model Accuracy | 94% |
| Uptime SLA | 99.9% |
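To make these numbers observable, each pod can export request counts and latency histograms with `prometheus_client`, which Prometheus then scrapes from a `/metrics` endpoint. Metric names and bucket boundaries below are illustrative; the helper reuses the cache-aside sketch from the Redis section.

```python
import time

from prometheus_client import Counter, Histogram, make_asgi_app

REQUESTS = Counter("inference_requests_total", "Total inference requests")
LATENCY = Histogram(
    "inference_latency_seconds",
    "End-to-end inference latency in seconds",
    # Buckets bracket the 5ms cached path and the ~450ms p99 target.
    buckets=(0.005, 0.05, 0.1, 0.25, 0.35, 0.45, 0.5, 1.0),
)

def instrumented_predict(text: str) -> dict:
    REQUESTS.inc()
    start = time.perf_counter()
    try:
        return cached_predict(text)  # cache-aside helper sketched earlier
    finally:
        LATENCY.observe(time.perf_counter() - start)

# Expose /metrics next to the FastAPI routes, e.g.:
# app.mount("/metrics", make_asgi_app())
```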
Cost Breakdown (Monthly)
| Component | Cost |
|---|---|
| Kubernetes (2-5 pods) | $1,500 |
| GPU Training (spot) | $500 |
| S3 Storage | $100 |
| Redis Cache | $200 |
| Monitoring Stack | $100 |
| MLflow Server | $200 |
| Total | ~$2,600 |
Trade-offs & Alternatives
Why These Technology Choices?
Kubeflow vs Alternatives
| Feature | Kubeflow | Airflow | Jenkins |
|---|---|---|---|
| ML-Native | Yes | No | No |
| Kubernetes Integration | Native | Add-on | Add-on |
| Experiment Tracking | Built-in | External | External |
| GPU Support | Native | Manual | Manual |
| Learning Curve | Medium | Low | Low |
| Best For | ML Pipelines | Data Pipelines | CI/CD |
Decision: Kubeflow chosen for native ML support and Kubernetes integration.
MLflow vs Alternatives
| Feature | MLflow | Weights & Biases | Neptune |
|---|---|---|---|
| Open Source | Yes | No | No |
| Self-Hosted | Yes | No | No |
| Cost | Free | $$$$ | $$$ |
| Model Registry | Yes | Yes | Yes |
| UI Quality | Good | Excellent | Good |
| Best For | Cost-Sensitive | Enterprise | Research |
Decision: MLflow chosen for cost ($0) and self-hosting capability.
DistilBERT vs Alternatives
| Model | Latency | Accuracy | Size |
|---|---|---|---|
| BERT-base | 800ms | 96% | 440MB |
| DistilBERT | 400ms | 94% | 250MB |
| BERT-tiny | 100ms | 88% | 17MB |
Decision: DistilBERT balances latency and accuracy for production.
Architecture Trade-offs
| Decision | Pros | Cons |
|---|---|---|
| Blue-Green (vs Canary) | Simple rollback, no traffic splitting | 2x resources during deploy |
| Redis Cache (vs in-memory) | Shared across pods, persistent | Network hop latency |
| PostgreSQL for MLflow | Reliability, ACID | Operational overhead |
| Kubernetes HPA (vs custom) | Built-in, well-tested | Natively limited to CPU/memory; custom metrics need an adapter |
Future Improvements
- Canary Deployments: Gradual rollout (10% → 50% → 100%)
- Feature Store: Feast for feature management
- A/B Testing: Model comparison in production
- Multi-Model Serving: Triton Inference Server
- Advanced Caching: Semantic similarity-based cache lookup