Difficulty: Introductory
Time Investment: 2-3 hours
Prerequisites: Basic understanding of DevOps and CI/CD
You might be asked:
The critical insight: ML systems are 10% machine learning and 90% good engineering practices. The skills you already have (infrastructure, DevOps, monitoring) are more valuable than knowing PyTorch.
Understanding MLOps principles helps you:
┌─────────────────────────────────┐
│ ML Engineering Effort │
├─────────────────────────────────┤
│ 10% → Model Training & Tuning │
│ 90% → Infrastructure & Ops │
└─────────────────────────────────┘
What the 90% includes:
Implication: A team with strong DevOps practices can adopt ML faster than a team with ML expertise but weak engineering discipline.
| Dimension | Research / Training | Production / Serving |
|---|---|---|
| Scaling | Vertical (bigger GPUs) | Horizontal (more instances) |
| Optimization | Accuracy | Latency & throughput |
| Hardware | GPU-heavy (A100, H100) | CPU-friendly (optimised for inference) |
| Cost Model | Burst usage (spin up to train, then release) | Always-on (serving 24/7) |
| Tooling | Jupyter, TensorBoard | Kubernetes, API gateways |
Architect’s Takeaway: Don’t design production systems based on research requirements. Training infrastructure ≠ serving infrastructure.
From Chip Huyen’s framework, a good ML system:
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Data Pipeline│ -> │ Feature Store│ -> │ Model Serving│
└──────────────┘ └──────────────┘ └──────────────┘
↓ ↓ ↓
(versioned) (cached) (monitored)
Each component should be independently deployable and testable.
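As a sketch of what "independently deployable" can look like for the serving component, here is a minimal HTTP wrapper around a versioned model artifact. FastAPI and joblib are assumptions, and the model path and feature names are placeholders, not part of any prescribed stack:

```python
# Minimal sketch of the "Model Serving" component as an independently
# deployable service. Assumes a scikit-learn model saved with joblib;
# the model path and feature names are illustrative placeholders.
from fastapi import FastAPI
from pydantic import BaseModel
import joblib

MODEL_PATH = "models/spam_clf-v3.joblib"  # versioned artifact from the registry
model = joblib.load(MODEL_PATH)

app = FastAPI()

class Features(BaseModel):
    # Hypothetical feature names; in practice they come from the feature store.
    num_links: int
    num_caps: int
    sender_reputation: float

@app.post("/predict")
def predict(features: Features) -> dict:
    row = [[features.num_links, features.num_caps, features.sender_reputation]]
    prediction = int(model.predict(row)[0])
    return {"prediction": prediction, "model_version": MODEL_PATH}
```

Because the service only depends on the artifact and its input schema, it can be tested and redeployed without touching the data pipeline or feature store.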
Why: Complex models are harder to debug, explain, and maintain.
graph TD
A[Raw Data] --> B[Data Validation]
B --> C[Feature Engineering]
C --> D[Feature Store]
D --> E[Model Training]
E --> F[Model Registry]
F --> G[Model Serving API]
G --> H[Prediction Logs]
H --> I[Monitoring & Alerts]
I -->|Drift Detected| J[Retrain Trigger]
J --> E
K[Live Traffic] --> G
G --> L[User Response]
Key Insight: ML systems are feedback loops. Unlike traditional software, ML models degrade over time as data patterns shift. You need continuous monitoring and retraining.
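To make the feedback loop concrete, here is a minimal drift check that compares recent prediction scores against a reference window and triggers retraining when the distributions diverge. The KS test, the threshold, and the retrain hook are illustrative choices, not a prescribed method:

```python
# Minimal sketch of the monitoring-and-retrain feedback loop: compare recent
# prediction scores against a reference window and flag drift. The threshold
# and the retraining hook are illustrative assumptions.
import numpy as np
from scipy.stats import ks_2samp

def drift_detected(reference_scores, recent_scores, p_threshold=0.01):
    """Two-sample KS test on prediction score distributions."""
    statistic, p_value = ks_2samp(reference_scores, recent_scores)
    return p_value < p_threshold

def monitor(reference_scores, recent_scores, trigger_retrain):
    if drift_detected(np.asarray(reference_scores), np.asarray(recent_scores)):
        trigger_retrain()  # e.g. enqueue a training job; implementation-specific

# Example: a shifted recent distribution should trip the check.
rng = np.random.default_rng(0)
monitor(rng.normal(0.3, 0.1, 5000), rng.normal(0.5, 0.1, 5000),
        trigger_retrain=lambda: print("drift detected: retraining triggered"))
```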
When to use: You want to focus on models, not infrastructure
Pros:
Cons:
Cost: ££ (pay-per-use can spike, making monthly costs harder to predict)
When to use: You have strong DevOps capability and need customization
Pros:
Cons:
When to use: Training is infrequent, but serving requirements are unique
Example: Train models on SageMaker, export to ONNX, and serve on a custom Kubernetes cluster (a minimal export-and-serve sketch follows this option's details)
Pros:
Cons:
Cost: ££ (training) + £-££ (serving infrastructure) + ££ (integration complexity)
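A minimal sketch of the hand-off the hybrid pattern relies on, assuming skl2onnx and onnxruntime are installed; the toy model and shapes are placeholders for whatever the managed platform actually produces:

```python
# Export a scikit-learn model to ONNX and run it with onnxruntime on CPU,
# i.e. what the custom serving cluster's pods would execute. Toy data and
# shapes are illustrative placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
import onnxruntime as ort

# Train anywhere (e.g. a managed platform); a toy model stands in here.
X = np.random.rand(200, 4).astype(np.float32)
y = (X[:, 0] > 0.5).astype(int)
model = LogisticRegression().fit(X, y)

# Export to ONNX: the serving side no longer needs scikit-learn at all.
onnx_model = convert_sklearn(model, initial_types=[("input", FloatTensorType([None, 4]))])
with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

# Inference with onnxruntime: CPU-friendly and framework-agnostic.
session = ort.InferenceSession("model.onnx")
input_name = session.get_inputs()[0].name
predictions = session.run(None, {input_name: X[:5]})[0]
print(predictions)
```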
Setup: Use a simple classification model (e.g., spam detection)
Task:
What to observe: How quickly does performance degrade? What causes drift (new slang, new attack patterns)?
Lesson: Models need continuous retraining, not just initial deployment.
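A possible starting point for this exercise, with tiny illustrative message lists standing in for real "old" and "new" traffic; the point is the measurement loop (score the same frozen model on successive time windows), not the toy data:

```python
# Train a simple spam classifier on "old" traffic, then evaluate the frozen
# model on a "new" window with unseen patterns (new slang, new attack styles).
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

old_messages = ["win a free prize now", "meeting at 3pm", "cheap meds online", "lunch tomorrow?"]
old_labels   = [1, 0, 1, 0]  # 1 = spam

new_messages = ["ur acct is locked click 2 verify", "project update attached",
                "crypto airdrop claim fast", "see you at standup"]
new_labels   = [1, 0, 1, 0]

model = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(old_messages, old_labels)

for window_name, msgs, labels in [("old window", old_messages, old_labels),
                                  ("new window", new_messages, new_labels)]:
    acc = accuracy_score(labels, model.predict(msgs))
    print(f"{window_name}: accuracy = {acc:.2f}")
```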
Setup: Compare training costs across platforms
Task:
What to observe: Where’s the cost difference? (Data transfer, GPU pricing, managed service fees)
Lesson: TCO includes operational overhead, not just compute cost.
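One rough way to frame the comparison is a simple TCO formula; every number below is an illustrative placeholder, not real vendor pricing:

```python
# Sketch of a TCO comparison: managed-service fees, data transfer, and
# engineer time belong in the calculation, not just raw compute.
# All figures are placeholders, not real pricing.
def monthly_tco(compute, data_transfer, managed_fee, ops_hours, hourly_rate):
    return compute + data_transfer + managed_fee + ops_hours * hourly_rate

managed     = monthly_tco(compute=800, data_transfer=50, managed_fee=400, ops_hours=10, hourly_rate=60)
self_hosted = monthly_tco(compute=600, data_transfer=20, managed_fee=0,   ops_hours=60, hourly_rate=60)
print(f"managed: £{managed}, self-hosted: £{self_hosted}")
```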
From Chip Huyen’s talk, the recommended progression:
1. Simple Models (Start Here)
- Logistic regression, decision trees
- Easy to debug and explain
- Fast to deploy
2. Optimised Simple Models
- Better feature engineering
- More training data
- Hyperparameter tuning
3. Complex Models (Only If Needed)
- Deep learning, ensembles
- When simple models hit a ceiling
- Higher operational cost
Anti-pattern: Starting with a complex model (e.g., transformer) when a simple rule-based system would work.
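A minimal sketch of the "start simple" discipline: establish a trivial baseline first, then check whether a simple model beats it before considering anything heavier. The dataset and metric here are illustrative:

```python
# Compare a majority-class baseline against a simple model; only move to
# step 3 (deep learning, ensembles) if this gap stops closing.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, model in [("baseline (majority class)", DummyClassifier(strategy="most_frequent")),
                    ("simple (logistic regression)", LogisticRegression(max_iter=1000))]:
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))
```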
Problem: “Garbage in, garbage out” is 10x worse in ML.
Solution: Invest in data validation, lineage tracking, and automated checks.
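A minimal sketch of such automated checks, using plain pandas assertions rather than a dedicated validation library; the column names and thresholds are assumptions about the dataset:

```python
# Fail the pipeline early when incoming data violates basic expectations.
import pandas as pd

def validate(df: pd.DataFrame) -> list[str]:
    errors = []
    required = {"user_id", "amount", "timestamp"}  # hypothetical schema
    missing = required - set(df.columns)
    if missing:
        errors.append(f"missing columns: {missing}")
    if "amount" in df.columns and (df["amount"] < 0).any():
        errors.append("negative amounts found")
    if df.isna().mean().max() > 0.05:
        errors.append("more than 5% nulls in at least one column")
    return errors

df = pd.DataFrame({"user_id": [1, 2], "amount": [10.0, -3.0], "timestamp": ["2024-01-01", None]})
problems = validate(df)
if problems:
    # In a real pipeline this is where you stop, before training or scoring.
    print("data validation failed:", problems)
```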
Problem: Model accuracy degrades silently in production.
Solution: Track prediction distribution, feature drift, and business metrics.
Problem: Deploying Kubernetes + Kubeflow for a model that runs once a day.
Solution: Start simple (cron job + SQLite), scale when needed.
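A sketch of what that simple version might look like, assuming a joblib-saved model and a hypothetical `new_rows` table; paths and schema are placeholders:

```python
# A script run once a day by cron: load a saved model, score new rows,
# and write predictions to SQLite. No cluster required.
import sqlite3
import joblib
import pandas as pd

def run_daily_scoring(db_path="predictions.db", model_path="model.joblib"):
    model = joblib.load(model_path)
    conn = sqlite3.connect(db_path)
    # Assumes an existing table of unscored rows; schema is hypothetical.
    rows = pd.read_sql("SELECT id, f1, f2, f3 FROM new_rows", conn)
    rows["prediction"] = model.predict(rows[["f1", "f2", "f3"]])
    rows[["id", "prediction"]].to_sql("predictions", conn, if_exists="append", index=False)
    conn.close()

# Invoked by cron, e.g.: 0 6 * * * /usr/bin/python3 score.py
if __name__ == "__main__":
    run_daily_scoring()
```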
Problem: “We deployed it, we’re done.”
Solution: ML requires continuous integration (retraining, A/B testing, monitoring).
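One piece of that continuous loop, sketched below: deterministic, hash-based traffic splitting between a champion and a challenger model. The split ratio and the model placeholders are illustrative:

```python
# Route a fixed share of traffic to a challenger model for A/B comparison.
import hashlib

def choose_model(user_id: str, champion, challenger, challenger_share=0.1):
    # Deterministic assignment so a given user always sees the same variant.
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    if bucket < challenger_share * 100:
        return "challenger", challenger
    return "champion", champion

variant, model = choose_model("user-42", champion="model_v3", challenger="model_v4")
print(variant, model)  # log the variant with each prediction for later comparison
```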
ML is 90% engineering. If your team has strong DevOps practices, you can adopt ML without hiring a team of PhDs.
When evaluating ML initiatives, ask:
The opportunity: Many organisations overcomplicate ML. If you apply good engineering discipline, you can deliver ML systems faster and cheaper than teams that focus only on model accuracy.
Next steps: Bookmark Made With ML for hands-on learning if you need to build production ML systems.