
MLOps Principles & ML System Design

Difficulty: Introductory | Time Investment: 2-3 hours | Prerequisites: Basic understanding of DevOps and CI/CD


Learning Resources (Start Here)

Primary Video

Chip Huyen's talk on designing machine learning systems, which this section draws on throughout.

Deep Dive Learning Path

Made With ML, recommended in the next steps at the end of this section, for hands-on production ML.


Why This Matters

As an architect, you might be asked to evaluate an ML initiative, design the infrastructure that serves it, or estimate what it will really cost.

The critical insight: ML systems are 10% machine learning and 90% good engineering practices. The skills you already have (infrastructure, DevOps, monitoring) are more valuable than knowing PyTorch.

Understanding MLOps principles helps you:

  - Evaluate build vs. buy vs. partner decisions
  - Design serving infrastructure separately from training infrastructure
  - Spot the common pitfalls (data quality, missing monitoring, overengineering) before they reach production


Key Concepts

The 10/90 Rule

┌─────────────────────────────────┐
│  ML Engineering Effort          │
├─────────────────────────────────┤
│  10% → Model Training & Tuning  │
│  90% → Infrastructure & Ops     │
└─────────────────────────────────┘

What the 90% includes:

  - Data pipelines and validation
  - Feature engineering and storage
  - Serving infrastructure and APIs
  - Monitoring, alerting, and drift detection
  - Retraining triggers and versioning

Implication: A team with strong DevOps practices can adopt ML faster than a team with ML expertise but weak engineering discipline.


Research vs. Production Infrastructure

| Dimension    | Research / Training               | Production / Serving                   |
|--------------|-----------------------------------|----------------------------------------|
| Scaling      | Vertical (bigger GPUs)            | Horizontal (more instances)            |
| Optimisation | Accuracy                          | Latency & throughput                   |
| Hardware     | GPU-heavy (A100, H100)            | CPU-friendly (optimised for inference) |
| Cost Model   | Burst usage (train once, discard) | Always-on (serving 24/7)               |
| Tooling      | Jupyter, TensorBoard              | Kubernetes, API gateways               |

Architect’s Takeaway: Don’t design production systems based on research requirements. Training infrastructure ≠ serving infrastructure.


Principles of Good ML Systems Design

From Chip Huyen’s framework, a good ML system:

1. Solves a Real Problem

2. Is Tested

3. Is Accessible to Users

4. Is Ethical

5. Has Modular, Integrated Components

┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│ Data Pipeline│ -> │ Feature Store│ -> │ Model Serving│
└──────────────┘    └──────────────┘    └──────────────┘
       ↓                    ↓                    ↓
  (versioned)          (cached)            (monitored)

Each component should be independently deployable and testable.
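To make the modularity concrete, here is a minimal Python sketch of three stages that communicate only through plain data, so each can be tested and deployed on its own. The names and in-memory storage are illustrative, not a prescribed design:

```python
# Minimal sketch of decoupled pipeline stages; names and storage choices
# are illustrative assumptions, not a prescribed implementation.
from dataclasses import dataclass

@dataclass
class FeatureRow:
    user_id: str
    features: dict

def build_features(raw_records: list[dict]) -> list[FeatureRow]:
    """Data pipeline stage: a pure function, easy to unit-test in isolation."""
    return [FeatureRow(r["user_id"], {"msg_len": len(r["text"])}) for r in raw_records]

class FeatureStore:
    """Feature store stage: caches features behind a stable interface."""
    def __init__(self):
        self._cache: dict[str, dict] = {}
    def put(self, rows: list[FeatureRow]) -> None:
        for row in rows:
            self._cache[row.user_id] = row.features
    def get(self, user_id: str) -> dict:
        return self._cache[user_id]

def serve(store: FeatureStore, model, user_id: str) -> float:
    """Serving stage: depends only on the store's interface, not the pipeline."""
    return model.predict(store.get(user_id))
```

Because the stages share only plain data and narrow interfaces, you can swap the in-memory cache for a real feature store without touching the pipeline or the serving code.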

6. Is As Simple As Possible

Why: Complex models are harder to debug, explain, and maintain.

7. Is Transparent

8. Allows for Continuous Integration

9. Is Versioned

10. Is Documented


How It Works: ML System Architecture

```mermaid
graph TD
    A[Raw Data] --> B[Data Validation]
    B --> C[Feature Engineering]
    C --> D[Feature Store]
    D --> E[Model Training]
    E --> F[Model Registry]
    F --> G[Model Serving API]
    G --> H[Prediction Logs]
    H --> I[Monitoring & Alerts]
    I -->|Drift Detected| J[Retrain Trigger]
    J --> E

    K[Live Traffic] --> G
    G --> L[User Response]
```

Key Insight: ML systems are feedback loops. Unlike traditional software, ML models degrade over time as data patterns shift. You need continuous monitoring and retraining.
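A minimal sketch of that feedback loop, assuming a scipy-based drift check and a hypothetical `retrain_fn` hook (the alpha threshold is an illustrative choice):

```python
# Sketch of the monitoring -> retrain feedback loop from the diagram above.
# The threshold and retrain hook are illustrative assumptions.
import numpy as np
from scipy.stats import ks_2samp

def drift_detected(training_feature: np.ndarray,
                   live_feature: np.ndarray,
                   alpha: float = 0.01) -> bool:
    """Two-sample KS test: has the live distribution shifted from training?"""
    _, p_value = ks_2samp(training_feature, live_feature)
    return p_value < alpha

def monitoring_cycle(training_feature, live_feature, retrain_fn):
    """One pass of the loop: monitor, and trigger retraining on drift."""
    if drift_detected(np.asarray(training_feature), np.asarray(live_feature)):
        retrain_fn()  # closes the loop back to model training
```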


Common Approaches (AI Generated)

Approach 1: Managed ML Platforms (AWS SageMaker, Azure ML, GCP Vertex AI)

When to use: You want to focus on models, not infrastructure

Pros:

  - Fully managed infrastructure: no clusters to build or patch
  - Integrated tooling for training, deployment, and monitoring
  - Fast to get started

Cons:

  - Vendor lock-in
  - Less control over the serving stack
  - Usage-based pricing that is hard to forecast

Cost: ££ (pay-per-use can spike, harder to predict monthly costs)


Approach 2: Open-Source MLOps Stack (MLflow, Kubeflow, Airflow)

When to use: You have strong DevOps capability and need customization

Pros:

  - No licence fees and no vendor lock-in
  - Full control and customisation of every component
  - Portable across clouds and on-premises

Cons:

  - You build, secure, and upgrade everything yourself
  - Requires sustained engineering investment to keep running

Cost: £ (compute) + £££ (engineering time to build/maintain)

Approach 3: Hybrid (Managed Training + Custom Serving)

When to use: Training is infrequent, but serving requirements are unique

Example: Train models on SageMaker, export to ONNX, serve on custom Kubernetes cluster

Pros:

  - Managed convenience for the infrequent, expensive part (training)
  - Full control over the latency-sensitive part (serving)

Cons:

  - Two stacks to operate and keep in sync
  - The export/conversion step adds a failure point

Cost: ££ (training) + £-££ (serving infrastructure) + ££ (integration complexity)
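As a rough illustration of the hybrid export step, the sketch below trains a scikit-learn model, converts it with skl2onnx, and serves it with onnxruntime; the data and feature count are placeholders:

```python
# Sketch of the hybrid approach's export step: train anywhere, convert to
# ONNX, serve with a CPU-only runtime. Data and feature count are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
import onnxruntime as ort

# Train (this part could run on SageMaker or any managed platform).
X = np.random.rand(200, 4).astype(np.float32)
y = np.random.randint(0, 2, 200)
model = LogisticRegression().fit(X, y)

# Export to the portable ONNX format.
onnx_model = convert_sklearn(
    model, initial_types=[("input", FloatTensorType([None, 4]))])

# Serve with onnxruntime -- no sklearn or training stack needed at runtime.
session = ort.InferenceSession(onnx_model.SerializeToString(),
                               providers=["CPUExecutionProvider"])
predictions = session.run(None, {"input": X[:5]})[0]
```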


Try It Yourself (AI Generated)

Experiment 1: ML Model Degradation

Setup: Use a simple classification model (e.g., spam detection)

Task:

  1. Train model on 2023 data
  2. Test on 2024 data
  3. Observe accuracy drop

What to observe: How quickly does performance degrade? What causes drift (new slang, new attack patterns)?

Lesson: Models need continuous retraining, not just initial deployment.
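A sketch of the experiment, assuming two hypothetical CSVs of labelled messages (spam_2023.csv and spam_2024.csv, each with text and is_spam columns):

```python
# Sketch of Experiment 1. The CSV files and column names are hypothetical.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

train = pd.read_csv("spam_2023.csv")   # columns: text, is_spam
test = pd.read_csv("spam_2024.csv")

vec = TfidfVectorizer(max_features=5000)
X_train = vec.fit_transform(train["text"])
X_test = vec.transform(test["text"])   # note: same vocabulary as 2023

model = LogisticRegression(max_iter=1000).fit(X_train, train["is_spam"])

print("2023 accuracy:", accuracy_score(train["is_spam"], model.predict(X_train)))
print("2024 accuracy:", accuracy_score(test["is_spam"], model.predict(X_test)))
# New slang and attack patterns in 2024 fall outside the 2023 vocabulary,
# which is one concrete mechanism of the drift described above.
```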

Experiment 2: Cost Analysis

Setup: Compare training costs across platforms

Task:

  1. Estimate cost to train a model on SageMaker
  2. Estimate cost on GCP Vertex AI
  3. Estimate cost on self-managed EC2/GCE instances

What to observe: Where’s the cost difference? (Data transfer, GPU pricing, managed service fees)

Lesson: TCO includes operational overhead, not just compute cost.
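A back-of-envelope sketch of the comparison; every rate below is a placeholder to be replaced with current platform pricing, since only the cost structure matters here:

```python
# Back-of-envelope TCO sketch for Experiment 2. All rates are PLACEHOLDERS --
# look up current pricing for each platform; only the structure matters.
def training_tco(gpu_hours: float, gpu_rate: float,
                 data_gb: float, egress_rate: float,
                 managed_markup: float = 0.0,
                 ops_hours: float = 0.0, eng_rate: float = 0.0) -> float:
    """Compute + data transfer + managed-service markup + engineering time."""
    compute = gpu_hours * gpu_rate * (1 + managed_markup)
    transfer = data_gb * egress_rate
    ops = ops_hours * eng_rate
    return compute + transfer + ops

# Managed platform: markup on compute, little ops time.
managed = training_tco(gpu_hours=40, gpu_rate=4.0, data_gb=100,
                       egress_rate=0.09, managed_markup=0.2)
# Self-managed instances: raw compute, plus engineering time to run it.
self_managed = training_tco(gpu_hours=40, gpu_rate=4.0, data_gb=100,
                            egress_rate=0.09, ops_hours=10, eng_rate=80)
print(f"managed ~ {managed:.0f}, self-managed ~ {self_managed:.0f}")
```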


ML Adoption Strategy

From Chip Huyen’s talk, the recommended progression:

1. Simple Models (Start Here)
   - Logistic regression, decision trees
   - Easy to debug and explain
   - Fast to deploy

2. Optimised Simple Models
   - Better feature engineering
   - More training data
   - Hyperparameter tuning

3. Complex Models (Only If Needed)
   - Deep learning, ensembles
   - When simple models hit a ceiling
   - Higher operational cost

Anti-pattern: Starting with a complex model (e.g., transformer) when a simple rule-based system would work.
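A toy sketch of why you start at step 1: check whether a rule-based baseline already clears the bar before paying for a model at all (the rule and the four-message dataset are purely illustrative):

```python
# Sketch of the progression's first rung: rule-based baseline vs simple model.
# The rule and the tiny dataset are illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

texts = ["win a free prize now", "meeting at 10am",
         "free offer click here", "lunch tomorrow?"]
labels = [1, 0, 1, 0]

# Step 0: rule-based baseline -- no model, nothing to retrain or monitor.
rule_preds = [1 if "free" in t else 0 for t in texts]
print("rules:", accuracy_score(labels, rule_preds))

# Step 1: simple model -- only worth its operational cost if it beats the rules.
X = CountVectorizer().fit_transform(texts)
model_preds = LogisticRegression().fit(X, labels).predict(X)
print("model:", accuracy_score(labels, model_preds))
```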


Common Pitfalls

Pitfall 1: Ignoring Data Quality

Problem: “Garbage in, garbage out” is 10x worse in ML
Solution: Invest in data validation, lineage tracking, and automated checks
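A minimal sketch of such automated checks at pipeline entry, for a hypothetical transactions feed (the schema and thresholds are illustrative):

```python
# Sketch of automated data checks at pipeline entry; schema, columns, and
# thresholds are illustrative assumptions for a hypothetical feed.
import pandas as pd

EXPECTED_COLUMNS = {"user_id", "amount", "timestamp"}

def validate(df: pd.DataFrame) -> pd.DataFrame:
    """Fail fast on bad data instead of silently training on it."""
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {missing}")
    if df["amount"].lt(0).any():
        raise ValueError("negative amounts found")
    null_rate = df["user_id"].isna().mean()
    if null_rate > 0.01:
        raise ValueError(f"user_id null rate too high: {null_rate:.1%}")
    return df
```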

Pitfall 2: No Monitoring

Problem: Model accuracy degrades silently in production
Solution: Track prediction distribution, feature drift, and business metrics
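One common way to track prediction distributions is the Population Stability Index (PSI); a sketch below uses a rule-of-thumb alert threshold of 0.2 and synthetic score distributions in place of real logs:

```python
# Sketch of prediction-distribution monitoring via the Population Stability
# Index. The 0.2 alert threshold is a common rule of thumb; data is synthetic.
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """PSI between a baseline score distribution and a live one."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Clip to avoid log(0) on empty bins.
    e_pct, a_pct = np.clip(e_pct, 1e-6, None), np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

baseline_scores = np.random.beta(2, 5, 10_000)   # scores at deployment time
live_scores = np.random.beta(3, 4, 10_000)       # scores this week
if psi(baseline_scores, live_scores) > 0.2:
    print("alert: prediction distribution has shifted")
```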

Pitfall 3: Overengineering

Problem: Deploying Kubernetes + Kubeflow for a model that runs once a day
Solution: Start simple (cron job + SQLite), scale when needed
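A sketch of that simple starting point: a batch job, runnable from cron, that scores a batch and logs predictions to SQLite (the table schema and the predict_one method are illustrative):

```python
# Sketch of the "start simple" pattern: a daily batch job logging to SQLite.
# The table schema and the model's predict_one method are illustrative.
import sqlite3
import datetime

def run_daily_batch(model, batch):
    conn = sqlite3.connect("predictions.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS predictions
                    (ts TEXT, input_id TEXT, score REAL)""")
    now = datetime.datetime.now(datetime.timezone.utc).isoformat()
    rows = [(now, item_id, float(model.predict_one(features)))
            for item_id, features in batch]
    conn.executemany("INSERT INTO predictions VALUES (?, ?, ?)", rows)
    conn.commit()
    conn.close()

# Schedule with cron, e.g.:  0 6 * * *  python run_batch.py
```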

Pitfall 4: Treating ML Like Traditional Software

Problem: “We deployed it, we’re done”
Solution: ML requires continuous integration (retraining, A/B testing, monitoring)



Key Takeaway

ML is 90% engineering. If your team has strong DevOps practices, you can adopt ML without hiring a team of PhDs.

When evaluating ML initiatives, ask:

  1. Do we have clean, versioned data?
  2. Can we monitor and retrain models automatically?
  3. Is the ROI worth the operational overhead?
  4. Should we build, buy, or partner?

The opportunity: Many organisations overcomplicate ML. If you apply good engineering discipline, you can deliver ML systems faster and cheaper than teams that focus only on model accuracy.

Next steps: Bookmark Made With ML for hands-on learning if you need to build production ML systems.