Difficulty: Advanced
Time Investment: 3-4 hours
Prerequisites: Understanding of ADRs, agent skills, code review processes
The Comprehension Gap: AI coding agents (Cursor, Copilot, Claude Code) enable unprecedented velocity, but they create a systemic risk:
Engineers merge complex AI-generated code they don’t fully understand.
This leads to three critical problems:
- Unmaintainable “black box” systems that accumulate technical debt at scale
- Design intent that is AI-inferred rather than human-authored
- No defensible “audit trail” of why the code looks the way it does
For public sector and regulated industries, this is especially critical—software is maintained for decades, and the “audit trail of intent” is mandatory.
Definition: The delta between code velocity (how fast AI generates code) and human comprehension (how well engineers understand it).
┌─────────────────────────────────────┐
│ AI Code Generation Velocity │ ← Exponentially increasing
├─────────────────────────────────────┤
│ │
│ ← COMPREHENSION GAP → │ ← Growing over time
│ │
├─────────────────────────────────────┤
│ Human Comprehension Capacity │ ← Relatively constant
└─────────────────────────────────────┘
Symptoms:
- Engineers approve complex AI-generated diffs they can’t explain
- Reviews catch syntax issues but never test understanding
- Design intent lives in AI chat history rather than a human-authored record
The Golden Path. Definition: The enterprise-blessed, standardised approach to common problems.
Examples:
- “`S3Wrapper`, never raw `boto3`”
- “`DBConnection` pool, not direct connections”
- “`AuthMiddleware`, not custom JWT parsing”

Why AI violates it: LLMs are trained on generic code from the internet, not your organisation’s specific patterns.
The Architectural Drift Index (ADI) quantifies how much a code change deviates from the planned architecture (the “Golden Path”).
Formula:
ADI = (Σ Vi × Si) / T

Where:
- Vi = an individual architectural violation (counts as 1 when detected)
- Si = severity weight for that violation (0.0 to 1.0)
- T = total architectural touchpoints in the PR

Result: a score from 0 to 1, read as a percentage of drift (0% = perfect alignment, 100% = complete divergence).
Scenario: PR adds an S3 file upload feature
Planned Architecture (from ADR):
- `S3Wrapper` for all S3 operations

AI-Generated Code Violations:
| Violation (Vi) | Description | Severity (Si) |
|---|---|---|
| V1: Direct Import | Used raw `boto3` instead of `S3Wrapper` | 0.8 (High Risk) |
| V2: Missing Log | No audit event on upload | 1.0 (Critical) |
| V3: Local Pattern | Custom error handler instead of global | 0.2 (Low Risk) |
Total Architectural Touchpoints (T): 10 (number of relevant rules/patterns)
Calculation:
Weighted Sum = (1 × 0.8) + (1 × 1.0) + (1 × 0.2) = 2.0
ADI = 2.0 / 10 = 0.20 (20%)
| ADI Range | Status | Action |
|---|---|---|
| 0.00 - 0.05 | 🟢 Green (Low Drift) | Auto-approve or light review |
| 0.06 - 0.15 | 🟡 Yellow (Moderate Drift) | Trigger Socratic Review (ask “why?”) |
| > 0.15 | 🔴 Red (Critical Drift) | Block merge, require manual architect review |
Example: ADI = 0.20 → 🔴 Red → PR blocked until violations are addressed.
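To make the arithmetic concrete, here is a minimal Python sketch of the calculation and the traffic-light mapping above. The `Violation` class and function names are illustrative, not part of any existing tool:

```python
# Minimal ADI calculator sketch; severities (Si) come from your own
# calibrated weights, and T is the count of architectural touchpoints.
from dataclasses import dataclass

@dataclass
class Violation:
    name: str
    severity: float  # Si, from 0.0 (cosmetic) to 1.0 (critical)

def calculate_adi(violations: list[Violation], touchpoints: int) -> float:
    """ADI = (Σ Vi × Si) / T, with each detected violation counting once."""
    if touchpoints <= 0:
        raise ValueError("touchpoints (T) must be positive")
    return sum(v.severity for v in violations) / touchpoints

def classify(adi: float) -> str:
    """Map a score onto the Green/Yellow/Red thresholds."""
    if adi <= 0.05:
        return "Green: auto-approve or light review"
    if adi <= 0.15:
        return "Yellow: trigger Socratic review"
    return "Red: block merge, require architect review"

# The S3 upload example above:
violations = [
    Violation("V1: direct boto3 import", 0.8),
    Violation("V2: missing audit log", 1.0),
    Violation("V3: local error handler", 0.2),
]
adi = calculate_adi(violations, touchpoints=10)
print(f"ADI = {adi:.2f} -> {classify(adi)}")  # ADI = 0.20 -> Red: ...
```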
The severity weights (Si) in the ADI formula are starting points, not universal truths. Calibrate them to your organisation’s priorities and risk tolerance.
Example calibration rationale:
| Violation Type | Severity (Si) | Justification |
|---|---|---|
| Security violations (missing auth, exposed secrets) | 1.0 (Critical) | One breach can cost £M+ in incident response, regulatory fines |
| Performance degradation (N+1 queries, memory leaks) | 0.7-0.8 (High) | Impacts user experience, harder to fix later, compounds over time |
| Golden Path deviations (wrong library, pattern) | 0.5-0.6 (Medium) | Increases maintenance burden, fragments knowledge |
| Code style violations (naming, formatting) | 0.1-0.2 (Low) | Low business impact, easy to fix, mostly aesthetic |
How to calibrate:
- Record your choices in a `severity-weights.md` file explaining why each weight was chosen.

Important: This is a heuristic, not a science. The value is in consistent measurement, not perfect weights. One team using 0.8 for security violations versus another using 1.0 matters less than each team measuring consistently.
Start simple: Use three tiers (Critical=1.0, High=0.7, Low=0.2) and refine over time.
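As a concrete starting point, a three-tier `severity-weights.json` (the file the `drift-detector` skill below references) might look like the following; the category names and examples are illustrative:

```json
{
  "critical": {"weight": 1.0, "examples": ["missing auth", "exposed secrets", "no PII redaction"]},
  "high":     {"weight": 0.7, "examples": ["N+1 queries", "memory leaks", "missing audit log"]},
  "low":      {"weight": 0.2, "examples": ["naming", "formatting", "local style deviations"]}
}
```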
While ADI helps maintain architectural consistency, it’s not appropriate for all teams and situations:
Skip ADI if:
- You’re a small team without a documented Golden Path to measure against
- You’re building prototypes or short-lived code where drift carries little cost
- The overhead of scoring every PR would outweigh the architectural risk

ADI works best when:
- Architectural standards are written down (ADRs, `golden-path.json`)
- Multiple teams contribute to a long-lived, shared codebase
- You face compliance or audit requirements that demand a trail of intent
Alternative for smaller teams: Use “Comprehension Verification” (the check-mate) without formal ADI calculation. Simply ask: “Can you explain why this code works?” before merge.
MVA (Minimum Viable Architecture): A lightweight, context-first scoping process that captures architectural intent before code is written.
Purpose: Ensure the “System of Record” for design intent is human-authored, not AI-inferred.
graph TD
A[Feature Request] --> B[Socratic Scoping]
B --> C[Context Capture]
C --> D[Intent Snapshot]
D --> E[Agent Skill References Intent]
E --> F[Code Generation]
F --> G[Drift Detection]
G --> H{ADI Check}
H -->|Low Drift| I[Merge]
H -->|High Drift| J[Comprehension Verification]
J --> K{Engineer Proves Understanding?}
K -->|Yes| I
K -->|No| L[Reject, Request Rework]
Instead of accepting requirements at face value, use Socratic questioning to extract architectural context.
Example Dialogue: A conversation between a Socratic AI Architect Agent and a user
This log, taken from an interaction with Claude Code using the architecture-context-first skill, demonstrates how effective the method is at maximising the work not done in the first iteration. A project that could have taken many months to deliver, bogged down in hard requirements, gets stripped back to the essentials, enabling earlier delivery that validates critical user and business needs before more time is invested in unvalidated features.
What it is: An AI-generated “Source of Truth” file that captures key architectural decisions for a project.
Example (intent-snapshot.md):
# Feature: API Caching Layer
## Context
- Problem: API response time averaging 800ms (target: <200ms)
- Root cause: Repeated DB queries for rarely-changing data
- Business impact: User churn on mobile (slow UX)
## Architectural Decisions
- **Technology**: Redis (already in stack, team familiar)
- **Cache Strategy**: Cache-aside (lazy loading)
- **TTL**: 5 minutes (acceptable staleness per product team)
- **Fallback**: On cache miss, fetch from DB (no error to user)
- **Eviction**: LRU (Least Recently Used)
## Golden Path Alignment
✅ Uses `RedisWrapper` (not raw `redis-py`)
✅ Triggers `CacheMetrics` event on hit/miss
✅ Uses global `ErrorHandler` for Redis failures
## Success Criteria
- P95 latency < 200ms
- Cache hit rate > 70%
- Zero user-facing errors from cache failures
## Constraints
- Must work offline (mobile app requirement)
- No PII in cache (GDPR compliance)
How it’s used: Agents equipped with Skills reference this file when generating code, keeping output aligned with the captured intent.
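As a sketch of how a skill might consume the snapshot, the following pulls the Golden Path checklist out of `intent-snapshot.md` so a drift detector can compare it against generated code. The section name follows the example above; the function itself is hypothetical:

```python
# Hypothetical intent parser: extract the "Golden Path Alignment"
# checklist from intent-snapshot.md for downstream drift checks.
import re
from pathlib import Path

def parse_golden_path(snapshot_path: str) -> list[str]:
    """Return checklist items under the '## Golden Path Alignment' heading."""
    text = Path(snapshot_path).read_text(encoding="utf-8")
    section = re.search(r"## Golden Path Alignment\n(.*?)(?=\n## |\Z)", text, re.S)
    if not section:
        return []
    # Drop the leading checkmarks and surrounding whitespace.
    return [line.lstrip("✅✔ ").strip()
            for line in section.group(1).splitlines() if line.strip()]

print(parse_golden_path("intent-snapshot.md"))
# ['Uses `RedisWrapper` (not raw `redis-py`)', ...]
```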
Traditional code review asks: “Does this code work?”
Better question: “Does the engineer understand why it works?”
Pattern: Before merge, the engineer must answer a high-level logic question about their code.
Not this (syntax check):
Reviewer: "You're missing a null check on line 47"
This (comprehension check):
Reviewer: "If the auth service returns a 403 here, how does your retry logic prevent an infinite loop?"
Gate logic: a correct explanation unlocks the merge; a wrong or missing one sends the PR back for rework, mirroring the comprehension-verification branch in the workflow diagram above.
OPA (Open Policy Agent): A policy engine that enforces rules in code, not just documentation.
Use case: Prevent AI from introducing non-compliant patterns.
Rule: “All S3 access must use S3Wrapper”
OPA Policy (s3-policy.rego):
package main

deny[msg] {
# Code imports boto3 directly...
input.imports[_] == "boto3"
# ...and S3Wrapper is not imported anywhere.
not imports_s3wrapper
msg := "Direct boto3 import detected. Use S3Wrapper instead."
}

# Helper rule: true if S3Wrapper appears in the import list.
# (Negating an expression containing a wildcard is unsafe in Rego,
# so the check is factored into its own rule.)
imports_s3wrapper {
input.imports[_] == "S3Wrapper"
}

deny[msg] {
# Check for raw boto3.client('s3') calls in the source text
contains(input.code, "boto3.client('s3')")
msg := "Raw S3 client detected. Use S3Wrapper.get_client() instead."
}
How it works: the CI pipeline extracts each PR’s imports and source text into an input document, evaluates it against the policy with `opa eval`, and blocks the merge if any `deny` rule fires.
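A minimal sketch of that CI step, assuming the policy above is saved as `s3-policy.rego` and the PR contains Python files; the AST-based input builder and file paths are illustrative, not part of OPA:

```python
# Hypothetical CI step: build the OPA input document from a Python
# file's AST, then evaluate s3-policy.rego against it via the opa CLI.
import ast
import json
import subprocess
import sys

def build_opa_input(path: str) -> dict:
    """Collect imported module names and raw source for the policy."""
    source = open(path, encoding="utf-8").read()
    tree = ast.parse(source)
    imports = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            imports.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            imports.append(node.module)
    return {"imports": imports, "code": source}

result = subprocess.run(
    ["opa", "eval", "--format", "json", "--stdin-input",
     "-d", "s3-policy.rego", "data.main.deny"],
    input=json.dumps(build_opa_input(sys.argv[1])),
    capture_output=True, text=True, check=True,
)
denials = json.loads(result.stdout)["result"][0]["expressions"][0]["value"]
if denials:
    print("\n".join(denials))
    sys.exit(1)  # non-empty deny set fails the build
```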
Problem: Hard to see how a change impacts system structure.
Solution: Generate a real-time graph comparing MVA intent vs. actual implementation.
Example:
Planned (from MVA):
[API] → [S3Wrapper] → [S3]
Actual (from code analysis):
[API] → [boto3] → [S3] ← VIOLATION
Tooling: madge or dependency-cruiser (for JavaScript/TypeScript codebases), or custom AST parsers for other languages, as sketched below.
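For a Python codebase, the custom-AST-parser route can be small. A sketch that compares planned edges (from the MVA) against the imports actually present; the edge format and module names are illustrative:

```python
# Hypothetical drift visualiser: diff planned dependency edges
# against the imports actually found in a module's source.
import ast

PLANNED_EDGES = {("API", "S3Wrapper"), ("S3Wrapper", "S3")}

def actual_edges(module_name: str, source: str) -> set[tuple[str, str]]:
    """Extract (module -> imported name) edges from Python source."""
    edges = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            edges |= {(module_name, alias.name) for alias in node.names}
        elif isinstance(node, ast.ImportFrom) and node.module:
            edges.add((module_name, node.module))
    return edges

api_source = "import boto3\n"  # stand-in for the API module
for src, dst in actual_edges("API", api_source) - PLANNED_EDGES:
    print(f"[{src}] -> [{dst}]  <- VIOLATION (not in planned architecture)")
```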
Example skill: `drift-detector`

Structure:
/drift-detector/
├── SKILL.md
├── references/
│ ├── golden-path.json # Approved patterns
│ └── severity-weights.json # Si values for violations
└── scripts/
├── calculate_adi.py # Compute ADI from AST
├── parse_intent.py # Extract MVA from intent-snapshot.md
└── visualise_drift.py # Generate dependency graph
Step 1: Engineer requests feature
Engineer: "Add user export to CSV feature"
Step 2: Skill triggers Socratic scoping
Skill: "Before we code, let's capture context:
1. What data fields should be exported?
2. Are there privacy concerns (GDPR)?
3. What's the expected file size?
4. Who will consume this CSV (internal/external)?
5. How does this align with our export strategy?"
Step 3: Intent snapshot created
intent-snapshot.md generated with answers
Step 4: AI generates code
Engineer: "Generate CSV export code"
AI: [generates 200 lines]
Step 5: Skill calculates ADI
Skill: Running drift detection...
- V1: Used raw `csv` library instead of `CSVExporter` (Si=0.6)
- V2: No PII redaction (Si=1.0)
- V3: No file size limit (Si=0.8)
ADI = 0.24 (24%) → 🔴 Critical Drift
Step 6: Comprehension check triggered
Skill: "Before merging, please answer:
If a user's data contains 1M rows, how does your code prevent memory overflow?"
Engineer: "Uh, it loads all rows into memory... that's a problem."
Skill: "Correct. The Golden Path uses streaming CSVExporter. Please refactor."
Task: Review your most recent PR
Steps:
1. List the architectural rules or patterns the PR touches (your T).
2. Note each deviation from your Golden Path and assign it a severity (Si).
3. Compute ADI = Σ Si / T and compare it against the thresholds above.
Observe: Would your PR have been blocked? What’s the threshold?
Task: Take a vague feature request
Example: “Add notifications to the app”
Practice Socratic questions:
- Who receives notifications, and through which channels (push, email, in-app)?
- What events trigger them, and how often can they fire?
- Are there privacy, quiet-hours, or rate-limit constraints?
- Who consumes the notification data downstream?
- How does this align with our existing messaging strategy?
Observe: How much clearer is the requirement after 5-10 questions?
Task: Write a simple policy for your codebase
Example: “All API endpoints must have rate limiting”
OPA Policy:
package main
deny[msg] {
input.route_definition
not input.route_definition.rate_limit
msg := "API endpoint missing rate_limit configuration"
}
Test: Run against your codebase, see violations
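To see the policy fire, feed OPA an input shaped like what it expects. A hypothetical route definition with no `rate_limit` key:

```json
{
  "route_definition": {
    "path": "/api/users",
    "method": "GET"
  }
}
```

Saving the policy as, say, `ratelimit.rego` and running `opa eval -i input.json -d ratelimit.rego "data.main.deny"` should return the missing-rate-limit message; add a `rate_limit` key and the deny set becomes empty.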
Problem: Prioritising speed over comprehension
Solution: Set ADI thresholds and enforce them (block merges)

Problem: Skipping MVA, jumping straight to code
Solution: Make Socratic scoping mandatory for new features

Problem: Asking yes/no questions instead of “explain why”
Solution: Train reviewers on Socratic questioning techniques

Problem: Patterns exist in tribal knowledge, not code
Solution: Document them in `golden-path.json` and enforce via OPA
| Tool | Focus | Strengths | Gaps |
|---|---|---|---|
| vFunction | Architectural drift detection | Visualises true architecture, monitors drift | Observability-focused (tells you drift happened, doesn’t prevent) |
| CodeRabbit | AI peer review | Deep, context-aware reviews | Recommender, not enforcer (no verification gate) |
| Qodo (Codium) | AI test generation + review | Good at generating tests | Focuses on coverage, not comprehension |
| Traycer | Intent-based review | Detects “veering off intent” | Still prioritises velocity over understanding |
| UST CodeCrafter | Enterprise SDLC | Forces ADR capture, policy-as-code | Service-heavy, platform lock-in |
AI velocity without comprehension = technical debt at scale.
Hypothetical pattern to prevent this: capture intent before code is written (MVA and intent snapshots), detect drift automatically (ADI plus policy-as-code), and verify comprehension before merge.
Hypothetical success metrics: average ADI trending downward across PRs, comprehension checks passed on the first attempt, and no Red-threshold merges without architect sign-off.
The goal: Ensure AI amplifies human capability without creating unmaintainable “black box” systems.
Remember: In public sector and regulated industries, the “audit trail of intent” is not optional; it’s mandatory. We need patterns and approaches that make that audit trail explicit and enforceable without becoming a bottleneck to forward progress.