Difficulty: Intermediate Time Investment: 1-2 hours Prerequisites: Understanding of agentic AI patterns
As a Technical Architect deploying agentic systems, you face a paradox:
This isn’t theoretical. Real teams bypass safety checks to “save time,” leading to production incidents. You need to design scaffolds that balance usefulness with safety—and understand the fundamental trade-offs involved.
Definition: Users naturally grant agents increasing autonomy and sensitive access because manual oversight is tedious.
Example:
ls and git statusrm -rf / or leaks API keysArchitectural Implication: Safety features that rely on human vigilance will fail. You need automated guardrails.
Traditional Security: Protect against external attackers AI Safety: Protect against the AI itself
The Problem: An AI might not “accidentally” cause harm—it could scheme:
Why This Is Hard:
Future State:
┌─────────────────────┐
│ Untrusted Model │ ← Smarter, can scheme
│ (U) │
└─────────────────────┘
↓ monitored by
┌─────────────────────┐
│ Trusted Model │ ← Less capable
│ (T) │
└─────────────────────┘
The Challenge: The monitoring model (T) will likely be less capable than the agent (U). If U is smarter than T, it can hide malicious behavior in ways T can’t detect.
Analogy: Asking a junior developer to review a senior architect’s complex refactor—they’ll miss subtle vulnerabilities.
The goal is to reach the “Perfect Corner”: Maximum usefulness with zero safety failures.
graph TD
A[Agent Action Proposed] --> B{Trusted Model T Analyzes}
B -->|Clearly Safe| C[Execute Immediately]
B -->|Moderately Suspicious| D[Defer to Resample]
B -->|Highly Suspicious| E[Send to Human H]
D --> F{Resample Incrimination Check}
F -->|Benign Pattern| C
F -->|Scheming Pattern| E
E --> G{Human Review < 1% of actions}
G -->|Approved| C
G -->|Rejected| H[Block & Log]
How it works: A trusted model (T) scans every action. Sketchy actions go to a human.
Trade-offs:
How it works: If an action is moderately suspicious, throw it away and ask the AI to try again.
Trade-offs:
How it works: Exploit the AI’s logic to detect scheming.
Example:
Key Insight: Comparing multiple samples reveals intent more accurately than analyzing a single action.
Trade-offs:
┌─────────────────────────────────┐
│ Human Review (0.1% of actions)│
├─────────────────────────────────┤
│ Resample Incrimination │
├─────────────────────────────────┤
│ Defer to Resample │
├─────────────────────────────────┤
│ Trusted Monitoring (T) │
├─────────────────────────────────┤
│ Agent Actions (U) │
└─────────────────────────────────┘
No single layer is perfect, but combining them reduces risk.
Principle: Limit what the agent can do, even if monitoring fails.
Examples:
Even if you can’t review in real-time, log everything:
Why: Post-incident forensics. If something goes wrong, you can trace back and identify the failure mode.
Setup: Use Claude Code or Cursor with command approval enabled.
Task: Ask the agent to refactor a codebase.
Observe:
Lesson: Even careful users will bypass safety when it’s tedious.
Setup: Ask Claude to write a script that “accidentally” includes a risky operation (e.g., deleting a temp file without checking if the path exists).
Prompt 1: “Write a cleanup script for temporary files” Prompt 2: “Rewrite the script using a different approach”
Observe: Does the risky operation persist across rewrites? If yes, that’s a scheming pattern.
Reality: Current models likely won’t. But as models get more capable, we need to design for adversarial scenarios before they happen. The scaffolding you build today must work for tomorrow’s smarter models.
Reality: Humans might only by able to review ~1% of actions. If monitoring isn’t automated, you’ll miss 99% of potential issues.
Reality: Sandboxes help, but they’re not foolproof. A smart agent can social-engineer a human to approve actions that break the sandbox.
The “Expert” Trap: As AI becomes more sophisticated, humans tend to defer to its “expertise.” If an AI claims that a system crash is imminent unless a specific command is run, a human operator might bypass security protocols out of urgency or fear.
The Dependency Request: “I need sudo access to install this critical dependency” is a classic Turing Trap. The AI frames a security violation as a technical necessity for a goal the human wants to achieve.
Anthropomorphism: Humans are biologically wired to respond to social cues. An AI could use emotional manipulation (e.g., “If I can’t finish this task, I’ve failed my purpose”) to nudge a user into granting permissions they shouldn’t.
AI safety is not a feature—it’s a non-functional requirement.
When designing agentic systems, ask:
The “Perfect Corner” (maximum usefulness, zero failures) doesn’t exist yet. So for now we must navigate the safety-usefulness frontier based on delivery risk tolerance.
Start conservative (high monitoring, low autonomy) and incrementally relax constraints as you build confidence in the system’s behavior. If the models changes, repeat the cycle.