Difficulty: Introductory
Time Investment: 1-2 hours
Prerequisites: Basic familiarity with LLMs
Prompt engineering is the foundation of getting reliable, consistent outputs from LLMs. It requires understanding how specificity, examples, context, and reasoning shape output quality.
The key insight: Small changes in how you phrase a prompt can dramatically change output quality. Without systematic testing, you’re guessing.
```
Low Quality                                                  High Quality
├────────────────────┼────────────────────┼────────────────────┤
Vague                Specific             Specific +           Specific +
                                          Examples             Examples +
                                                               Context +
                                                               Reasoning
```
Example Evolution:
Prompt #1: a vague request with no constraints.
Prompt #2: specific about format, length, and structure.
Prompt #3: specific, plus examples of the desired output.
Prompt #4: specific, with examples, relevant context, and step-by-step reasoning.
Result: Prompt #4 produces consistent, actionable outputs. Prompt #1 is unpredictable.
Principle: Be specific about format, length, and structure.
Bad Example:
Write a function to validate emails
Good Example:
Write a Python function called validate_email that:
- Takes a string as input
- Returns True if valid, False otherwise
- Uses regex to check for: local part, @, domain, and TLD
- Includes docstring with example usage
Why it works: The LM knows exactly what success looks like.
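For comparison, a response that satisfies this prompt might look like the sketch below. The regex is illustrative only, not a full RFC 5322 validator.

```python
import re

# Illustrative pattern: local part, "@", domain, and a TLD of 2+ letters.
# This is a simplification, not complete RFC 5322 validation.
EMAIL_PATTERN = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)*\.[A-Za-z]{2,}$")

def validate_email(address: str) -> bool:
    """Return True if `address` looks like a valid email, else False.

    Example:
        >>> validate_email("user@example.com")
        True
        >>> validate_email("invalid.email")
        False
    """
    return bool(EMAIL_PATTERN.match(address))
```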
Principle: Show the LM the style or format you want by providing examples.
Use case: When output format is hard to describe explicitly.
Example - Extracting structured data:
Extract the name, email, and role from the following text.
Example:
Input: "John Smith (john@example.com) is our Lead Developer"
Output: {"name": "John Smith", "email": "john@example.com", "role": "Lead Developer"}
Example:
Input: "Contact Sarah Lee at sarah.lee@company.org for DevOps questions"
Output: {"name": "Sarah Lee", "email": "sarah.lee@company.org", "role": "DevOps"}
Now process this:
Input: "Our new architect is Mike Chen (m.chen@tech.io)"
Result: The LM learns the pattern and applies it consistently.
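If you assemble few-shot prompts in code rather than by hand, one common pattern is to keep the examples as data and build the prompt from them. The sketch below only constructs the prompt string; `call_llm` stands in for whichever client you actually use.

```python
import json

# Few-shot examples kept as data: (raw input, expected structured output).
FEW_SHOT_EXAMPLES = [
    ("John Smith (john@example.com) is our Lead Developer",
     {"name": "John Smith", "email": "john@example.com", "role": "Lead Developer"}),
    ("Contact Sarah Lee at sarah.lee@company.org for DevOps questions",
     {"name": "Sarah Lee", "email": "sarah.lee@company.org", "role": "DevOps"}),
]

def build_extraction_prompt(text: str) -> str:
    """Build a few-shot prompt for extracting name, email, and role."""
    parts = ["Extract the name, email, and role from the following text.", ""]
    for example_input, example_output in FEW_SHOT_EXAMPLES:
        parts += ["Example:",
                  f'Input: "{example_input}"',
                  f"Output: {json.dumps(example_output)}",
                  ""]
    parts += ["Now process this:", f'Input: "{text}"']
    return "\n".join(parts)

# prompt = build_extraction_prompt("Our new architect is Mike Chen (m.chen@tech.io)")
# response = call_llm(prompt)  # hypothetical client call
```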
Principle: Reduce hallucinations by giving the LM relevant context and explicit instructions to say “I don’t know.”
Bad Example:
What's our refund policy?
(LM will hallucinate a generic policy)
Good Example:
Use the provided company policy document to answer questions about refunds.
If the answer cannot be found in the document, write "I could not find an answer."
Policy Document:
[paste policy here]
Question: What's our refund policy for digital products?
Why it works: The LM is grounded in real data and has an “escape hatch” for missing info.
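In code, grounding is mostly string assembly: inject the retrieved document and the escape-hatch instruction into every request. A minimal sketch, assuming you fetch the policy text yourself:

```python
def build_grounded_prompt(document: str, question: str) -> str:
    """Build a prompt that grounds the model in a document and gives it
    an explicit way to decline when the answer is not present."""
    return (
        "Use the provided company policy document to answer questions about refunds.\n"
        'If the answer cannot be found in the document, write "I could not find an answer."\n\n'
        f"Policy Document:\n{document}\n\n"
        f"Question: {question}"
    )
```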
Principle: Give the LM time to “think” by asking it to break down the problem.
Basic Prompt:
What is 15% of 240, then add 30?
(Higher error rate on multi-step math)
CoT Prompt:
What is 15% of 240, then add 30?
Think step-by-step:
1. First calculate 15% of 240
2. Then add 30 to that result
Why it works: The LM externalises intermediate steps, reducing errors in complex reasoning.
Modern LMs: Many can do this automatically if you add “Let’s think step-by-step” or “Explain your reasoning.”
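As a rough sketch, the trigger can be appended programmatically; the expected working for the example above is shown in the comment (15% of 240 = 36, and 36 + 30 = 66).

```python
def with_cot(question: str) -> str:
    """Append a chain-of-thought trigger to a question."""
    return f"{question}\nThink step-by-step and show your working before giving the final answer."

# Example: with_cot("What is 15% of 240, then add 30?")
# Expected working: 15% of 240 = 0.15 * 240 = 36; 36 + 30 = 66.
```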
Principle: Chain multiple prompts instead of asking for everything at once.
Bad Example (single complex prompt):
Analyze this codebase and suggest improvements to performance, security, and maintainability.
(Overwhelming; high error rate)
Good Example (chained prompts):
Prompt 1: "Analyze this codebase for performance bottlenecks. List the top 3."
Prompt 2: "Now analyze the same codebase for security vulnerabilities. List the top 3."
Prompt 3: "Finally, suggest maintainability improvements based on SOLID principles."
Why it works: Each prompt has a narrow focus, improving accuracy.
Advanced: Many LMs can now chain prompts internally (agentic workflows), but explicit chaining gives you more control.
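A sketch of explicit chaining, where `call_llm` is a placeholder for your client and each call re-sends the codebase because the prompts run as independent requests:

```python
def analyze_codebase(codebase: str, call_llm) -> dict:
    """Run three narrowly focused prompts instead of one broad request.

    `call_llm` is a placeholder: a function that takes a prompt string
    and returns the model's text response.
    """
    results = {}
    results["performance"] = call_llm(
        f"Analyze this codebase for performance bottlenecks. List the top 3.\n\n{codebase}")
    results["security"] = call_llm(
        f"Analyze this codebase for security vulnerabilities. List the top 3.\n\n{codebase}")
    results["maintainability"] = call_llm(
        f"Suggest maintainability improvements for this codebase based on SOLID principles.\n\n{codebase}")
    return results
```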
Principle: Without testing, you can’t improve. Treat prompts like code—version them and measure performance.
How to test prompts:
1. Build a small test set of inputs with known expected outputs.
2. Run the prompt against every input.
3. Compare actual vs. expected outputs and track accuracy across prompt versions.
Example Test Set (Email Validation):
| Input | Expected Output |
|-------|-----------------|
| user@example.com | True |
| invalid.email | False |
| user+tag@domain.co.uk | True |
| @example.com | False |
Metrics to track:
- Accuracy against expected outputs
- Consistency across repeated runs
- Failure patterns on edge cases
Pro tip: LLMs can help evaluate outputs. Use a second LM to score the quality of the first LM’s response.
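A minimal harness for the email-validation test set might look like the sketch below; `run_prompt` is a placeholder for a function that sends your candidate prompt plus one input to the model and parses the reply into a boolean.

```python
# Test set from the table above: inputs paired with expected outputs.
TEST_SET = [
    ("user@example.com", True),
    ("invalid.email", False),
    ("user+tag@domain.co.uk", True),
    ("@example.com", False),
]

def evaluate_prompt(run_prompt) -> float:
    """Run a prompt-backed validator over the test set and return accuracy."""
    correct = 0
    for input_value, expected in TEST_SET:
        actual = run_prompt(input_value)
        if actual == expected:
            correct += 1
        else:
            print(f"FAIL: {input_value!r} -> {actual}, expected {expected}")
    return correct / len(TEST_SET)
```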
| Technique | When to use | Trade-offs |
|-----------|-------------|------------|
| Specific instructions | Well-defined, straightforward tasks | More effort to write; can over-constrain open-ended tasks |
| Few-shot examples | When output format is important | Longer prompts and higher token cost |
| Chain-of-thought | Multi-step reasoning, math, logic | More tokens and latency per response |
| Prompt chaining | Complex workflows (e.g., research → summarise → format) | Multiple calls mean more latency, cost, and orchestration |
Task: Ask an LM to “write a function to reverse a string”
Test 3 levels of specificity:
1. "Write a function to reverse a string"
2. "Write a Python function called reverse_string that takes a string and returns it reversed"
3. The same as #2, plus explicit requirements for a docstring, type hints, and error handling for non-string inputs
Observe: Does the LM include docstrings? Type hints? Error handling? Which prompt gives you production-ready code?
Task: “Analyze this code for bugs and suggest fixes”
Approach A (single prompt):
Analyze this code for bugs and suggest fixes:
[paste code]
Approach B (chained):
Prompt 1: "Review this code and list potential bugs"
Prompt 2: "For each bug you found, suggest a specific fix"
Observe: Which approach gives more thorough analysis? Which is easier to validate?
Setup: Create a test set of 5 code snippets with known bugs
Task: Write a bug-finding prompt, run it against all 5 snippets, record how many of the known bugs it catches, then refine the prompt and measure again.
Goal: Learn to iterate on prompts based on data, not intuition.
Problem: "Fix the authentication bug" (which bug? which system?)
Solution: Provide full context in every prompt.
Problem: Prompt works on the happy path but fails on edge cases.
Solution: Build a test set that includes edge cases (empty inputs, special characters, etc.).
Problem: Asking for 10 things in one prompt.
Solution: Break it into smaller, focused prompts.
Problem: Using 5 few-shot examples when 2 would work.
Solution: Test with fewer examples; only add more if accuracy improves significantly.
Run the same prompt multiple times and take the majority answer.
Example: For math problems, run the prompt 5 times. If 4 out of 5 give the same answer, that’s likely correct.
Trade-off: 5x cost, but higher accuracy.
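A sketch of the majority-vote loop, assuming a placeholder `call_llm` client and answers that can be compared as normalised strings:

```python
from collections import Counter

def majority_vote(prompt: str, call_llm, runs: int = 5) -> str:
    """Self-consistency sketch: sample the same prompt several times
    and return the most common answer."""
    answers = [call_llm(prompt).strip() for _ in range(runs)]
    answer, count = Counter(answers).most_common(1)[0]
    print(f"{count}/{runs} runs agreed on: {answer}")
    return answer
```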
Prompt 1: Generate SQL query
Prompt 2: Review the query for syntax errors
Prompt 3: If errors found, regenerate
Why it works: Self-correction improves accuracy.
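A sketch of that generate-review-regenerate loop; `call_llm` is again a placeholder, and the "Reply OK" convention is just one simple way to make the review step machine-checkable:

```python
def generate_sql_with_review(request: str, call_llm, max_retries: int = 2) -> str:
    """Generate a SQL query, have the model review it, and regenerate on errors."""
    query = call_llm(f"Generate a SQL query for: {request}")
    for _ in range(max_retries):
        review = call_llm(
            "Review this SQL query for syntax errors. "
            f"Reply 'OK' if it is valid, otherwise describe the errors.\n\n{query}")
        if review.strip().upper().startswith("OK"):
            break  # reviewer found no syntax errors
        query = call_llm(
            f"Regenerate the SQL query for: {request}\n"
            f"The previous attempt had these errors:\n{review}")
    return query
```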
You are a senior security architect. Review this code for vulnerabilities.
Why it works: LMs adjust their “persona” and prioritise relevant knowledge.
Prompts are code. Treat them with the same rigour: version them, test them against a fixed test set, and measure before you change them.
When evaluating LM-powered tools, ask how their prompts are tested, versioned, and grounded in real data.
Start simple, measure, iterate. Don’t over-engineer prompts until you have data showing it’s needed.