# Prompt Engineering Patterns
## Context
You're writing prompts for an LLM and getting inconsistent or incorrect outputs. Common issues:
- **Vague instructions**: Model guesses intent (inconsistent results)
- **No examples**: Model infers task from description alone (ambiguous)
- **No output format**: Model defaults to prose (unparsable)
- **No reasoning scaffolding**: Model jumps to answer (errors in complex tasks)
- **System message misuse**: Task instructions in system message (inflexible)
**This skill provides effective prompt engineering patterns: specificity, few-shot examples, format specification, chain-of-thought, and proper message structure.**
## Core Principle: Be Specific
**Vague prompts → Inconsistent outputs**
**Bad:**
```
Analyze this review: "Product was okay."
```
**Why bad:**
- "Analyze" is ambiguous (sentiment? quality? topics?)
- No scale specified (1-5? positive/negative?)
- No output format (text? JSON? number?)
**Good:**
```
Rate this review's sentiment on a scale of 1-5:
1 = Very negative
2 = Negative
3 = Neutral
4 = Positive
5 = Very positive
Review: "Product was okay."
Output ONLY the number (1-5):
```
**Result:** Consistent "3" every time
### Specificity Checklist:
- **Define the task clearly** (classify, extract, generate, summarize)
- **Specify the scale** (1-5, 1-10, percentage, positive/negative/neutral)
- **Define edge cases** (null values, ambiguous inputs, relative dates)
- **Specify output format** (JSON, CSV, number only, yes/no)
- **Set constraints** (max length, required fields, allowed values)
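As a minimal sketch, the specific version of the sentiment prompt can be parsed directly in code because the output is constrained to a single number. This assumes a generic `llm.generate(prompt)` helper (the same placeholder used later in this skill):

```python
# Hypothetical helper built around the "good" prompt above.
SENTIMENT_PROMPT = """Rate this review's sentiment on a scale of 1-5:
1 = Very negative
2 = Negative
3 = Neutral
4 = Positive
5 = Very positive

Review: "{review}"

Output ONLY the number (1-5):"""

def rate_sentiment(review: str) -> int:
    raw = llm.generate(SENTIMENT_PROMPT.format(review=review))
    rating = int(raw.strip())  # Specific prompt -> output parses as a single integer
    if not 1 <= rating <= 5:
        raise ValueError(f"Rating out of range: {rating}")
    return rating
```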
## Prompt Structure
### Message Roles:
**1. System Message:**
```python
system = """
You are an expert Python programmer with 10 years of experience.
You write clean, efficient, well-documented code.
You always follow PEP 8 style guidelines.
"""
```
**Purpose:**
- Sets role/persona (expert, assistant, teacher)
- Defines global behavior (concise, detailed, technical)
- Applies to entire conversation
**Best practices:**
- Keep it short (< 200 words)
- Define WHO the model is, not WHAT to do
- Set tone and constraints
**2. User Message:**
```python
user = """
Write a Python function that calculates the Fibonacci sequence up to n terms.
Requirements:
- Use recursion with memoization
- Include docstring
- Handle edge cases (n <= 0)
- Return list of integers
Output only the code, no explanations.
"""
```
**Purpose:**
- Specific task instructions (per-request)
- Input data
- Output format requirements
**Best practices:**
- Be specific about requirements
- Include examples if ambiguous
- Specify output format explicitly
**3. Assistant Message (in conversation):**
```python
messages = [
{"role": "system", "content": system},
{"role": "user", "content": "Calculate 2+2"},
{"role": "assistant", "content": "4"},
{"role": "user", "content": "Now multiply that by 3"},
]
```
**Purpose:**
- Conversation history
- Shows model previous responses
- Enables multi-turn conversations
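Putting the three roles together, the conversation above can be sent with a minimal sketch like the following (assuming the current `openai` Python SDK, v1+):

```python
# Minimal sketch: send the message list above to the Chat Completions API
# (openai>=1.0 SDK; reads OPENAI_API_KEY from the environment).
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4",
    messages=messages,  # system + user + assistant turns defined above
)
print(response.choices[0].message.content)  # e.g. "12"
```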
## Few-Shot Learning
**Show, don't tell.** Examples teach better than instructions.
### 0-Shot (No Examples):
```
Extract the person, company, and location from this text:
Text: "Tim Cook presented the new iPhone at Apple's Cupertino campus."
```
**Issues:**
- Model guesses format (JSON? Key-value? List?)
- Edge cases unclear (What if no person? Multiple companies?)
### 1-Shot (One Example):
```
Extract entities as JSON.
Example:
Text: "Satya Nadella spoke at Microsoft in Seattle."
Output: {"person": "Satya Nadella", "company": "Microsoft", "location": "Seattle"}
Now extract from:
Text: "Tim Cook presented the new iPhone at Apple's Cupertino campus."
Output:
```
**Better!** Model sees format and structure.
### Few-Shot (3-5 Examples - BEST):
```
Extract entities as JSON.
Example 1:
Text: "Satya Nadella spoke at Microsoft in Seattle."
Output: {"person": "Satya Nadella", "company": "Microsoft", "location": "Seattle"}
Example 2:
Text: "Google announced Gemini in Mountain View."
Output: {"person": null, "company": "Google", "location": "Mountain View"}
Example 3:
Text: "The event took place online with no speakers."
Output: {"person": null, "company": null, "location": "online"}
Now extract from:
Text: "Tim Cook presented the new iPhone at Apple's Cupertino campus."
Output:
```
**Why 3-5 examples?**
- 1 example: Shows format
- 2-3 examples: Shows variation and edge cases
- 4-5 examples: Shows complex patterns
- More than 5 examples: Diminishing returns (uses more tokens)
### Few-Shot Best Practices:
1. **Cover edge cases:**
- Null values (missing entities)
- Multiple values (list of people)
- Ambiguous cases (nickname vs full name)
2. **Show desired format consistently:**
- All examples use same structure
- Same field names
- Same data types
3. **Order matters:**
- Put most representative example first
- Put edge cases later
- Model learns from all examples
4. **Balance examples:**
- Show positive and negative cases
- Show simple and complex cases
- Avoid bias (don't show only easy examples)
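One practical way to keep few-shot examples consistent is to store them as data and render the prompt from them. The sketch below is illustrative (the `EXAMPLES` structure and `build_prompt` helper are assumptions, not part of any library); storing examples as data guarantees the same field names and JSON shape in every example:

```python
import json

# Few-shot examples stored as data so the format stays identical across examples.
EXAMPLES = [
    {"text": "Satya Nadella spoke at Microsoft in Seattle.",
     "output": {"person": "Satya Nadella", "company": "Microsoft", "location": "Seattle"}},
    {"text": "Google announced Gemini in Mountain View.",
     "output": {"person": None, "company": "Google", "location": "Mountain View"}},
    {"text": "The event took place online with no speakers.",
     "output": {"person": None, "company": None, "location": "online"}},
]

def build_prompt(text: str) -> str:
    parts = ["Extract entities as JSON.", ""]
    for i, ex in enumerate(EXAMPLES, start=1):
        parts.append(f"Example {i}:")
        parts.append(f'Text: "{ex["text"]}"')
        parts.append(f"Output: {json.dumps(ex['output'])}")  # None renders as null
        parts.append("")
    parts.append("Now extract from:")
    parts.append(f'Text: "{text}"')
    parts.append("Output:")
    return "\n".join(parts)
```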
## Chain-of-Thought (CoT) Prompting
**For reasoning tasks, request step-by-step thinking.**
### Without CoT (Direct):
```
Q: A farmer has 17 sheep. All but 9 die. How many sheep are left?
A:
```
**Output:** "8 sheep" (WRONG! Misread "all but 9")
### With CoT:
```
Q: A farmer has 17 sheep. All but 9 die. How many sheep are left?
Think step-by-step:
1. Start with how many sheep
2. Understand what "all but 9 die" means
3. Calculate remaining sheep
4. State the answer
A:
```
**Output:**
```
1. The farmer starts with 17 sheep
2. "All but 9 die" means all sheep except 9 die
3. So 9 sheep remain alive
4. Answer: 9 sheep
```
**Correct!** CoT catches the trick.
### When to Use CoT:
- ✅ Math word problems
- ✅ Logic puzzles
- ✅ Multi-step reasoning
- ✅ Complex decision-making
- ✅ Ambiguous questions
**Not needed for:**
- ❌ Simple classification (sentiment)
- ❌ Direct lookups (capital of France)
- ❌ Pattern matching (regex, entity extraction)
### CoT Variants:
**1. Explicit steps:**
```
Solve step-by-step:
1. Identify what we know
2. Identify what we need to find
3. Set up the equation
4. Solve
5. Verify the answer
```
**2. "Let's think step by step":**
```
Q: [question]
A: Let's think step by step.
```
**3. "Explain your reasoning":**
```
Q: [question]
A: I'll explain my reasoning:
```
**All three work!** Pick what fits your use case.
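A zero-shot CoT prompt can be wrapped in a small helper. The sketch below uses the generic `llm.generate` placeholder; asking the model to end with an `Answer:` line is just a convention for extracting the final answer, not a standard API:

```python
def answer_with_cot(question: str) -> str:
    # Ask for step-by-step reasoning, then a clearly labelled final answer.
    prompt = (
        f"Q: {question}\n"
        "A: Let's think step by step. "
        "End your response with a final line of the form 'Answer: <answer>'."
    )
    response = llm.generate(prompt)
    # Pull out only the final answer line for downstream use.
    for line in reversed(response.splitlines()):
        if line.strip().lower().startswith("answer:"):
            return line.split(":", 1)[1].strip()
    return response.strip()  # Fall back to the full response if no label was found
```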
## Output Formatting
**Specify format explicitly. Don't assume model knows what you want.**
### JSON Output:
**Bad (no format specified):**
```
Extract the name, age, and occupation from: "John is 30 years old and works as an engineer."
```
**Output:** "The person's name is John, who is 30 years old and works as an engineer."
**Good (format specified):**
```
Extract information as JSON:
Text: "John is 30 years old and works as an engineer."
Output in this format:
{
"name": "<string>",
"age": <number>,
"occupation": "<string>"
}
JSON:
```
**Output:**
```json
{
"name": "John",
"age": 30,
"occupation": "engineer"
}
```
### CSV Output:
```
Convert this data to CSV format with columns: name, age, city.
Data: John is 30 and lives in NYC. Mary is 25 and lives in LA.
CSV (with header):
```
**Output:**
```csv
name,age,city
John,30,NYC
Mary,25,LA
```
### Structured Text:
```
Summarize this article in bullet points (max 5 points):
Article: [text]
Summary:
-
```
**Output:**
```
- Point 1
- Point 2
- Point 3
- Point 4
- Point 5
```
### XML/HTML:
```
Format this data as HTML table:
Data: [data]
HTML:
```
### Format Best Practices:
1. **Show the schema:**
```json
{
"field1": "<type>",
"field2": <type>,
...
}
```
2. **Specify data types:** `<string>`, `<number>`, `<boolean>`, `<array>`
3. **Show example output:** Full example of expected output
4. **Request validation:** "Output valid JSON" or "Ensure CSV is parsable"
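Since even well-specified prompts occasionally return malformed JSON, it is worth validating and retrying in code. A minimal sketch using the generic `llm.generate` helper (the retry message and fence handling are illustrative choices):

```python
import json

def extract_json(prompt: str, max_retries: int = 2) -> dict:
    """Call the model, parse its output as JSON, and retry on parse failure."""
    last_error = None
    for _ in range(max_retries + 1):
        raw = llm.generate(prompt, temperature=0).strip()
        if raw.startswith("```"):  # Tolerate a stray code fence around the JSON
            raw = raw.strip("`").removeprefix("json").strip()
        try:
            return json.loads(raw)
        except json.JSONDecodeError as err:
            last_error = err
            prompt += "\n\nOutput valid JSON only, with no surrounding text."
    raise ValueError(f"Model never returned valid JSON: {last_error}")
```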
## Temperature and Sampling
**Temperature controls randomness. Adjust based on task.**
### Temperature = 0 (Deterministic):
```python
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[...],
    temperature=0,  # Near-deterministic: same output almost every time
)
```
**Use for:**
- ✅ Classification (sentiment, category)
- ✅ Extraction (entities, data fields)
- ✅ Structured output (JSON, CSV)
- ✅ Factual queries (capital of X, date of Y)
**Why:** Need consistency and correctness, not creativity
### Temperature = 0.7-1.0 (Creative):
```python
response = client.chat.completions.create(  # client = OpenAI() as above
    model="gpt-4",
    messages=[...],
    temperature=0.8,  # Creative, varied outputs
)
```
**Use for:**
- ✅ Creative writing (stories, poems)
- ✅ Brainstorming (ideas, alternatives)
- ✅ Conversational chat (natural dialogue)
- ✅ Content generation (marketing copy)
**Why:** Want variety and creativity, not determinism
### Temperature = 1.5-2.0 (Very Random):
```python
response = client.chat.completions.create(
    model="gpt-4",
    messages=[...],
    temperature=1.8,  # Very random, surprising outputs
)
```
**Use for:**
- ✅ Experimental generation
- ✅ Highly creative tasks
**Warning:** May produce nonsensical outputs (use carefully)
### Top-p (Nucleus Sampling):
```python
response = client.chat.completions.create(
    model="gpt-4",
    messages=[...],
    temperature=0.7,
    top_p=0.9,  # Consider only the top 90% of probability mass
)
```
**Alternative to temperature:**
- top_p = 1.0: Consider all tokens (default)
- top_p = 0.9: Consider top 90% (filters low-probability tokens)
- top_p = 0.5: Consider top 50% (more focused)
**Best practice:** Use temperature OR top_p, not both
## Common Task Patterns
### 1. Classification:
```
Classify the sentiment of this review as 'positive', 'negative', or 'neutral'.
Output ONLY the label.
Review: "The product works great but shipping was slow."
Sentiment:
```
**Key elements:**
- Clear categories ('positive', 'negative', 'neutral')
- Output constraint ("ONLY the label")
- Prompt ends with field name ("Sentiment:")
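In code, it also helps to validate that the returned label is actually one of the allowed categories. A small sketch, again assuming the generic `llm.generate` helper:

```python
ALLOWED_LABELS = {"positive", "negative", "neutral"}

CLASSIFY_PROMPT = """Classify the sentiment of this review as 'positive', 'negative', or 'neutral'.
Output ONLY the label.

Review: "{review}"

Sentiment:"""

def classify_sentiment(review: str) -> str:
    label = llm.generate(CLASSIFY_PROMPT.format(review=review), temperature=0).strip().lower()
    if label not in ALLOWED_LABELS:
        raise ValueError(f"Unexpected label: {label!r}")
    return label
```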
### 2. Extraction:
```
Extract all dates from this text. Output as JSON array.
Text: "Meeting on March 15, 2024. Follow-up on March 22."
Format:
["YYYY-MM-DD", "YYYY-MM-DD"]
Output:
```
**Key elements:**
- Specific format (JSON array)
- Date format specified (YYYY-MM-DD)
- Shows example structure
### 3. Summarization:
```
Summarize this article in 50 words or less. Focus on the main conclusion and key findings.
Article: [long text]
Summary (max 50 words):
```
**Key elements:**
- Length constraint (50 words)
- Focus instruction (main conclusion, key findings)
- Clear output label
### 4. Generation:
```
Write a product description for a wireless mouse with these features:
- Ergonomic design
- 1600 DPI sensor
- 6-month battery life
- Bluetooth 5.0
Style: Professional, concise (50-100 words)
Product Description:
```
**Key elements:**
- Input data (features list)
- Style guide (professional, concise)
- Length constraint (50-100 words)
### 5. Transformation:
```
Convert this SQL query to Python (using pandas):
SQL:
SELECT name, age FROM users WHERE age > 30 ORDER BY age DESC
Python (pandas):
```
**Key elements:**
- Clear source and target formats
- Shows example input
- Labels expected output
### 6. Question Answering:
```
Answer this question based ONLY on the provided context. If the answer is not in the context, say "I don't know."
Context: [document]
Question: What is the return policy?
Answer:
```
**Key elements:**
- Constraint ("based ONLY on context")
- Fallback instruction ("I don't know")
- Prevents hallucination
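This pattern translates directly into a small helper that injects the retrieved context and treats the fallback phrase as "no answer" so callers can handle it explicitly (a sketch; `llm.generate` as before):

```python
QA_PROMPT = """Answer this question based ONLY on the provided context.
If the answer is not in the context, say "I don't know."

Context: {context}

Question: {question}

Answer:"""

def answer_from_context(context: str, question: str) -> str | None:
    answer = llm.generate(
        QA_PROMPT.format(context=context, question=question), temperature=0
    ).strip()
    return None if answer.lower().startswith("i don't know") else answer
```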
## Advanced Techniques
### 1. Self-Consistency:
**Generate multiple outputs, take majority vote.**
```python
from collections import Counter

answers = []
for _ in range(5):
    response = llm.generate(prompt, temperature=0.7)
    answers.append(response)

# Take majority vote
final_answer = Counter(answers).most_common(1)[0][0]
```
**Use for:**
- Complex reasoning (math, logic)
- When single answer might be wrong
- Accuracy > cost
**Trade-off:** 5× cost for 10-20% accuracy improvement
### 2. Tree-of-Thoughts:
**Explore multiple reasoning paths, pick best.**
```
Problem: [complex problem]
Let's consider 3 different approaches:
Approach 1: [reasoning path 1]
Approach 2: [reasoning path 2]
Approach 3: [reasoning path 3]
Which approach is best? Evaluate each:
[evaluation]
Best approach: [selection]
Now solve using the best approach:
[solution]
```
**Use for:**
- Complex planning
- Strategic decision-making
- Multiple valid solutions
### 3. ReAct (Reasoning + Acting):
**Interleave reasoning with actions (tool use).**
```
Task: What's the weather in the city where the Eiffel Tower is located?
Thought: I need to find where the Eiffel Tower is located.
Action: Search "Eiffel Tower location"
Observation: The Eiffel Tower is in Paris, France.
Thought: Now I need the weather in Paris.
Action: Weather API call for Paris
Observation: 15°C, partly cloudy
Answer: It's 15°C and partly cloudy in Paris.
```
**Use for:**
- Multi-step tasks with tool use
- Search + reasoning
- API interactions
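A minimal ReAct loop can be implemented by parsing `Action:` lines, running the named tool, and feeding the result back as an `Observation:` line. The sketch below is illustrative only: the tool registry, action syntax, and stopping condition are assumptions, and in practice the prompt would also include a few-shot preamble showing the Thought/Action/Observation format:

```python
import re

def react_loop(task: str, tools: dict, max_steps: int = 5) -> str:
    """Minimal Thought/Action/Observation loop; stops when the model emits 'Answer:'."""
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        step = llm.generate(transcript + "Thought:", temperature=0)
        transcript += "Thought:" + step + "\n"
        answered = re.search(r"Answer:\s*(.+)", step)
        if answered:
            return answered.group(1).strip()
        action = re.search(r"Action:\s*(.+)", step)
        if action:
            name, _, arg = action.group(1).strip().partition(" ")
            result = tools[name](arg.strip('" ')) if name in tools else f"Unknown tool: {name}"
            transcript += f"Observation: {result}\n"
    return "No answer within step limit"

# Usage sketch with hypothetical tools:
# react_loop("What's the weather in the city where the Eiffel Tower is located?",
#            tools={"Search": my_search, "Weather": my_weather})
```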
### 4. Instruction Following:
**Separate instructions from data.**
```
Instructions:
- Extract all email addresses
- Validate format (user@domain.com)
- Remove duplicates
- Sort alphabetically
Data:
[text with emails]
Output (JSON array):
```
**Best practice:** Clearly separate "Instructions" from "Data"
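One way to enforce this separation is to build the prompt from clearly labelled blocks, so instructions never get interleaved with the input data (a small illustrative sketch):

```python
def build_extraction_prompt(text: str) -> str:
    # Instructions and data live in separate, labelled blocks.
    instructions = "\n".join([
        "Instructions:",
        "- Extract all email addresses",
        "- Validate format (user@domain.com)",
        "- Remove duplicates",
        "- Sort alphabetically",
    ])
    data = f"Data:\n{text}"
    return f"{instructions}\n\n{data}\n\nOutput (JSON array):"
```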
## Debugging Prompts
**If output is wrong, diagnose systematically.**
### Problem 1: Inconsistent outputs
**Diagnosis:**
- Instructions too vague?
- No examples?
- Temperature too high?
**Fix:**
- Add specificity
- Add 3-5 examples
- Set temperature=0
### Problem 2: Wrong format
**Diagnosis:**
- Format not specified?
- Example format missing?
**Fix:**
- Specify format explicitly
- Show example output structure
- End prompt with format label ("JSON:", "CSV:")
### Problem 3: Factual errors
**Diagnosis:**
- Hallucination (model making up facts)?
- No chain-of-thought?
**Fix:**
- Add "based only on provided context"
- Request "cite your sources"
- Add "if unsure, say 'I don't know'"
### Problem 4: Too verbose
**Diagnosis:**
- No length constraint?
- No "output only" instruction?
**Fix:**
- Add word/character limit
- Add "output ONLY the [X], no explanations"
- Show concise examples
### Problem 5: Misses edge cases
**Diagnosis:**
- Edge cases not in examples?
- Instructions don't cover edge cases?
**Fix:**
- Add edge case examples (null, empty, ambiguous)
- Explicitly mention edge case handling
## Prompt Testing
**Test prompts systematically before production.**
### 1. Create test cases:
```python
test_cases = [
    # Normal cases
    {"input": "...", "expected": "..."},
    {"input": "...", "expected": "..."},
    # Edge cases
    {"input": "", "expected": "null"},   # Empty input
    {"input": "...", "expected": "null"},  # Missing data
    # Ambiguous cases
    {"input": "...", "expected": "..."},
]
```
### 2. Run tests:
```python
for case in test_cases:
    output = llm.generate(prompt.format(input=case["input"]))
    assert output == case["expected"], f"Failed on {case['input']}"
```
### 3. Measure metrics:
```python
# Accuracy: generate an output for each test case and compare to expected
outputs = [llm.generate(prompt.format(input=case["input"])) for case in test_cases]
correct = sum(1 for out, case in zip(outputs, test_cases) if out == case["expected"])
accuracy = correct / len(test_cases)

# Consistency (run the same input 10 times)
repeats = [llm.generate(prompt) for _ in range(10)]
consistency = len(set(repeats)) == 1  # All outputs identical?

# Latency
import time
start = time.time()
output = llm.generate(prompt)
latency = time.time() - start
```
## Prompt Optimization Workflow
**Iterative improvement process:**
### Step 1: Baseline prompt (simple)
```
Classify sentiment: [text]
```
### Step 2: Test and measure
```python
accuracy = 0.65     # 65% - too low!
consistency = 0.40  # 40% - very inconsistent
```
### Step 3: Add specificity
```
Classify sentiment as 'positive', 'negative', or 'neutral'.
Output ONLY the label.
Text: [text]
Sentiment:
```
**Result:** accuracy = 75%, consistency = 80%
### Step 4: Add few-shot examples
```
Classify sentiment as 'positive', 'negative', or 'neutral'.
Examples:
[3 examples]
Text: [text]
Sentiment:
```
**Result:** accuracy = 88%, consistency = 95%
### Step 5: Add edge case handling
```
[Include edge case examples in few-shot]
```
**Result:** accuracy = 92%, consistency = 98%
### Step 6: Optimize for cost/latency
```python
# Reduce examples from 5 to 3 (latency 400ms → 300ms)
# Accuracy still 92%
```
**Final:** accuracy = 92%, consistency = 98%, latency = 300ms
## Prompt Libraries and Templates
**Reusable templates for common tasks.**
### Template 1: Classification
```
Classify {item} as one of: {categories}.
{optional: 3-5 examples}
Output ONLY the category label.
{item}: {input}
Category:
```
### Template 2: Extraction
```
Extract {fields} from the text. Output as JSON.
{optional: 3-5 examples showing format and edge cases}
Text: {input}
JSON:
```
### Template 3: Summarization
```
Summarize this {content_type} in {length} words or less.
Focus on {aspects}.
{content_type}: {input}
Summary ({length} words max):
```
### Template 4: Generation
```
Write {output_type} with these characteristics:
{characteristics}
Style: {style}
Length: {length}
{output_type}:
```
### Template 5: Chain-of-Thought
```
{question}
Think step-by-step:
1. {step_1_prompt}
2. {step_2_prompt}
3. {step_3_prompt}
Answer:
```
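The usage snippet below fills Template 1. Stored as an ordinary Python string, it might look like this (the constant name `CLASSIFICATION_TEMPLATE` is just the one assumed by the usage example):

```python
CLASSIFICATION_TEMPLATE = """Classify {item} as one of: {categories}.

Output ONLY the category label.

{item}: {input}

Category:"""
```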
**Usage:**
```python
prompt = CLASSIFICATION_TEMPLATE.format(
item="review",
categories="'positive', 'negative', 'neutral'",
input=review_text
)
```
## Anti-Patterns
### Anti-pattern 1: "The model is stupid"
**Wrong:** "The model doesn't understand. I need a better model."
**Right:** "My prompt is ambiguous. Let me add examples and specificity."
**Principle:** 90% of issues are prompt issues, not model issues.
### Anti-pattern 2: "Just run it multiple times"
**Wrong:** "Run 10 times and take the average/majority."
**Right:** "Fix the prompt so it's consistent (temperature=0, specific instructions)."
**Principle:** Consistency should come from the prompt, not multiple runs.
### Anti-pattern 3: "Parse the prose output"
**Wrong:** "I'll extract JSON from the prose with regex."
**Right:** "I'll request JSON output explicitly in the prompt."
**Principle:** Specify format in prompt, don't parse after the fact.
### Anti-pattern 4: "System message for everything"
**Wrong:** Put task instructions in system message.
**Right:** System = role/behavior, User = task/instructions.
**Principle:** System message is global (all requests), user message is per-request.
### Anti-pattern 5: "More tokens = better"
**Wrong:** "I'll write a 1000-word prompt with every detail."
**Right:** "I'll write a concise prompt with 3-5 examples."
**Principle:** Concise + examples > verbose instructions.
## Summary
**Core principles:**
1. **Be specific**: Define scale, edge cases, constraints, output format
2. **Use few-shot**: 3-5 examples teach better than instructions
3. **Specify format**: JSON, CSV, structured text (explicit schema)
4. **Request reasoning**: Chain-of-thought for complex tasks
5. **Correct message structure**: System = role, User = task
**Temperature:**
- 0: Classification, extraction, structured output (deterministic)
- 0.7-1.0: Creative writing, brainstorming (varied)
**Common patterns:**
- Classification: Specify categories, output constraint
- Extraction: Format + examples + edge cases
- Summarization: Length + focus areas
- Generation: Features + style + length
**Advanced:**
- Self-consistency: Multiple runs + majority vote
- Tree-of-thoughts: Multiple reasoning paths
- ReAct: Reasoning + action (tool use)
**Debugging:**
- Inconsistent → Add specificity, examples, temperature=0
- Wrong format → Specify format explicitly with examples
- Factual errors → Add context constraints, chain-of-thought
- Too verbose → Add length limits, "output only"
**Key insight:** Prompts are code. Treat them like code: test, iterate, optimize, version control.