Initial commit
This commit is contained in:
554
skills/engineering-prompts/reference/research.md
Normal file
554
skills/engineering-prompts/reference/research.md
Normal file
@@ -0,0 +1,554 @@
|
||||
# Prompt Engineering Research & Best Practices
|
||||
|
||||
Latest findings from Anthropic research and community best practices for prompt engineering with Claude models.
|
||||
|
||||
## Table of Contents
|
||||
|
||||
- [Anthropic's Core Research Findings](#anthropics-core-research-findings)
|
||||
- [Effective Context Engineering (2024)](#effective-context-engineering-2024)
|
||||
- [Agent Architecture Best Practices (2024-2025)](#agent-architecture-best-practices-2024-2025)
|
||||
- [Citations and Source Grounding (2024)](#citations-and-source-grounding-2024)
|
||||
- [Extended Thinking (2024)](#extended-thinking-2024)
|
||||
- [Community Best Practices (2024-2025)](#community-best-practices-2024-2025)
|
||||
- [Technique Selection Decision Tree (2025 Consensus)](#technique-selection-decision-tree-2025-consensus)
|
||||
- [Measuring Prompt Effectiveness](#measuring-prompt-effectiveness)
|
||||
- [Future Directions (2025 and Beyond)](#future-directions-2025-and-beyond)
|
||||
- [Key Takeaways from Research](#key-takeaways-from-research)
|
||||
- [Research Sources](#research-sources)
|
||||
- [Keeping Current](#keeping-current)
|
||||
- [Research-Backed Anti-Patterns](#research-backed-anti-patterns)
|
||||
|
||||
---
|
||||
|
||||
## Anthropic's Core Research Findings
|
||||
|
||||
### 1. Prompt Engineering vs Fine-Tuning (2024-2025)
|
||||
|
||||
**Key Finding:** Prompt engineering is preferable to fine-tuning for most use cases.
|
||||
|
||||
**Advantages:**
|
||||
- **Speed**: Nearly instantaneous results vs hours/days for fine-tuning
|
||||
- **Cost**: Uses base models, no GPU resources required
|
||||
- **Flexibility**: Rapid experimentation and quick iteration
|
||||
- **Data Requirements**: Works with few-shot or zero-shot learning
|
||||
- **Knowledge Preservation**: Avoids catastrophic forgetting of general capabilities
|
||||
- **Transparency**: Prompts are human-readable and debuggable
|
||||
|
||||
**When Fine-Tuning Wins:**
|
||||
- Extremely consistent style requirements across millions of outputs
|
||||
- Domain-specific jargon that's rare in training data
|
||||
- Performance optimization for resource-constrained environments
|
||||
|
||||
**Source:** Anthropic Prompt Engineering Documentation (2025)
|
||||
|
||||
---
|
||||
|
||||
### 2. Long Context Window Performance (2024)
|
||||
|
||||
**Key Finding:** Document placement dramatically affects accuracy in long context scenarios.
|
||||
|
||||
**Research Results:**
|
||||
- Placing documents BEFORE queries improves performance by up to 30%
|
||||
- Claude experiences "lost in the middle" phenomenon like other LLMs
|
||||
- XML structure helps Claude organize and retrieve from long contexts
|
||||
- Quote grounding (asking Claude to quote relevant sections first) cuts through noise
|
||||
|
||||
**Optimal Pattern:**
|
||||
```xml
|
||||
<document id="1">
|
||||
<metadata>...</metadata>
|
||||
<content>...</content>
|
||||
</document>
|
||||
<!-- More documents -->
|
||||
|
||||
<instructions>
|
||||
[Query based on documents]
|
||||
</instructions>
|
||||
```
|
||||
|
||||
**Source:** Claude Long Context Tips Documentation
|
||||
|
||||
---
|
||||
|
||||
### 3. Chain of Thought Effectiveness (2023-2025)
|
||||
|
||||
**Key Finding:** Encouraging step-by-step reasoning significantly improves accuracy on analytical tasks.
|
||||
|
||||
**Results:**
|
||||
- Simple "Think step by step" phrase improves reasoning accuracy
|
||||
- Explicit `<thinking>` tags provide transparency and verifiability
|
||||
- Costs 2-3x output tokens but worth it for complex tasks
|
||||
- Most effective for: math, logic, multi-step analysis, debugging
|
||||
|
||||
**Implementation Evolution:**
|
||||
- 2023: Simple "think step by step" prompts
|
||||
- 2024: Structured thinking with XML tags
|
||||
- 2025: Extended thinking mode with configurable token budgets (16K+ tokens)
|
||||
|
||||
**Source:** Anthropic Prompt Engineering Techniques, Extended Thinking Documentation
|
||||
|
||||
---
|
||||
|
||||
### 4. Prompt Caching Economics (2024)
|
||||
|
||||
**Key Finding:** Prompt caching can reduce costs by 90% for repeated content.
|
||||
|
||||
**Cost Structure:**
|
||||
- Cache write: 25% of standard input token cost
|
||||
- Cache read: 10% of standard input token cost
|
||||
- Effective savings: ~90% for content that doesn't change
|
||||
|
||||
**Optimal Use Cases:**
|
||||
- System prompts (stable across calls)
|
||||
- Reference documentation (company policies, API docs)
|
||||
- Examples in multishot prompting (reused across calls)
|
||||
- Long context documents (analyzed repeatedly)
|
||||
|
||||
**Architecture Pattern:**
|
||||
```
|
||||
[Stable content - caches]
|
||||
└─ System prompt
|
||||
└─ Reference docs
|
||||
└─ Guidelines
|
||||
|
||||
[Variable content - doesn't cache]
|
||||
└─ User query
|
||||
└─ Specific inputs
|
||||
```
|
||||
|
||||
**ROI Example:**
|
||||
- 40K token system prompt + docs
|
||||
- 1,000 queries/day
|
||||
- Without caching: $3.60/day (Sonnet)
|
||||
- With caching: $0.36/day
|
||||
- Savings: $1,180/year per 1K daily queries
|
||||
|
||||
**Source:** Anthropic Prompt Caching Announcement
|
||||
|
||||
---
|
||||
|
||||
### 5. XML Tags Fine-Tuning (2024)
|
||||
|
||||
**Key Finding:** Claude has been specifically fine-tuned to pay attention to XML tags.
|
||||
|
||||
**Why It Works:**
|
||||
- Training included examples of XML-structured prompts
|
||||
- Model learned to treat tags as hard boundaries
|
||||
- Prevents instruction leakage from user input
|
||||
- Improves retrieval from long contexts
|
||||
|
||||
**Best Practices:**
|
||||
- Use semantic tag names (`<instructions>`, `<context>`, `<examples>`)
|
||||
- Nest tags for hierarchy when appropriate
|
||||
- Consistent tag structure across prompts (helps with caching)
|
||||
- Close all tags properly
|
||||
|
||||
**Source:** AWS ML Blog on Anthropic Prompt Engineering
|
||||
|
||||
---
|
||||
|
||||
### 6. Contextual Retrieval (2024)
|
||||
|
||||
**Key Finding:** Encoding context with chunks dramatically improves RAG accuracy.
|
||||
|
||||
**Traditional RAG Issues:**
|
||||
- Chunks encoded in isolation lose surrounding context
|
||||
- Semantic similarity can miss relevant chunks
|
||||
- Failed retrievals lead to incorrect or incomplete responses
|
||||
|
||||
**Contextual Retrieval Solution:**
|
||||
- Encode each chunk with surrounding context
|
||||
- Combine semantic search with BM25 lexical matching
|
||||
- Apply reranking for final selection
|
||||
|
||||
**Results:**
|
||||
- 49% reduction in failed retrievals (contextual retrieval alone)
|
||||
- 67% reduction with contextual retrieval + reranking
|
||||
- Particularly effective for technical documentation and code
|
||||
|
||||
**When to Skip RAG:**
|
||||
- Knowledge base < 200K tokens (fits in context window)
|
||||
- With prompt caching, including full docs is cost-effective
|
||||
|
||||
**Source:** Anthropic Contextual Retrieval Announcement
|
||||
|
||||
---
|
||||
|
||||
### 7. Batch Processing Economics (2024)
|
||||
|
||||
**Key Finding:** Batch API reduces costs by 50% for non-time-sensitive workloads.
|
||||
|
||||
**Use Cases:**
|
||||
- Periodic reports
|
||||
- Bulk data analysis
|
||||
- Non-urgent content generation
|
||||
- Testing and evaluation
|
||||
|
||||
**Combined Savings:**
|
||||
- Batch processing: 50% cost reduction
|
||||
- Plus prompt caching: Additional 90% on cached content
|
||||
- Combined potential: 95% cost reduction vs real-time without caching
|
||||
|
||||
**Source:** Anthropic Batch API Documentation
|
||||
|
||||
---
|
||||
|
||||
### 8. Model Capability Tiers (2024-2025)
|
||||
|
||||
**Research Finding:** Different tasks have optimal model choices based on complexity vs cost.
|
||||
|
||||
**Claude Haiku 4.5 (Released Oct 2024):**
|
||||
- Performance: Comparable to Sonnet 4
|
||||
- Speed: ~2x faster than Sonnet 4
|
||||
- Cost: 1/3 of Sonnet 4 ($0.25/$1.25 per M tokens)
|
||||
- Best for: High-volume simple tasks, extraction, formatting
|
||||
|
||||
**Claude Sonnet 4.5 (Released Oct 2024):**
|
||||
- Performance: State-of-the-art coding agent (77.2% SWE-bench)
|
||||
- Sustained attention: 30+ hours on complex tasks
|
||||
- Cost: $3/$15 per M tokens
|
||||
- Best for: Most production workloads, balanced use cases
|
||||
|
||||
**Claude Opus 4:**
|
||||
- Performance: Maximum capability
|
||||
- Cost: $15/$75 per M tokens (5x Sonnet)
|
||||
- Best for: Novel problems, deep reasoning, research
|
||||
|
||||
**Architectural Implication:**
|
||||
- Orchestrator (Sonnet) + Executor subagents (Haiku) = optimal cost/performance
|
||||
- Task routing based on complexity assessment
|
||||
- Dynamic model selection within workflows
|
||||
|
||||
**Source:** Anthropic Model Releases, TechCrunch Coverage
|
||||
|
||||
---
|
||||
|
||||
## Effective Context Engineering (2024)
|
||||
|
||||
**Key Research:** Managing attention budget is as important as prompt design.
|
||||
|
||||
### The Attention Budget Problem
|
||||
- LLMs have finite capacity to process and integrate information
|
||||
- Performance degrades with very long contexts ("lost in the middle")
|
||||
- n² pairwise relationships for n tokens strains attention mechanism
|
||||
|
||||
### Solutions:
|
||||
|
||||
**1. Compaction**
|
||||
- Summarize conversation near context limit
|
||||
- Reinitiate with high-fidelity summary
|
||||
- Preserve architectural decisions, unresolved bugs, implementation details
|
||||
- Discard redundant tool outputs
|
||||
|
||||
**2. Structured Note-Taking**
|
||||
- Maintain curated notes about decisions, findings, state
|
||||
- Reference notes across context windows
|
||||
- More efficient than reproducing conversation history
|
||||
|
||||
**3. Multi-Agent Architecture**
|
||||
- Distribute work across agents with specialized contexts
|
||||
- Each maintains focused context on their domain
|
||||
- Orchestrator coordinates without managing all context
|
||||
|
||||
**4. Context Editing (2024)**
|
||||
- Automatically clear stale tool calls and results
|
||||
- Preserve conversation flow
|
||||
- 84% token reduction in 100-turn evaluations
|
||||
- 29% performance improvement on agentic search tasks
|
||||
|
||||
**Source:** Anthropic Engineering Blog - Effective Context Engineering
|
||||
|
||||
---
|
||||
|
||||
## Agent Architecture Best Practices (2024-2025)
|
||||
|
||||
**Research Consensus:** Successful agents follow three core principles.
|
||||
|
||||
### 1. Simplicity
|
||||
- Do exactly what's needed, no more
|
||||
- Avoid unnecessary abstraction layers
|
||||
- Frameworks help initially, but production often benefits from basic components
|
||||
|
||||
### 2. Transparency
|
||||
- Show explicit planning steps
|
||||
- Allow humans to verify reasoning
|
||||
- Enable intervention when plans seem misguided
|
||||
- "Agent shows its work" principle
|
||||
|
||||
### 3. Careful Tool Crafting
|
||||
- Thorough tool documentation with examples
|
||||
- Clear descriptions of when to use each tool
|
||||
- Tested tool integrations
|
||||
- Agent-computer interface as first-class design concern
|
||||
|
||||
**Anti-Pattern:** Framework-heavy implementations that obscure decision-making
|
||||
|
||||
**Recommended Pattern:**
|
||||
- Start with frameworks for rapid prototyping
|
||||
- Gradually reduce abstractions for production
|
||||
- Build with basic components for predictability
|
||||
|
||||
**Source:** Anthropic Research - Building Effective Agents
|
||||
|
||||
---
|
||||
|
||||
## Citations and Source Grounding (2024)
|
||||
|
||||
**Research Finding:** Built-in citation capabilities outperform most custom implementations.
|
||||
|
||||
**Citations API Benefits:**
|
||||
- 15% higher recall accuracy vs custom solutions
|
||||
- Automatic sentence-level chunking
|
||||
- Precise attribution to source documents
|
||||
- Critical for legal, academic, financial applications
|
||||
|
||||
**Use Cases:**
|
||||
- Legal research requiring source verification
|
||||
- Academic writing with proper attribution
|
||||
- Fact-checking workflows
|
||||
- Financial analysis with auditable sources
|
||||
|
||||
**Source:** Claude Citations API Announcement
|
||||
|
||||
---
|
||||
|
||||
## Extended Thinking (2024)
|
||||
|
||||
**Capability:** Claude can allocate extended token budget for reasoning before responding.
|
||||
|
||||
**Key Parameters:**
|
||||
- Thinking budget: 16K+ tokens recommended for complex tasks
|
||||
- Configurable based on task complexity
|
||||
- Trade latency for accuracy on hard problems
|
||||
|
||||
**Use Cases:**
|
||||
- Complex math problems
|
||||
- Novel coding challenges
|
||||
- Multi-step reasoning tasks
|
||||
- Analysis requiring sustained attention
|
||||
|
||||
**Combined with Tools (Beta):**
|
||||
- Alternate between reasoning and tool invocation
|
||||
- Reason about available tools, invoke, analyze results, adjust reasoning
|
||||
- More sophisticated than fixed reasoning → execution sequences
|
||||
|
||||
**Source:** Claude Extended Thinking Documentation
|
||||
|
||||
---
|
||||
|
||||
## Community Best Practices (2024-2025)
|
||||
|
||||
### Disable Auto-Compact in Claude Code
|
||||
|
||||
**Finding:** Auto-compact can consume 45K tokens (22.5% of context window) before coding begins.
|
||||
|
||||
**Recommendation:**
|
||||
- Turn off auto-compact: `/config` → toggle off
|
||||
- Use `/clear` after 1-3 messages to prevent bloat
|
||||
- Run `/clear` immediately after disabling to reclaim tokens
|
||||
- Regain 88.1% of context window for productive work
|
||||
|
||||
**Source:** Shuttle.dev Claude Code Best Practices
|
||||
|
||||
### CLAUDE.md Curation
|
||||
|
||||
**Finding:** Auto-generated CLAUDE.md files are too generic.
|
||||
|
||||
**Best Practice:**
|
||||
- Manually curate project-specific patterns
|
||||
- Keep under 100 lines per file
|
||||
- Include non-obvious relationships
|
||||
- Document anti-patterns to avoid
|
||||
- Optimize for AI agent understanding, not human documentation
|
||||
|
||||
**Source:** Claude Code Best Practices, Anthropic Engineering
|
||||
|
||||
### Custom Slash Commands as Infrastructure
|
||||
|
||||
**Finding:** Repeated prompting patterns benefit from reusable commands.
|
||||
|
||||
**Best Practice:**
|
||||
- Store in `.claude/commands/` for project-level
|
||||
- Store in `~/.claude/commands/` for user-level
|
||||
- Check into version control for team benefit
|
||||
- Use `$ARGUMENTS` and `$1, $2, etc.` for parameters
|
||||
- Encode team best practices as persistent infrastructure
|
||||
|
||||
**Source:** Claude Code Documentation
|
||||
|
||||
---
|
||||
|
||||
## Technique Selection Decision Tree (2025 Consensus)
|
||||
|
||||
Based on aggregated research and community feedback:
|
||||
|
||||
```
|
||||
Start: Define Task
|
||||
↓
|
||||
┌──────────────┴──────────────┐
|
||||
│ │
|
||||
Complexity? Repeated Use?
|
||||
│ │
|
||||
┌───┴───┐ ┌────┴────┐
|
||||
Simple Medium Complex Yes No
|
||||
│ │ │ │ │
|
||||
Clarity +XML +Role Cache One-off
|
||||
+CoT +CoT Structure Design
|
||||
+Examples +XML
|
||||
+Tools
|
||||
|
||||
Token Budget?
|
||||
│
|
||||
┌───┴───┐
|
||||
Tight Flexible
|
||||
│ │
|
||||
Skip Add CoT
|
||||
CoT Examples
|
||||
|
||||
Format Critical?
|
||||
│
|
||||
┌───┴────┐
|
||||
Yes No
|
||||
│ │
|
||||
+Prefill Skip
|
||||
+Examples
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Measuring Prompt Effectiveness
|
||||
|
||||
**Research Recommendation:** Systematic evaluation before and after prompt engineering.
|
||||
|
||||
### Metrics to Track
|
||||
|
||||
**Accuracy:**
|
||||
- Correctness of outputs
|
||||
- Alignment with success criteria
|
||||
- Error rates
|
||||
|
||||
**Consistency:**
|
||||
- Output format compliance
|
||||
- Reliability across runs
|
||||
- Variance in responses
|
||||
|
||||
**Cost:**
|
||||
- Tokens per request
|
||||
- $ cost per request
|
||||
- Caching effectiveness
|
||||
|
||||
**Latency:**
|
||||
- Time to first token
|
||||
- Total response time
|
||||
- User experience impact
|
||||
|
||||
### Evaluation Framework
|
||||
|
||||
1. **Baseline:** Measure current prompt performance
|
||||
2. **Iterate:** Apply one technique at a time
|
||||
3. **Measure:** Compare metrics to baseline
|
||||
4. **Keep or Discard:** Retain only improvements
|
||||
5. **Document:** Record which techniques help for which tasks
|
||||
|
||||
**Anti-Pattern:** Applying all techniques without measuring effectiveness
|
||||
|
||||
---
|
||||
|
||||
## Future Directions (2025 and Beyond)
|
||||
|
||||
### Emerging Trends
|
||||
|
||||
**1. Agent Capabilities**
|
||||
- Models maintaining focus for 30+ hours (Sonnet 4.5)
|
||||
- Improved context awareness and self-management
|
||||
- Better tool use and reasoning integration
|
||||
|
||||
**2. Cost Curve Collapse**
|
||||
- Haiku 4.5 matches Sonnet 4 at 1/3 cost
|
||||
- Enables new deployment patterns (parallel subagents)
|
||||
- Economic feasibility of agent orchestration
|
||||
|
||||
**3. Multimodal Integration**
|
||||
- Vision + text for document analysis
|
||||
- 60% reduction in document processing time
|
||||
- Correlation of visual and textual information
|
||||
|
||||
**4. Safety and Alignment**
|
||||
- Research on agentic misalignment
|
||||
- Importance of human oversight at scale
|
||||
- System design for ethical constraints
|
||||
|
||||
**5. Standardization**
|
||||
- Model Context Protocol (MCP) for tool integration
|
||||
- Reduced custom integration complexity
|
||||
- Ecosystem of third-party tools
|
||||
|
||||
---
|
||||
|
||||
## Key Takeaways from Research
|
||||
|
||||
1. **Simplicity wins**: Start minimal, add complexity only when justified by results
|
||||
2. **Structure scales**: XML tags become essential as complexity increases
|
||||
3. **Thinking costs but helps**: 2-3x tokens for reasoning, worth it for analysis
|
||||
4. **Caching transforms economics**: 90% savings makes long prompts feasible
|
||||
5. **Placement matters**: Documents before queries, 30% better performance
|
||||
6. **Tools need docs**: Clear descriptions → correct usage
|
||||
7. **Agents need transparency**: Show reasoning, enable human verification
|
||||
8. **Context is finite**: Manage attention budget deliberately
|
||||
9. **Measure everything**: Remove techniques that don't improve outcomes
|
||||
10. **Economic optimization**: Right model for right task (Haiku → Sonnet → Opus)
|
||||
|
||||
---
|
||||
|
||||
## Research Sources
|
||||
|
||||
- Anthropic Prompt Engineering Documentation (2024-2025)
|
||||
- Anthropic Engineering Blog - Context Engineering (2024)
|
||||
- Anthropic Research - Building Effective Agents (2024)
|
||||
- Claude Code Best Practices (Anthropic, 2024)
|
||||
- Shuttle.dev Claude Code Analysis (2024)
|
||||
- AWS ML Blog - Anthropic Techniques (2024)
|
||||
- Contextual Retrieval Research (Anthropic, 2024)
|
||||
- Model Release Announcements (Sonnet 4.5, Haiku 4.5)
|
||||
- Citations API Documentation (2024)
|
||||
- Extended Thinking Documentation (2024)
|
||||
- Community Best Practices (Multiple Sources, 2024-2025)
|
||||
|
||||
---
|
||||
|
||||
## Keeping Current
|
||||
|
||||
**Best Practices:**
|
||||
- Follow Anthropic Engineering blog for latest research
|
||||
- Monitor Claude Code documentation updates
|
||||
- Track community implementations (GitHub, forums)
|
||||
- Experiment with new capabilities as released
|
||||
- Measure impact of new techniques on your use cases
|
||||
|
||||
**Resources:**
|
||||
- https://www.anthropic.com/research
|
||||
- https://www.anthropic.com/engineering
|
||||
- https://docs.claude.com/
|
||||
- https://code.claude.com/docs
|
||||
- Community: r/ClaudeAI, Anthropic Discord
|
||||
|
||||
---
|
||||
|
||||
## Research-Backed Anti-Patterns
|
||||
|
||||
Based on empirical findings, avoid:
|
||||
|
||||
❌ **Ignoring Document Placement** - 30% performance loss
|
||||
❌ **Not Leveraging Caching** - 10x unnecessary costs
|
||||
❌ **Over-Engineering Simple Tasks** - Worse results + higher cost
|
||||
❌ **Framework Over-Reliance** - Obscures decision-making
|
||||
❌ **Skipping Measurement** - Can't validate improvements
|
||||
❌ **One-Size-Fits-All Prompts** - Suboptimal for specific tasks
|
||||
❌ **Vague Tool Documentation** - Poor tool selection
|
||||
❌ **Ignoring Context Budget** - Performance degradation
|
||||
❌ **No Agent Transparency** - Debugging nightmares
|
||||
❌ **Wrong Model for Task** - Overpaying or underperforming
|
||||
|
||||
---
|
||||
|
||||
This research summary reflects the state of Anthropic's prompt engineering best practices as of 2025, incorporating both official research and validated community findings.
|
||||
Reference in New Issue
Block a user