Initial commit

This commit is contained in:
Zhongwei Li
2025-11-29 18:20:33 +08:00
commit 977fbf5872
27 changed files with 5714 additions and 0 deletions

View File

@@ -0,0 +1,554 @@
# Prompt Engineering Research & Best Practices
Latest findings from Anthropic research and community best practices for prompt engineering with Claude models.
## Table of Contents
- [Anthropic's Core Research Findings](#anthropics-core-research-findings)
- [Effective Context Engineering (2024)](#effective-context-engineering-2024)
- [Agent Architecture Best Practices (2024-2025)](#agent-architecture-best-practices-2024-2025)
- [Citations and Source Grounding (2024)](#citations-and-source-grounding-2024)
- [Extended Thinking (2024)](#extended-thinking-2024)
- [Community Best Practices (2024-2025)](#community-best-practices-2024-2025)
- [Technique Selection Decision Tree (2025 Consensus)](#technique-selection-decision-tree-2025-consensus)
- [Measuring Prompt Effectiveness](#measuring-prompt-effectiveness)
- [Future Directions (2025 and Beyond)](#future-directions-2025-and-beyond)
- [Key Takeaways from Research](#key-takeaways-from-research)
- [Research Sources](#research-sources)
- [Keeping Current](#keeping-current)
- [Research-Backed Anti-Patterns](#research-backed-anti-patterns)
---
## Anthropic's Core Research Findings
### 1. Prompt Engineering vs Fine-Tuning (2024-2025)
**Key Finding:** Prompt engineering is preferable to fine-tuning for most use cases.
**Advantages:**
- **Speed**: Nearly instantaneous results vs hours/days for fine-tuning
- **Cost**: Uses base models, no GPU resources required
- **Flexibility**: Rapid experimentation and quick iteration
- **Data Requirements**: Works with few-shot or zero-shot learning
- **Knowledge Preservation**: Avoids catastrophic forgetting of general capabilities
- **Transparency**: Prompts are human-readable and debuggable
**When Fine-Tuning Wins:**
- Extremely consistent style requirements across millions of outputs
- Domain-specific jargon that's rare in training data
- Performance optimization for resource-constrained environments
**Source:** Anthropic Prompt Engineering Documentation (2025)
---
### 2. Long Context Window Performance (2024)
**Key Finding:** Document placement dramatically affects accuracy in long context scenarios.
**Research Results:**
- Placing documents BEFORE queries improves performance by up to 30%
- Claude experiences "lost in the middle" phenomenon like other LLMs
- XML structure helps Claude organize and retrieve from long contexts
- Quote grounding (asking Claude to quote relevant sections first) cuts through noise
**Optimal Pattern:**
```xml
<document id="1">
<metadata>...</metadata>
<content>...</content>
</document>
<!-- More documents -->
<instructions>
[Query based on documents]
</instructions>
```
**Source:** Claude Long Context Tips Documentation
---
### 3. Chain of Thought Effectiveness (2023-2025)
**Key Finding:** Encouraging step-by-step reasoning significantly improves accuracy on analytical tasks.
**Results:**
- Simple "Think step by step" phrase improves reasoning accuracy
- Explicit `<thinking>` tags provide transparency and verifiability
- Costs 2-3x output tokens but worth it for complex tasks
- Most effective for: math, logic, multi-step analysis, debugging
**Implementation Evolution:**
- 2023: Simple "think step by step" prompts
- 2024: Structured thinking with XML tags
- 2025: Extended thinking mode with configurable token budgets (16K+ tokens)
**Source:** Anthropic Prompt Engineering Techniques, Extended Thinking Documentation
---
### 4. Prompt Caching Economics (2024)
**Key Finding:** Prompt caching can reduce costs by 90% for repeated content.
**Cost Structure:**
- Cache write: 25% of standard input token cost
- Cache read: 10% of standard input token cost
- Effective savings: ~90% for content that doesn't change
**Optimal Use Cases:**
- System prompts (stable across calls)
- Reference documentation (company policies, API docs)
- Examples in multishot prompting (reused across calls)
- Long context documents (analyzed repeatedly)
**Architecture Pattern:**
```
[Stable content - caches]
└─ System prompt
└─ Reference docs
└─ Guidelines
[Variable content - doesn't cache]
└─ User query
└─ Specific inputs
```
**ROI Example:**
- 40K token system prompt + docs
- 1,000 queries/day
- Without caching: $3.60/day (Sonnet)
- With caching: $0.36/day
- Savings: $1,180/year per 1K daily queries
**Source:** Anthropic Prompt Caching Announcement
---
### 5. XML Tags Fine-Tuning (2024)
**Key Finding:** Claude has been specifically fine-tuned to pay attention to XML tags.
**Why It Works:**
- Training included examples of XML-structured prompts
- Model learned to treat tags as hard boundaries
- Prevents instruction leakage from user input
- Improves retrieval from long contexts
**Best Practices:**
- Use semantic tag names (`<instructions>`, `<context>`, `<examples>`)
- Nest tags for hierarchy when appropriate
- Consistent tag structure across prompts (helps with caching)
- Close all tags properly
**Source:** AWS ML Blog on Anthropic Prompt Engineering
---
### 6. Contextual Retrieval (2024)
**Key Finding:** Encoding context with chunks dramatically improves RAG accuracy.
**Traditional RAG Issues:**
- Chunks encoded in isolation lose surrounding context
- Semantic similarity can miss relevant chunks
- Failed retrievals lead to incorrect or incomplete responses
**Contextual Retrieval Solution:**
- Encode each chunk with surrounding context
- Combine semantic search with BM25 lexical matching
- Apply reranking for final selection
**Results:**
- 49% reduction in failed retrievals (contextual retrieval alone)
- 67% reduction with contextual retrieval + reranking
- Particularly effective for technical documentation and code
**When to Skip RAG:**
- Knowledge base < 200K tokens (fits in context window)
- With prompt caching, including full docs is cost-effective
**Source:** Anthropic Contextual Retrieval Announcement
---
### 7. Batch Processing Economics (2024)
**Key Finding:** Batch API reduces costs by 50% for non-time-sensitive workloads.
**Use Cases:**
- Periodic reports
- Bulk data analysis
- Non-urgent content generation
- Testing and evaluation
**Combined Savings:**
- Batch processing: 50% cost reduction
- Plus prompt caching: Additional 90% on cached content
- Combined potential: 95% cost reduction vs real-time without caching
**Source:** Anthropic Batch API Documentation
---
### 8. Model Capability Tiers (2024-2025)
**Research Finding:** Different tasks have optimal model choices based on complexity vs cost.
**Claude Haiku 4.5 (Released Oct 2024):**
- Performance: Comparable to Sonnet 4
- Speed: ~2x faster than Sonnet 4
- Cost: 1/3 of Sonnet 4 ($0.25/$1.25 per M tokens)
- Best for: High-volume simple tasks, extraction, formatting
**Claude Sonnet 4.5 (Released Oct 2024):**
- Performance: State-of-the-art coding agent (77.2% SWE-bench)
- Sustained attention: 30+ hours on complex tasks
- Cost: $3/$15 per M tokens
- Best for: Most production workloads, balanced use cases
**Claude Opus 4:**
- Performance: Maximum capability
- Cost: $15/$75 per M tokens (5x Sonnet)
- Best for: Novel problems, deep reasoning, research
**Architectural Implication:**
- Orchestrator (Sonnet) + Executor subagents (Haiku) = optimal cost/performance
- Task routing based on complexity assessment
- Dynamic model selection within workflows
**Source:** Anthropic Model Releases, TechCrunch Coverage
---
## Effective Context Engineering (2024)
**Key Research:** Managing attention budget is as important as prompt design.
### The Attention Budget Problem
- LLMs have finite capacity to process and integrate information
- Performance degrades with very long contexts ("lost in the middle")
- n² pairwise relationships for n tokens strains attention mechanism
### Solutions:
**1. Compaction**
- Summarize conversation near context limit
- Reinitiate with high-fidelity summary
- Preserve architectural decisions, unresolved bugs, implementation details
- Discard redundant tool outputs
**2. Structured Note-Taking**
- Maintain curated notes about decisions, findings, state
- Reference notes across context windows
- More efficient than reproducing conversation history
**3. Multi-Agent Architecture**
- Distribute work across agents with specialized contexts
- Each maintains focused context on their domain
- Orchestrator coordinates without managing all context
**4. Context Editing (2024)**
- Automatically clear stale tool calls and results
- Preserve conversation flow
- 84% token reduction in 100-turn evaluations
- 29% performance improvement on agentic search tasks
**Source:** Anthropic Engineering Blog - Effective Context Engineering
---
## Agent Architecture Best Practices (2024-2025)
**Research Consensus:** Successful agents follow three core principles.
### 1. Simplicity
- Do exactly what's needed, no more
- Avoid unnecessary abstraction layers
- Frameworks help initially, but production often benefits from basic components
### 2. Transparency
- Show explicit planning steps
- Allow humans to verify reasoning
- Enable intervention when plans seem misguided
- "Agent shows its work" principle
### 3. Careful Tool Crafting
- Thorough tool documentation with examples
- Clear descriptions of when to use each tool
- Tested tool integrations
- Agent-computer interface as first-class design concern
**Anti-Pattern:** Framework-heavy implementations that obscure decision-making
**Recommended Pattern:**
- Start with frameworks for rapid prototyping
- Gradually reduce abstractions for production
- Build with basic components for predictability
**Source:** Anthropic Research - Building Effective Agents
---
## Citations and Source Grounding (2024)
**Research Finding:** Built-in citation capabilities outperform most custom implementations.
**Citations API Benefits:**
- 15% higher recall accuracy vs custom solutions
- Automatic sentence-level chunking
- Precise attribution to source documents
- Critical for legal, academic, financial applications
**Use Cases:**
- Legal research requiring source verification
- Academic writing with proper attribution
- Fact-checking workflows
- Financial analysis with auditable sources
**Source:** Claude Citations API Announcement
---
## Extended Thinking (2024)
**Capability:** Claude can allocate extended token budget for reasoning before responding.
**Key Parameters:**
- Thinking budget: 16K+ tokens recommended for complex tasks
- Configurable based on task complexity
- Trade latency for accuracy on hard problems
**Use Cases:**
- Complex math problems
- Novel coding challenges
- Multi-step reasoning tasks
- Analysis requiring sustained attention
**Combined with Tools (Beta):**
- Alternate between reasoning and tool invocation
- Reason about available tools, invoke, analyze results, adjust reasoning
- More sophisticated than fixed reasoning → execution sequences
**Source:** Claude Extended Thinking Documentation
---
## Community Best Practices (2024-2025)
### Disable Auto-Compact in Claude Code
**Finding:** Auto-compact can consume 45K tokens (22.5% of context window) before coding begins.
**Recommendation:**
- Turn off auto-compact: `/config` → toggle off
- Use `/clear` after 1-3 messages to prevent bloat
- Run `/clear` immediately after disabling to reclaim tokens
- Regain 88.1% of context window for productive work
**Source:** Shuttle.dev Claude Code Best Practices
### CLAUDE.md Curation
**Finding:** Auto-generated CLAUDE.md files are too generic.
**Best Practice:**
- Manually curate project-specific patterns
- Keep under 100 lines per file
- Include non-obvious relationships
- Document anti-patterns to avoid
- Optimize for AI agent understanding, not human documentation
**Source:** Claude Code Best Practices, Anthropic Engineering
### Custom Slash Commands as Infrastructure
**Finding:** Repeated prompting patterns benefit from reusable commands.
**Best Practice:**
- Store in `.claude/commands/` for project-level
- Store in `~/.claude/commands/` for user-level
- Check into version control for team benefit
- Use `$ARGUMENTS` and `$1, $2, etc.` for parameters
- Encode team best practices as persistent infrastructure
**Source:** Claude Code Documentation
---
## Technique Selection Decision Tree (2025 Consensus)
Based on aggregated research and community feedback:
```
Start: Define Task
┌──────────────┴──────────────┐
│ │
Complexity? Repeated Use?
│ │
┌───┴───┐ ┌────┴────┐
Simple Medium Complex Yes No
│ │ │ │ │
Clarity +XML +Role Cache One-off
+CoT +CoT Structure Design
+Examples +XML
+Tools
Token Budget?
┌───┴───┐
Tight Flexible
│ │
Skip Add CoT
CoT Examples
Format Critical?
┌───┴────┐
Yes No
│ │
+Prefill Skip
+Examples
```
---
## Measuring Prompt Effectiveness
**Research Recommendation:** Systematic evaluation before and after prompt engineering.
### Metrics to Track
**Accuracy:**
- Correctness of outputs
- Alignment with success criteria
- Error rates
**Consistency:**
- Output format compliance
- Reliability across runs
- Variance in responses
**Cost:**
- Tokens per request
- $ cost per request
- Caching effectiveness
**Latency:**
- Time to first token
- Total response time
- User experience impact
### Evaluation Framework
1. **Baseline:** Measure current prompt performance
2. **Iterate:** Apply one technique at a time
3. **Measure:** Compare metrics to baseline
4. **Keep or Discard:** Retain only improvements
5. **Document:** Record which techniques help for which tasks
**Anti-Pattern:** Applying all techniques without measuring effectiveness
---
## Future Directions (2025 and Beyond)
### Emerging Trends
**1. Agent Capabilities**
- Models maintaining focus for 30+ hours (Sonnet 4.5)
- Improved context awareness and self-management
- Better tool use and reasoning integration
**2. Cost Curve Collapse**
- Haiku 4.5 matches Sonnet 4 at 1/3 cost
- Enables new deployment patterns (parallel subagents)
- Economic feasibility of agent orchestration
**3. Multimodal Integration**
- Vision + text for document analysis
- 60% reduction in document processing time
- Correlation of visual and textual information
**4. Safety and Alignment**
- Research on agentic misalignment
- Importance of human oversight at scale
- System design for ethical constraints
**5. Standardization**
- Model Context Protocol (MCP) for tool integration
- Reduced custom integration complexity
- Ecosystem of third-party tools
---
## Key Takeaways from Research
1. **Simplicity wins**: Start minimal, add complexity only when justified by results
2. **Structure scales**: XML tags become essential as complexity increases
3. **Thinking costs but helps**: 2-3x tokens for reasoning, worth it for analysis
4. **Caching transforms economics**: 90% savings makes long prompts feasible
5. **Placement matters**: Documents before queries, 30% better performance
6. **Tools need docs**: Clear descriptions → correct usage
7. **Agents need transparency**: Show reasoning, enable human verification
8. **Context is finite**: Manage attention budget deliberately
9. **Measure everything**: Remove techniques that don't improve outcomes
10. **Economic optimization**: Right model for right task (Haiku → Sonnet → Opus)
---
## Research Sources
- Anthropic Prompt Engineering Documentation (2024-2025)
- Anthropic Engineering Blog - Context Engineering (2024)
- Anthropic Research - Building Effective Agents (2024)
- Claude Code Best Practices (Anthropic, 2024)
- Shuttle.dev Claude Code Analysis (2024)
- AWS ML Blog - Anthropic Techniques (2024)
- Contextual Retrieval Research (Anthropic, 2024)
- Model Release Announcements (Sonnet 4.5, Haiku 4.5)
- Citations API Documentation (2024)
- Extended Thinking Documentation (2024)
- Community Best Practices (Multiple Sources, 2024-2025)
---
## Keeping Current
**Best Practices:**
- Follow Anthropic Engineering blog for latest research
- Monitor Claude Code documentation updates
- Track community implementations (GitHub, forums)
- Experiment with new capabilities as released
- Measure impact of new techniques on your use cases
**Resources:**
- https://www.anthropic.com/research
- https://www.anthropic.com/engineering
- https://docs.claude.com/
- https://code.claude.com/docs
- Community: r/ClaudeAI, Anthropic Discord
---
## Research-Backed Anti-Patterns
Based on empirical findings, avoid:
**Ignoring Document Placement** - 30% performance loss
**Not Leveraging Caching** - 10x unnecessary costs
**Over-Engineering Simple Tasks** - Worse results + higher cost
**Framework Over-Reliance** - Obscures decision-making
**Skipping Measurement** - Can't validate improvements
**One-Size-Fits-All Prompts** - Suboptimal for specific tasks
**Vague Tool Documentation** - Poor tool selection
**Ignoring Context Budget** - Performance degradation
**No Agent Transparency** - Debugging nightmares
**Wrong Model for Task** - Overpaying or underperforming
---
This research summary reflects the state of Anthropic's prompt engineering best practices as of 2025, incorporating both official research and validated community findings.