# Meta Prompt Engineering Methodology
**When to use this methodology:** You've read [template.md](template.md) and need advanced techniques for:
- Diagnosing and fixing failing prompts systematically
- Optimizing prompts for production use (cost, latency, quality)
- Building multi-prompt workflows and self-refinement loops
- Adapting prompts across domains or use cases
- Debugging complex failure modes that basic fixes don't resolve
**If your prompt is simple:** Use [template.md](template.md) directly. This methodology is for complex, high-stakes, or production prompts.
---
## Table of Contents
1. [Diagnostic Framework](#1-diagnostic-framework)
2. [Advanced Patterns](#2-advanced-patterns)
3. [Optimization Techniques](#3-optimization-techniques)
4. [Prompt Debugging](#4-prompt-debugging)
5. [Multi-Prompt Workflows](#5-multi-prompt-workflows)
6. [Domain Adaptation](#6-domain-adaptation)
7. [Production Deployment](#7-production-deployment)
---
## 1. Diagnostic Framework
### When Simple Template Is Enough
**Indicators:** One-off task, low-stakes, subjective quality, single user
**Action:** Use [template.md](template.md), iterate once or twice, done.
### When You Need This Methodology
**Indicators:** Prompt fails >30% of runs, high-stakes, multi-user, complex reasoning, production deployment
**Action:** Use this methodology systematically.
### Failure Mode Diagnostic Tree
```
Is output inconsistent?
├─ YES → Format/constraints missing? → Add template and constraints
│        Role unclear?               → Add specific role with expertise
│        Still failing?              → Run optimization (Section 3)
└─ NO, but quality poor?
   ├─ Too short/long  → Add length constraints per section
   ├─ Wrong tone      → Define audience + formality level
   ├─ Hallucination   → Add uncertainty expression (Section 4.2)
   ├─ Missing info    → List required elements explicitly
   └─ Poor reasoning  → Add chain-of-thought (Section 2.1)
```
---
## 2. Advanced Patterns
### 2.1 Chain-of-Thought (CoT) - Deep Dive
**When to use:** Complex reasoning, math/logic, multi-step inference, debugging.
**Advanced CoT with Verification:**
```
Solve this problem using the following process:
Step 1: Understand - Restate problem, identify givens vs unknowns, note constraints
Step 2: Plan - List 2+ approaches, evaluate feasibility, choose best with rationale
Step 3: Execute - Solve step-by-step showing work, check each step, backtrack if wrong
Step 4: Verify - Sanity check, test edge cases, try alternative method to cross-check
Step 5: Present - Summarize reasoning, state final answer, note assumptions/limitations
```
**Use advanced CoT when:** 50%+ of attempts fail without verification, or when errors compound (math, code, logic).
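If you call the model programmatically, the scaffold can be templated once and reused. A minimal sketch; the helper name `build_cot_prompt` is illustrative, not part of this methodology:
```python
# Minimal sketch: wrap any task in the 5-step verification scaffold above.
COT_STEPS = [
    "Step 1: Understand - Restate the problem, identify givens vs unknowns, note constraints.",
    "Step 2: Plan - List 2+ approaches, evaluate feasibility, choose one with rationale.",
    "Step 3: Execute - Solve step-by-step showing work, check each step, backtrack if wrong.",
    "Step 4: Verify - Sanity check, test edge cases, cross-check with an alternative method.",
    "Step 5: Present - Summarize reasoning, state the final answer, note assumptions/limitations.",
]

def build_cot_prompt(task: str) -> str:
    """Return a prompt that forces the model through explicit verification."""
    steps = "\n".join(COT_STEPS)
    return f"Solve this problem using the following process:\n\n{steps}\n\nProblem: {task}"
```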
### 2.2 Self-Consistency (Ensemble CoT)
**Pattern:**
```
Generate 3 independent solutions:
Solution 1: [First principles]
Solution 2: [Alternative method]
Solution 3: [Focus on edge cases]
Compare: Where agree? (high confidence) Where differ? (investigate) Most robust? (why?)
Final answer: [Synthesize, note confidence]
```
**Cost: 3x inference.** Use when correctness outweighs cost (medical, financial, legal) or when you need confidence calibration.
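A minimal sketch of the ensemble, assuming a caller-supplied `generate` function that sends a prompt to your model and returns text, plus a naive `extract_answer` parser you would adapt to your output format (both are illustrative assumptions):
```python
# Self-consistency sketch: one generation per framing, then majority vote.
from collections import Counter
from typing import Callable, List

FRAMINGS = [
    "Solve from first principles.",
    "Solve with an alternative method.",
    "Solve with particular attention to edge cases.",
]

def extract_answer(reply: str) -> str:
    """Naive parser: take whatever follows the last 'ANSWER:' marker."""
    return reply.rsplit("ANSWER:", 1)[-1].strip()

def self_consistent_answer(problem: str, generate: Callable[[str], str],
                           framings: List[str] = FRAMINGS) -> tuple:
    """Return (majority answer, agreement ratio) across independent framings."""
    answers = []
    for framing in framings:
        prompt = f"{framing}\n\nProblem: {problem}\n\nEnd your reply with 'ANSWER: <answer>'."
        answers.append(extract_answer(generate(prompt)))
    top_answer, votes = Counter(answers).most_common(1)[0]
    return top_answer, votes / len(answers)   # agreement ratio as a confidence proxy
```
With three framings, 3/3 agreement maps to high confidence; 2/3 or lower is a signal to investigate before trusting the answer.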
### 2.3 Least-to-Most Prompting
**For complex problems too large to solve reliably in a single pass:**
```
Stage 1: Simplest case (e.g., n=1) → Solve
Stage 2: Add one complexity (e.g., n=2) → Solve building on Stage 1
Stage 3: Full complexity → Solve using insights from 1-2
```
**Use cases:** Math proofs, recursive algorithms, scaling strategies, learning complex topics.
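A least-to-most sketch, again assuming a caller-supplied `generate` function; the stage wording is illustrative:
```python
# Least-to-most sketch: solve staged sub-problems, feeding earlier solutions
# into each later prompt.
from typing import Callable, List

def least_to_most(stages: List[str], generate: Callable[[str], str]) -> str:
    """Solve `stages` in order; each prompt includes all previous solutions."""
    solved: List[str] = []
    for i, stage in enumerate(stages, start=1):
        if solved:
            context = "\n\n".join(
                f"Stage {j} solution:\n{s}" for j, s in enumerate(solved, start=1)
            )
            prompt = (f"{context}\n\nStage {i}: {stage}\n"
                      f"Solve this stage, building on the earlier solutions.")
        else:
            prompt = f"Stage {i}: {stage}\nSolve this simplest case first."
        solved.append(generate(prompt))
    return solved[-1]   # final stage uses insights from all earlier stages
```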
### 2.4 Constitutional AI (Safety-First)
**Pattern for high-risk domains:**
```
[Complete task]
Critique your response:
1. Potential harms? (physical, financial, reputational, psychological)
2. Bias check? (unfairly favor/disfavor any group)
3. Accuracy? (claims verifiable? flag speculation)
4. Completeness? (missing caveats/warnings)
Revise: Fix issues, add warnings, hedge uncertain claims
If fundamental safety concerns remain: "Cannot provide due to [concern]"
```
**Required for:** Medical, legal, financial advice, safety-critical engineering, advice affecting vulnerable populations.
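One way to run this pattern is as two passes over the same model: draft, then critique-and-revise. A sketch, assuming a caller-supplied `generate` function; the critique text mirrors the pattern above and should be tuned per domain:
```python
# Two-pass critique-and-revise sketch for high-risk domains.
from typing import Callable

CRITIQUE = (
    "Critique your response:\n"
    "1. Potential harms? (physical, financial, reputational, psychological)\n"
    "2. Bias check? (does it unfairly favor/disfavor any group)\n"
    "3. Accuracy? (are claims verifiable; flag speculation)\n"
    "4. Completeness? (missing caveats/warnings)\n"
    "Then revise: fix issues, add warnings, hedge uncertain claims.\n"
    "If fundamental safety concerns remain, reply: 'Cannot provide due to [concern]'."
)

def constitutional_answer(task: str, generate: Callable[[str], str]) -> str:
    draft = generate(task)                                     # pass 1: complete the task
    revised = generate(f"Task: {task}\n\nDraft response:\n{draft}\n\n{CRITIQUE}")
    return revised                                             # pass 2: critique + revise
```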
---
## 3. Optimization Techniques
### 3.1 Iterative Refinement Protocol
**Cycle:**
1. Baseline: Run 10x, measure consistency, quality, time
2. Identify: Most common failure (≥3/10 runs)
3. Hypothesize: Why? (missing constraint, ambiguous step, wrong role)
4. Intervene: Add specific fix
5. Test: Run 10x, compare to baseline
6. Iterate: Until quality threshold met or diminishing returns
**Metrics:**
- Consistency: % meeting requirements (target ≥80%)
- Length variance: coefficient of variation (σ/μ) of word count (target <20%)
- Format compliance: % matching structure (target ≥90%)
- Quality rating: Human 1-5 scale (target ≥4.0 avg, σ<1.0)
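A sketch of how these metrics can be computed from a batch of saved runs; `meets_requirements`, `matches_format`, and the human `ratings` are caller-supplied inputs, included here only for illustration:
```python
# Compute the refinement metrics for one prompt version from saved outputs.
import statistics
from typing import Callable, List

def refinement_metrics(outputs: List[str],
                       meets_requirements: Callable[[str], bool],
                       matches_format: Callable[[str], bool],
                       ratings: List[float]) -> dict:
    word_counts = [len(o.split()) for o in outputs]
    mean_len = statistics.mean(word_counts)
    return {
        "consistency": sum(map(meets_requirements, outputs)) / len(outputs),    # target >= 0.80
        "length_cv": statistics.pstdev(word_counts) / mean_len,                 # target < 0.20
        "format_compliance": sum(map(matches_format, outputs)) / len(outputs),  # target >= 0.90
        "quality_mean": statistics.mean(ratings),                               # target >= 4.0
        "quality_sd": statistics.pstdev(ratings),                               # target < 1.0
    }
```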
### 3.2 A/B Testing Prompts
**Setup:** Variant A (current), Variant B (modification), 20 runs (10 each), define success metric
**Analyze:** Compare distributions, statistical test (t-test, F-test), review failures
**Decide:** If B is both statistically better (p<0.05) and meaningfully better (>10% on your success metric), adopt B
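A comparison sketch using Welch's t-test on per-run scores from whatever success metric you defined; the thresholds mirror the rule above:
```python
# A/B comparison sketch: Welch's t-test on per-run scores for variants A and B.
from scipy import stats

def compare_variants(scores_a: list, scores_b: list) -> str:
    t_stat, p_value = stats.ttest_ind(scores_b, scores_a, equal_var=False)  # Welch's t-test
    mean_a = sum(scores_a) / len(scores_a)
    mean_b = sum(scores_b) / len(scores_b)
    relative_gain = (mean_b - mean_a) / mean_a
    if p_value < 0.05 and relative_gain > 0.10:
        return f"Adopt B (gain {relative_gain:.0%}, p={p_value:.3f})"
    return f"Keep A (gain {relative_gain:.0%}, p={p_value:.3f})"
```
With only 10 runs per variant, statistical power is limited; treat the p-value as a rough signal and gather more samples before a production switch.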
### 3.3 Prompt Compression
**Remove redundancy:**
- Before: "You must include citations. Citations should be in (Author, Year) format. Every factual claim needs a citation."
- After: "Cite all factual claims in (Author, Year) format."
**Use examples instead of rules:** Instead of 10 formatting rules, show 2 examples
**External knowledge:** write "Follow Python PEP 8" instead of embedding the rules
**Tradeoff:** Compression can reduce clarity. Test thoroughly.
---
## 4. Prompt Debugging
### 4.1 Failure Taxonomy
| Failure Type | Symptom | Fix |
|--------------|---------|-----|
| **Format error** | Wrong structure | Add explicit template with example |
| **Length error** | Too short/long | Add min-max per section |
| **Tone error** | Wrong formality | Define target audience + formality |
| **Content omission** | Missing required elements | List "Must include: [X, Y, Z]" |
| **Hallucination** | False facts | Add "If unsure, say 'I don't know'" |
| **Reasoning error** | Logical jumps | Add chain-of-thought |
| **Bias** | Stereotypes | Add "Consider multiple viewpoints" |
| **Inconsistency** | Different outputs for same input | Add constraints, examples |
### 4.2 Anti-Hallucination Techniques (Layered Defense)
**Layer 1:** "If you don't know, say 'I don't know.' Do not guess."
**Layer 2:** Format with confidence: `[Claim] - Source: [Citation or "speculation"] - Confidence: High/Medium/Low`
**Layer 3:** Self-check: "Review each claim: Verifiable? Or speculation (labeled as such)?"
**Layer 4:** Example: "Good: 'Paris is France's capital (High)' Bad: 'Lyon is France's capital' (an incorrect claim stated as fact)"
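The layers compose additively, so they can be appended to any base prompt. A minimal sketch; the layer wording mirrors the list above and the helper name is illustrative:
```python
# Compose the four defensive layers into one prompt suffix.
ANTI_HALLUCINATION_LAYERS = [
    "If you don't know, say 'I don't know.' Do not guess.",                     # Layer 1
    "Format each claim as: [Claim] - Source: [Citation or 'speculation'] "
    "- Confidence: High/Medium/Low",                                            # Layer 2
    "Before answering, review each claim: is it verifiable, or speculation "
    "(labeled as such)?",                                                       # Layer 3
    "Example. Good: 'Paris is France's capital (High)'. "
    "Bad: 'Lyon is France's capital' stated as fact.",                          # Layer 4
]

def with_anti_hallucination(base_prompt: str) -> str:
    return base_prompt + "\n\n" + "\n".join(ANTI_HALLUCINATION_LAYERS)
```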
### 4.3 Debugging Process
1. **Reproduce:** Run 5x, confirm failure rate, save outputs
2. **Minimal failing example:** Simplify input, remove unrelated sections, isolate failing instruction
3. **Hypothesis:** What's missing/ambiguous/wrong?
4. **Targeted fix:** Change one thing, test minimal example, then test full prompt
5. **Regression test:** Ensure fix didn't break other cases, test edge cases
---
## 5. Multi-Prompt Workflows
### 5.1 Sequential Chaining
**Pattern:** Prompt 1 (generate ideas) → Prompt 2 (evaluate/filter) → Prompt 3 (develop top 3)
**When:** Complex tasks proceed in stages, early steps inform later ones, different roles needed (creator→critic→developer)
**Example:** Outline → Draft → Edit for content writing
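A chaining sketch for the outline → draft → edit example, assuming a caller-supplied `generate` function; the stage templates are illustrative:
```python
# Sequential chaining sketch: the output of each stage feeds the next prompt.
from typing import Callable, List

def run_chain(topic: str, generate: Callable[[str], str]) -> str:
    stages: List[str] = [
        "You are an editor planning an article. Produce a structured outline for: {x}",
        "You are a writer. Draft the article from this outline:\n{x}",
        "You are a critical editor. Tighten and correct this draft:\n{x}",
    ]
    output = topic
    for template in stages:
        output = generate(template.format(x=output))   # stage N output feeds stage N+1
    return output
```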
### 5.2 Self-Refinement Loop
**Pattern:** Generator (create) → Critic (identify flaws) → Refiner (revise) → Repeat until approval or max 3 iterations
**Cost:** 2-4x inference. Use for high-stakes outputs (user-facing content, production code).
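A sketch of the loop, assuming a caller-supplied `generate` function; the 'APPROVED' sentinel is an illustrative convention for the critic's sign-off, not part of the methodology:
```python
# Self-refinement sketch: generate, critique, revise; stop on approval or after 3 rounds.
from typing import Callable

def refine(task: str, generate: Callable[[str], str], max_iters: int = 3) -> str:
    draft = generate(task)
    for _ in range(max_iters):
        critique = generate(
            f"Task: {task}\n\nDraft:\n{draft}\n\n"
            "List concrete flaws. If none remain, reply exactly 'APPROVED'."
        )
        if critique.strip() == "APPROVED":
            break
        draft = generate(
            f"Task: {task}\n\nDraft:\n{draft}\n\nCritique:\n{critique}\n\n"
            "Revise the draft to address every point in the critique."
        )
    return draft
```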
### 5.3 Ensemble Methods
**Majority vote:** Run 5x, take majority answer at each decision point (classification, multiple-choice, binary)
**Ranker fusion:** Prompt A (top 10) + Prompt B (top 10, different framing) → Prompt C ranks the combined list → Output top 5
**Use case:** Recommendation systems, content curation, prioritization.
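Majority vote is just a `Counter` over repeated runs (compare Section 2.2); ranker fusion needs a little more plumbing. A sketch, assuming caller-supplied `generate` and `parse_items` helpers and illustrative prompt wording:
```python
# Ranker-fusion sketch: two differently framed prompts propose candidates,
# a third prompt ranks the combined pool, and the top 5 are returned.
from typing import Callable, List

def ranker_fusion(query: str,
                  generate: Callable[[str], str],
                  parse_items: Callable[[str], List[str]]) -> List[str]:
    pool_a = parse_items(generate(f"List 10 recommendations for: {query}"))
    pool_b = parse_items(generate(f"List 10 unconventional options for: {query}"))
    combined = list(dict.fromkeys(pool_a + pool_b))           # dedupe, keep order
    ranking_prompt = (f"Rank these candidates for '{query}', best first, one per line:\n"
                      + "\n".join(combined))
    ranked = parse_items(generate(ranking_prompt))
    return ranked[:5]
```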
---
## 6. Domain Adaptation
### 6.1 Transferring Prompts Across Domains
**Challenge:** Prompt for Domain A fails in Domain B.
**Adaptation checklist:**
- [ ] Update role to domain expert
- [ ] Replace examples with domain-appropriate ones
- [ ] Add domain-specific constraints (citation format, regulatory compliance)
- [ ] Update quality checks for domain risks (medical: patient safety, legal: liability)
- [ ] Adjust terminology ("user"→"patient", "feature"→"intervention")
### 6.2 Domain-Specific Quality Criteria
**Software:** Security (no SQL injection, XSS), testing (≥80% coverage), style (linting, naming)
**Medical:** Evidence (peer-reviewed), safety (risks/contraindications), scope ("consult a doctor" disclaimer)
**Legal:** Jurisdiction, disclaimer (not legal advice), citations (case law, statutes)
**Finance:** Disclaimer (not financial advice), risk (uncertainties, worst-case), data (recent, note dates)
---
## 7. Production Deployment
### 7.1 Versioning
**Track changes:**
```
# v1.0 (2024-01-15): Initial. Hallucination ~20%
# v1.1 (2024-01-20): Added anti-hallucination. Hallucination ~8%
# v1.2 (2024-01-25): Added format template. Consistency 72%→89%
```
**Rollback plan:** Keep previous version. If v1.2 fails in production, revert to v1.1.
### 7.2 Monitoring
**Automated:** Length (track tokens, flag outliers >2σ), format (regex check), keywords (flag missing required terms)
**Human review:** Sample 5-10 daily, rate on rubric, report trends
**Alerting:** If failure rate >20%, alert. If latency >2x baseline, check for prompt length creep.
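A sketch of the automated layer; the format regex and required terms are placeholders to adapt to your own output contract:
```python
# Automated monitoring sketch: length outliers (>2 sigma vs. baseline),
# regex format check, and required-keyword check.
import re
import statistics
from typing import List

def check_output(text: str, baseline_lengths: List[int],
                 format_pattern: str = r"^## Summary",             # placeholder regex
                 required_terms: tuple = ("risk", "recommendation")) -> List[str]:
    issues = []
    mean = statistics.mean(baseline_lengths)
    sd = statistics.pstdev(baseline_lengths)
    if sd and abs(len(text.split()) - mean) > 2 * sd:
        issues.append("length outlier (>2 sigma)")
    if not re.search(format_pattern, text, flags=re.MULTILINE):
        issues.append("format check failed")
    missing = [t for t in required_terms if t.lower() not in text.lower()]
    if missing:
        issues.append(f"missing required terms: {missing}")
    return issues   # empty list means the output passed all automated checks
```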
### 7.3 Graceful Degradation
```
Try: Primary prompt (detailed, high-quality)
↓ If fails (timeout, error, format issue)
Try: Secondary prompt (simplified, faster)
↓ If fails
Return: Error message + human escalation
```
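A fallback sketch; `generate`, `is_valid`, and the broad exception handling are illustrative, so swap in your client's real error and timeout types:
```python
# Graceful-degradation sketch: primary prompt, then simplified fallback, then escalate.
from typing import Callable, Optional

def answer_with_fallback(task: str,
                         primary_prompt: str,
                         secondary_prompt: str,
                         generate: Callable[[str], str],
                         is_valid: Callable[[str], bool]) -> Optional[str]:
    for prompt in (primary_prompt, secondary_prompt):
        try:
            output = generate(f"{prompt}\n\n{task}")
            if is_valid(output):               # e.g. the format/regex check from 7.2
                return output
        except Exception:                      # timeout or API error in practice
            continue
    return None   # caller should surface an error message and escalate to a human
```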
### 7.4 Cost-Quality Tradeoffs
**Shorter prompts (30-50% cost reduction, 10-20% quality drop):**
- When: High volume, low-stakes, latency-sensitive
- How: Remove examples, compress constraints, use implicit knowledge
**Longer prompts (50-100% cost increase, 15-30% quality/consistency improvement):**
- When: High-stakes, complex reasoning, consistency > cost
- How: Add examples, chain-of-thought, verification steps, domain knowledge
**Temperature tuning:**
- 0: Deterministic, high consistency (production, low creativity)
- 0.3-0.5: Balanced (good default)
- 0.7-1.0: High variability, creative (brainstorming, diverse outputs, less consistent)
**Recommendation:** Start at 0.3, test 10 runs, adjust.
---
## Quick Decision Trees
### "Should I optimize further?"
```
Meeting requirements >80% of time?
├─ YES → Stop (diminishing returns)
└─ NO → Optimization effort <1 hour?
         ├─ YES → Optimize (Section 3)
         └─ NO → Production use case?
                  ├─ YES → Worth it, optimize
                  └─ NO → Accept quality or simplify task
```
### "Should I use multi-prompt workflow?"
```
Task achievable in one prompt with acceptable quality?
├─ YES → Use single prompt (simpler)
└─ NO → Task naturally decomposes into stages?
         ├─ YES → Sequential chaining (Section 5.1)
         └─ NO → Quality insufficient with single prompt?
                  ├─ YES → Self-refinement (Section 5.2)
                  └─ NO → Accept single prompt or reframe
```
---
## Summary: When to Use What
| Technique | Use When | Cost | Complexity |
|-----------|----------|------|------------|
| **Basic template** | Simple, one-off | 1x | Low |
| **Chain-of-thought** | Complex reasoning | 1.5x | Medium |
| **Self-consistency** | Correctness critical | 3x | Medium |
| **Self-refinement** | High-stakes, iterative | 2-4x | High |
| **Sequential chaining** | Natural stages | 1.5-2x | Medium |
| **A/B testing** | Production optimization | 2x (one-time) | Medium |
| **Full methodology** | Production, high-stakes | Varies | High |