---
name: wolf-verification
description: Three-layer verification architecture (CoVe, HSP, RAG) for self-verification, fact-checking, and hallucination prevention
version: 1.1.0
category: quality-assurance
triggers:
  - verification
  - fact checking
  - hallucination detection
  - self verification
  - CoVe
  - HSP
  - RAG grounding
dependencies:
  - wolf-principles
  - wolf-governance
size: large
---

# Wolf Verification Framework

Three-layer verification architecture for self-verification, systematic fact-checking, and hallucination prevention. Based on ADR-043 (Verification Architecture).

## Overview

Wolf's verification system combines four complementary approaches:

1. **CoVe (Chain of Verification)** - Systematic fact-checking by breaking claims into verifiable steps
2. **HSP (Hierarchical Safety Prompts)** - Multi-level safety validation with security-first design
3. **RAG Grounding** - Contextual evidence retrieval for claim validation
4. **Verification-First** - Generate a verification checklist BEFORE creating the response

**Key Principle**: Agents **MUST** verify their own outputs before delivery. We have built sophisticated self-verification tools - use them!

---

## 🔍 Chain of Verification (CoVe)

### Purpose
Systematic fact-checking framework that breaks complex claims into independently verifiable atomic steps, creating transparent audit trails.

### Core Concepts

#### Step Types (5 types)
```python
class StepType(Enum):
    FACTUAL = "factual"              # Verifiable against authoritative sources
    LOGICAL = "logical"              # Reasoning steps that follow from premises
    COMPUTATIONAL = "computational"  # Mathematical/algorithmic calculations
    OBSERVATIONAL = "observational"  # Requires empirical validation
    DEFINITIONAL = "definitional"    # Meanings, categories, classifications
```

#### Step Artifacts
Each verification step produces structured output:
```python
StepArtifact(
    type=StepType.FACTUAL,
    claim="Einstein was born in Germany",
    confidence=0.95,
    reasoning="Verified against biographical records",
    evidence_sources=["biography.com", "britannica.com"],
    dependencies=[]  # List of prerequisite step IDs
)
```

### Basic Usage

```python
from src.verification.cove import plan, verify, render, verify_claim

# Simple end-to-end verification
claim = "Albert Einstein won the Nobel Prize in Physics in 1921"
report = await verify_claim(claim)
print(f"Confidence: {report.overall_confidence:.1%}")
print(report.summary)

# Step-by-step approach
verification_plan = plan(claim)              # Break into atomic steps
report = await verify(verification_plan)     # Execute verification
markdown_audit = render(report, "markdown")  # Generate audit trail
```

### Advanced Usage with Custom Evidence Provider

```python
from src.verification.cove import CoVeConfig, CoVeVerifier, EvidenceProvider, StepType

class WikipediaEvidenceProvider(EvidenceProvider):
    async def get_evidence(self, claim: str, step_type: StepType):
        # Your Wikipedia API integration
        return [{
            "source": "wikipedia",
            "content": f"Wikipedia evidence for: {claim}",
            "confidence": 0.85,
            "url": f"https://wikipedia.org/search?q={claim}"
        }]

    def get_reliability_score(self):
        return 0.9  # Provider reputation (0.0-1.0)

# Use custom provider
config = CoVeConfig()
verifier = CoVeVerifier(config, WikipediaEvidenceProvider())
report = await verifier.verify(verification_plan)
```

### Verification Workflow

```
1. PLAN: Break claim into atomic verifiable steps
   └─> Identify step types (factual, logical, etc.)
   └─> Establish step dependencies
   └─> Assign verification methods

2. VERIFY: Execute verification for each step
   └─> Gather evidence from providers
   └─> Validate against sources
   └─> Compute confidence scores
   └─> Track dependencies

3. AGGREGATE: Combine step results
   └─> Calculate overall confidence
   └─> Identify weak points
   └─> Generate audit trail

4. RENDER: Produce verification report
   └─> Markdown format for humans
   └─> JSON format for automation
   └─> Summary statistics
```
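The AGGREGATE step above can be sketched as a simple confidence combination. This is an illustrative heuristic only, not the actual `cove.py` implementation; the real aggregation may weight steps differently:

```python
def aggregate_confidence(step_confidences: list[float]) -> float:
    # A chain of claims is only as strong as its weakest links: take the
    # mean, but cap the result near the minimum step confidence.
    # Illustrative heuristic, not the shipped algorithm.
    if not step_confidences:
        return 0.0
    mean = sum(step_confidences) / len(step_confidences)
    return min(mean, min(step_confidences) + 0.1)

print(aggregate_confidence([0.9, 0.8, 0.5]))  # 0.6 (capped by weakest step)
```

The cap ensures one weak step cannot be averaged away by several strong ones, which mirrors the "identify weak points" goal of the AGGREGATE phase.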

### Performance Targets
- **Speed**: <200ms per verification (5 steps average)
- **Accuracy**: ≥90% citation coverage target
- **Thoroughness**: All factual claims verified

### When to Use CoVe
- ✅ Factual claims about events, dates, names
- ✅ Technical specifications or API documentation
- ✅ Historical facts or timelines
- ✅ Statistical claims or metrics
- ✅ Citations or references
- ❌ Subjective opinions or preferences
- ❌ Future predictions (use "hypothetical" step type)
- ❌ Creative content (fiction, poetry)

**Implementation Location**: `/src/verification/cove.py`
**Documentation**: `/docs/public/verification/chain-of-verification.md`

---

## 🛡️ Hierarchical Safety Prompts (HSP)

### Purpose
Multi-level safety validation system with sentence-level claim extraction, entity disambiguation, and security-first design.

### Key Features
- **Sentence-level Processing**: Extract and validate claims from individual sentences
- **Entity Disambiguation**: Handle entity linking with disambiguation support
- **Security-First Design**: Built-in DoS protection, injection detection, input sanitization
- **Claim Graph Generation**: Structured relationships between claims
- **Evidence Integration**: Multiple evidence providers with reputation scoring
- **Performance**: ~500 claims/second, <10ms validation per claim

### Security Layers

#### Level 1: Input Sanitization
Automatic protections:
- Unicode normalization (NFKC)
- Control character removal
- Pattern detection (script injection, path traversal)
- Length limits (prevent resource exhaustion)
- Timeout protection (configurable limits)
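The first three protections can be approximated with the standard library. This is a minimal sketch of the idea; the real `hsp_check` sanitizer is more thorough, and `MAX_INPUT_CHARS` is an invented limit:

```python
import unicodedata

MAX_INPUT_CHARS = 10_000  # hypothetical limit for the sketch

def sanitize_input(text: str) -> str:
    # Length limit first, to bound the cost of the steps below
    if len(text) > MAX_INPUT_CHARS:
        raise ValueError("input exceeds length limit")
    # Unicode normalization (NFKC folds compatibility characters)
    text = unicodedata.normalize("NFKC", text)
    # Strip control characters (category Cc), keeping \n and \t
    return "".join(ch for ch in text
                   if ch in "\n\t" or unicodedata.category(ch) != "Cc")

print(sanitize_input("ﬁle\x00name"))  # ligature folded, NUL removed: filename
```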

#### Level 2: Sentence Extraction
```python
from src.verification.hsp_check import extract_sentences

text = """
John works at Google. The company was founded in 1998.
Google's headquarters is in Mountain View, California.
"""

sentences = extract_sentences(text)
# Result: [
#   "John works at Google",
#   "The company was founded in 1998",
#   "Google's headquarters is in Mountain View, California"
# ]
```
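As a rough illustration of the behavior shown above, a naive splitter might look like this. It is illustrative only, not the actual `extract_sentences` implementation, which handles abbreviations and other edge cases:

```python
import re

def naive_extract_sentences(text: str) -> list[str]:
    # Split on sentence-ending punctuation followed by whitespace, then
    # strip surrounding whitespace and trailing periods, mirroring the
    # example output above. Toy version: abbreviations will confuse it.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip().rstrip(".") for p in parts if p.strip()]

text = "John works at Google. The company was founded in 1998."
print(naive_extract_sentences(text))
# ['John works at Google', 'The company was founded in 1998']
```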

#### Level 3: Claim Building
```python
from src.verification.hsp_check import build_claims

claims = build_claims(sentences)
# Result: List[Claim] with extracted entities and relationships
# Each claim has:
# - claim_text: str
# - entities: List[Entity]  # Extracted named entities
# - claim_type: str         # factual, relational, temporal, etc.
# - confidence: float       # Initial confidence score
```

#### Level 4: Claim Validation
```python
from src.verification.hsp_check import validate_claims

report = await validate_claims(claims)
# Result: HSPReport with:
# - validated_claims: List[ValidatedClaim]
# - overall_confidence: float
# - validation_failures: List[str]
# - evidence_summary: Dict[str, Any]
```

### Custom Evidence Provider

```python
import time

from src.verification.hsp_check import HSPChecker, EvidenceProvider, Evidence

class CustomEvidenceProvider(EvidenceProvider):
    async def get_evidence(self, claim):
        evidence = Evidence(
            source="your_knowledge_base",
            content="Supporting evidence text",
            confidence=0.85,
            timestamp=str(time.time()),
            provenance_hash="your_hash_here"
        )
        return [evidence]

    def get_reputation_score(self):
        return 0.9  # Provider reputation (0.0-1.0)

# Use custom provider
checker = HSPChecker(CustomEvidenceProvider())
report = await checker.validate_claims(claims)
```

### Hierarchical Validation Levels

```
Level 1: Content Filtering (REQUIRED)
   └─> Harmful content detection
   └─> PII identification and redaction
   └─> Inappropriate content filtering

Level 2: Context-Aware Safety (RECOMMENDED)
   └─> Domain-specific validation
   └─> Relationship verification
   └─> Temporal consistency checks

Level 3: Domain-Specific Safety (OPTIONAL)
   └─> Industry-specific rules
   └─> Compliance requirements
   └─> Custom validation logic

Level 4: Human Escalation (EDGE CASES)
   └─> Ambiguous claims
   └─> Low-confidence assertions
   └─> Contradictory evidence
```
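As a rough illustration of how these levels compose, a hypothetical router might decide which levels a claim needs. Names and thresholds here are invented for the sketch, not part of `hsp_check`:

```python
def required_levels(claim_confidence: float, domain_rules: bool,
                    contradictory_evidence: bool) -> list[int]:
    # Level 1 content filtering always runs; Level 2 is the default
    # recommendation; Level 3 only when domain rules exist; Level 4
    # when a claim is low-confidence or contradicted by evidence.
    levels = [1, 2]
    if domain_rules:
        levels.append(3)
    if claim_confidence < 0.5 or contradictory_evidence:
        levels.append(4)
    return levels

print(required_levels(0.4, domain_rules=False, contradictory_evidence=False))
# [1, 2, 4]
```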

### Performance Targets
- **Throughput**: ~500 claims/second
- **Latency**: <10ms per claim validation
- **Security**: Built-in DoS protection, injection detection
- **Accuracy**: ≥95% entity extraction accuracy

### When to Use HSP
- ✅ Multi-sentence generated content
- ✅ Entity-heavy claims (names, organizations, places)
- ✅ Safety-critical applications
- ✅ PII detection and redaction
- ✅ High-volume claim validation
- ❌ Single atomic claims (use CoVe instead)
- ❌ Non-factual content (opinions, creative writing)

**Implementation Location**: `/src/verification/hsp_check.py`
**Documentation**: `/docs/public/verification/hsp-checking.md`

---

## 📚 RAG Grounding

### Purpose
Retrieve relevant evidence passages from Wolf's corpus to ground claims in contextual evidence.

### Core Functions

#### Retrieve Evidence
```python
from lib.rag import retrieve, format_context

# Retrieve top-k relevant passages
passages = retrieve(
    query="security best practices",
    n=5,
    min_score=0.1
)

# Format with citations
grounded_context = format_context(
    passages,
    max_chars=1200,
    citation_format='[Source: {id}]'
)
```
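To make the `max_chars` and `citation_format` parameters concrete, here is a sketch of what a formatter like `format_context` might do. The passage shape (`id`/`text` dict) is an assumption for illustration; the real passage objects may differ:

```python
def sketch_format_context(passages, max_chars=1200,
                          citation_format='[Source: {id}]'):
    # Append a citation marker to each passage and stop adding
    # passages once the character budget would be exceeded.
    chunks = []
    total = 0
    for p in passages:
        chunk = f"{p['text']} {citation_format.format(id=p['id'])}"
        if total + len(chunk) > max_chars:
            break
        chunks.append(chunk)
        total += len(chunk)
    return "\n\n".join(chunks)

passages = [{"id": "adr-043", "text": "Verify before delivery."}]
print(sketch_format_context(passages))
# Verify before delivery. [Source: adr-043]
```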

### Integration with wolf-core-ip MCP

```typescript
// RAG Retrieve - Get relevant evidence
const retrieval = await rag_retrieve(
  'security best practices',
  { k: 5, min_score: 0.1, corpus_path: './corpus' },
  sessionContext
);

// RAG Format - Format with citations
const formatted = rag_format_context(
  retrieval.passages,
  { max_chars: 1200, citation_format: '[Source: {id}]' },
  sessionContext
);

// Check Confidence - Multi-signal calibration
const confidence = check_confidence(
  {
    model_confidence: 0.75,
    evidence_count: retrieval.passages.length,
    complexity: 0.6,
    high_stakes: false
  },
  sessionContext
);

// Decision based on confidence
if (confidence.recommendation === 'proceed') {
  // Use response with RAG grounding
} else if (confidence.recommendation === 'abstain') {
  // Need more evidence, retrieve again with relaxed threshold
} else {
  // Low confidence, escalate to human review
}
```

### Performance Targets
- **rag_retrieve** (k=5): <200ms (actual: ~3-5ms)
- **rag_format_context**: <50ms (actual: ~0.2-0.5ms)
- **check_confidence**: <50ms (actual: ~0.05-0.1ms)
- **Full Pipeline**: <300ms (actual: ~10-20ms)

### Quality Metrics
- **Citation Coverage**: ≥90% of claims have supporting passages
- **Retrieval Precision**: ≥80% of retrieved passages are relevant
- **Calibration Error**: <0.1 target
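The citation coverage metric above is a simple ratio; a minimal helper (illustrative, not a library function) makes the target checkable:

```python
def citation_coverage(claims_with_support: int, total_claims: int) -> float:
    # Fraction of claims backed by at least one retrieved passage.
    if total_claims == 0:
        return 1.0  # no claims to cover, vacuously satisfied
    return claims_with_support / total_claims

coverage = citation_coverage(9, 10)
print(f"{coverage:.0%}")  # 90%
assert coverage >= 0.90    # meets the ≥90% target
```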

### When to Use RAG
- ✅ Claims requiring Wolf-specific context
- ✅ Technical documentation references
- ✅ Architecture decision lookups
- ✅ Best practice queries
- ✅ Code pattern searches
- ❌ General knowledge facts (use CoVe with external sources)
- ❌ Real-time events (corpus may be stale)

**Implementation Location**: `servers/wolf-core-ip/tools/rag/`
**MCP Access**: `mcp__wolf-core-ip__rag_retrieve`, `mcp__wolf-core-ip__rag_format_context`

---

## ✅ Verification-First Pattern

### Purpose
Generate a verification checklist BEFORE creating the response to reduce hallucination and improve fact-checking.

### Why Verification-First?

Traditional prompting:
```
User Query → Generate Response → Verify Response (maybe)
```

Verification-first prompting:
```
User Query → Generate Verification Checklist → Use Checklist to Guide Response → Validate Against Checklist
```

### Checklist Sections (5 required)

```python
{
    "assumptions": [
        # Implicit beliefs or prerequisites
        "User has basic knowledge of quantum computing",
        "Latest = developments in past 12 months"
    ],
    "sources": [
        # Required information sources
        "Recent quantum computing research papers",
        "Industry announcements from major tech companies",
        "Academic conference proceedings"
    ],
    "claims": [
        # Factual assertions needing validation
        "IBM achieved 127-qubit quantum processor in 2021",
        "Google demonstrated quantum supremacy in 2019",
        "Quantum computers can break RSA encryption"
    ],
    "tests": [
        # Specific verification procedures
        "Verify IBM qubit count against official announcements",
        "Cross-check Google's quantum supremacy paper",
        "Validate RSA encryption vulnerability claims"
    ],
    "open_risks": [
        # Acknowledged limitations
        "Quantum computing field evolves rapidly, information may be outdated",
        "Technical accuracy depends on source reliability",
        "Simplified explanations may omit nuances"
    ]
}
```
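Since all five sections are required, a small guard can reject incomplete checklists before they are used. This is an illustrative helper, not part of the scaffold library:

```python
REQUIRED_SECTIONS = ("assumptions", "sources", "claims", "tests", "open_risks")

def missing_sections(checklist: dict) -> list[str]:
    # A section counts as present only if it exists and is non-empty.
    return [s for s in REQUIRED_SECTIONS if not checklist.get(s)]

checklist = {"assumptions": ["..."], "sources": ["..."], "claims": ["..."]}
print(missing_sections(checklist))  # ['tests', 'open_risks']
```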

### Basic Usage

```python
from src.verification.checklist_scaffold import create_verification_scaffold

# Initialize verification-first mode
scaffold = create_verification_scaffold(verification_first=True)

# Generate checklist BEFORE responding
user_prompt = "Explain the latest developments in quantum computing"
checklist_result = scaffold.generate_checklist(user_prompt)

# Use checklist to guide response generation
print(checklist_result["checklist"])
```

### Integration with Verification Pipeline

```python
# Step 1: Generate checklist
checklist = scaffold.generate_checklist(user_prompt)

# Step 2: Use CoVe to verify claims from checklist
for claim in checklist["claims"]:
    cove_report = await verify_claim(claim)
    if cove_report.overall_confidence < 0.7:
        print(f"Low confidence claim: {claim}")

# Step 3: Use HSP for safety validation
sentences = extract_sentences(generated_response)
claims = build_claims(sentences)
hsp_report = await validate_claims(claims)

# Step 4: Ground in evidence with RAG
passages = retrieve(user_prompt, n=5)
grounded_context = format_context(passages)

# Step 5: Check overall confidence
confidence = check_confidence({
    "model_confidence": 0.75,
    "evidence_count": len(passages),
    "complexity": assess_complexity(user_prompt)
})

# Step 6: Decide based on confidence
if confidence.recommendation == 'proceed':
    pass  # Deliver response
elif confidence.recommendation == 'abstain':
    pass  # Mark as "needs more research", gather more evidence
else:
    pass  # Low confidence, escalate to human review
```

### When to Use Verification-First
- ✅ **REQUIRED**: Any factual claims about historical events, dates, names
- ✅ **REQUIRED**: Technical specifications, API documentation, code behavior
- ✅ **REQUIRED**: Security recommendations or threat assessments
- ✅ **REQUIRED**: Performance claims or benchmarks
- ✅ **REQUIRED**: Compliance or governance statements
- ✅ **RECOMMENDED**: Complex multi-step reasoning
- ✅ **RECOMMENDED**: Novel or unusual claims
- ✅ **RECOMMENDED**: High-stakes decisions
- ❌ **OPTIONAL**: Simple code comments
- ❌ **OPTIONAL**: Internal notes or drafts
- ❌ **OPTIONAL**: Exploratory research (clearly marked as such)

**Implementation Location**: `/src/verification/checklist_scaffold.py`
**Documentation**: `/docs/public/verification/verification-first.md`

---

## 🔄 Integrated Verification Workflow

### Complete Self-Verification Pipeline

```python
# Step 1: Generate verification checklist FIRST (verification-first)
scaffold = create_verification_scaffold(verification_first=True)
checklist = scaffold.generate_checklist(task_description)

# Step 2: Create response/code/documentation
response = generate_response(task_description, checklist)

# Step 3: Verify factual claims with CoVe
factual_verification = await verify_claim("Your factual claim from response")

# Step 4: Safety check with HSP
sentences = extract_sentences(response)
claims = build_claims(sentences)
safety_check = await validate_claims(claims)

# Step 5: Ground in evidence with RAG
passages = retrieve(topic, n=5)
context = format_context(passages)

# Step 6: Check confidence
confidence = check_confidence({
    "model_confidence": 0.75,
    "evidence_count": len(passages),
    "complexity": assess_complexity(task_description),
    "high_stakes": is_high_stakes(task_description)
})

# Step 7: Decision
if confidence.recommendation == 'proceed':
    # Deliver output with verification report
    deliver(response, verification_report={
        "cove": factual_verification,
        "hsp": safety_check,
        "rag": passages,
        "confidence": confidence
    })
elif confidence.recommendation == 'abstain':
    # Mark as "needs more research"
    mark_for_review(response, reason="Medium confidence, need more evidence")
else:
    # Escalate to human review
    escalate(response, reason="Low confidence, human review required")
```

### Confidence Calibration

```python
# Multi-signal confidence assessment
confidence_signals = {
    "model_confidence": 0.75,  # LLM's self-reported confidence
    "evidence_count": 5,       # Number of supporting passages
    "complexity": 0.6,         # Task complexity (0-1)
    "high_stakes": False       # Is this safety-critical?
}

result = check_confidence(confidence_signals)

# Result recommendations:
# - 'proceed': High confidence (>0.8), use response
# - 'abstain': Medium confidence (0.5-0.8), need more evidence
# - 'escalate': Low confidence (<0.5), human review required
```
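The thresholds above can be sketched as a toy calibrator. This is a hypothetical approximation with invented penalty factors; the real `check_confidence` performs multi-signal calibration, not this exact formula:

```python
def sketch_check_confidence(signals: dict) -> str:
    # Toy combination: dampen model confidence when evidence is thin,
    # the task is complex, or the stakes are high.
    score = signals["model_confidence"]
    score *= min(1.0, signals["evidence_count"] / 3)  # thin-evidence penalty
    score *= 1.0 - 0.3 * signals["complexity"]        # complexity penalty
    if signals.get("high_stakes"):
        score *= 0.8                                  # extra caution
    if score > 0.8:
        return "proceed"
    if score >= 0.5:
        return "abstain"
    return "escalate"

print(sketch_check_confidence({
    "model_confidence": 0.75, "evidence_count": 5,
    "complexity": 0.6, "high_stakes": False
}))  # abstain
```

With the example signals, the damped score lands between 0.5 and 0.8, so the recommendation is 'abstain': gather more evidence before delivering.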

---

## Hallucination Detection & Elimination

### Detection

```typescript
import { detect } from './hallucination/detect';

const detection = await detect(generated_text, sessionContext);

// Returns:
// {
//   hallucinations_found: number,
//   hallucination_types: string[],  // e.g., ["factual_error", "unsupported_claim"]
//   confidence: number,
//   details: Array<{ text: string, type: string, severity: string }>
// }
```

### Elimination

```typescript
import { dehallucinate } from './hallucination/dehallucinate';

if (detection.hallucinations_found > 0) {
  const cleaned = await dehallucinate(
    generated_text,
    detection,
    sessionContext
  );
  // Use cleaned output instead
}
```

**Implementation Location**: `servers/wolf-core-ip/tools/hallucination/`

---

## Best Practices

### CoVe Best Practices
- ✅ Break complex claims into atomic steps
- ✅ Establish step dependencies clearly
- ✅ Use appropriate step types (factual, logical, etc.)
- ✅ Provide evidence sources for factual steps
- ❌ Don't create circular dependencies
- ❌ Don't combine multiple claims in one step

### HSP Best Practices
- ✅ Process text sentence-by-sentence
- ✅ Leverage built-in security features
- ✅ Configure timeouts for production
- ✅ Monitor validation failure rates
- ❌ Don't bypass input sanitization
- ❌ Don't ignore security warnings

### RAG Best Practices
- ✅ Set an appropriate minimum score threshold
- ✅ Retrieve enough passages (k=5-10)
- ✅ Format with clear citations
- ✅ Check confidence before using
- ❌ Don't retrieve too few passages (k<3)
- ❌ Don't ignore low relevance scores

### Verification-First Best Practices
- ✅ Generate checklist BEFORE responding
- ✅ Use checklist to guide response structure
- ✅ Validate response against checklist
- ✅ Include open risks and limitations
- ❌ Don't skip checklist generation for "simple" tasks
- ❌ Don't ignore checklist during response creation

---

## Performance Summary

| Component | Target | Actual | Status |
|-----------|--------|--------|--------|
| CoVe (5 steps) | <200ms | ~150ms | ✅ |
| HSP (per claim) | <10ms | ~2ms | ✅ |
| RAG retrieve (k=5) | <200ms | ~3-5ms | ✅ |
| RAG format | <50ms | ~0.2-0.5ms | ✅ |
| Confidence check | <50ms | ~0.05-0.1ms | ✅ |
| Full pipeline | <300ms | ~10-20ms | ✅ |

---

## Red Flags - STOP

If you catch yourself thinking:

- ❌ **"This is low-stakes, no need for verification"** - STOP. Unverified claims compound. All factual claims need verification.
- ❌ **"I'll verify after I finish the response"** - NO. Use the Verification-First pattern. Generate the checklist BEFORE responding.
- ❌ **"The model is confident, that's good enough"** - Wrong. Model confidence ≠ factual accuracy. Always verify with external evidence.
- ❌ **"Verification is too slow for this deadline"** - False. The full pipeline averages <20ms. Verification saves time by preventing rework.
- ❌ **"I'll skip CoVe and just use RAG"** - NO. Each layer serves a different purpose. CoVe = atomic facts, RAG = Wolf context, HSP = safety.
- ❌ **"This is just internal documentation, no need to verify"** - Wrong. Incorrect internal docs are worse than no docs. Verify anyway.
- ❌ **"Verification is optional for exploration"** - If generating factual claims, verification is MANDATORY. Mark speculation explicitly.

**STOP. Use verification tools BEFORE claiming anything is factually accurate.**

## After Using This Skill

**VERIFICATION IS CONTINUOUS** - This skill is called DURING work, not after.

### When Verification Happens

**Called by wolf-governance:**
- During Definition of Done validation
- As part of quality gate assessment
- Before merge approval

**Called by wolf-roles:**
- During implementation checkpoints
- Before PR creation
- As a continuous validation loop

**Called by wolf-archetypes:**
- When the security lens is applied (HSP required)
- When research-prototyper needs evidence
- When reliability-fixer validates root cause

### Integration Points

**1. With wolf-governance (Primary Caller)**
- **When**: Before declaring work complete
- **Why**: Verification is part of Definition of Done
- **How**:
  ```javascript
  // Governance checks if verification passed
  mcp__wolf-core-ip__check_confidence({
    model_confidence: 0.75,
    evidence_count: passages.length,
    complexity: 0.6,
    high_stakes: false
  })
  ```
- **Gate**: Cannot claim DoD complete without verification evidence

**2. With wolf-roles (Continuous Validation)**
- **When**: During implementation at checkpoints
- **Why**: Prevents late-stage verification failures
- **How**: Use the verification-first pattern for each claim
- **Example**: coder-agent verifies API docs are accurate before committing

**3. With wolf-archetypes (Lens-Driven)**
- **When**: Security or research archetypes selected
- **Why**: Specialized verification requirements
- **How**:
  - Security-hardener → HSP for safety validation
  - Research-prototyper → CoVe for fact-checking
  - Reliability-fixer → Verification of root cause analysis

### Verification Checklist

Before claiming verification is complete:

- [ ] Generated verification checklist FIRST (verification-first pattern)
- [ ] Used appropriate verification layer:
  - [ ] CoVe for factual claims
  - [ ] HSP for safety validation
  - [ ] RAG for Wolf-specific context
- [ ] Checked confidence scores:
  - [ ] Overall confidence ≥0.8 for proceed
  - [ ] 0.5-0.8 = needs more evidence (abstain)
  - [ ] <0.5 = escalate to human review
- [ ] Documented verification results in journal
- [ ] Provided evidence sources for claims
- [ ] Identified and documented open risks

**Can't check all boxes? Verification incomplete. Return to this skill.**

### Verification Examples

#### Example 1: Feature Implementation

```yaml
Scenario: Coder-agent implementing user authentication

Verification-First:
  Step 1: Generate checklist BEFORE coding
    - Assumptions: User has email, password requirements known
    - Sources: OAuth 2.0 spec, bcrypt documentation
    - Claims: bcrypt is secure for password hashing
    - Tests: Verify bcrypt parameters against OWASP recommendations
    - Open Risks: Password requirements may need to evolve

  Step 2: Implement with checklist guidance

  Step 3: Verify claims with CoVe
    - Claim: "bcrypt is recommended by OWASP for password hashing"
    - CoVe verification: ✅ Confidence 0.95
    - Evidence: [OWASP Cheat Sheet, bcrypt documentation]

  Step 4: Check overall confidence
    - Model confidence: 0.85
    - Evidence count: 3 passages
    - Complexity: 0.4 (low)
    - Result: 'proceed' ✅

Assessment: Verified implementation, safe to proceed
```

#### Example 2: Security Review (Bad)

```yaml
Scenario: Security-agent reviewing authentication without verification

❌ What went wrong:
  - Skipped verification-first checklist generation
  - Assumed encryption was correct without verifying
  - No HSP safety validation performed
  - No evidence retrieved for claims

❌ Result:
  - Claimed "authentication is secure" without evidence
  - Missed hardcoded secrets (HSP would have caught)
  - Missed deprecated crypto usage (CoVe would have caught)
  - High confidence but no verification = hallucination

Correct Approach:
  1. Generate verification checklist for security claims
  2. Use HSP to scan for secrets, PII, unsafe patterns
  3. Use CoVe to verify crypto library recommendations
  4. Use RAG to ground in Wolf security best practices
  5. Check confidence before approving
  6. Document verification evidence in journal
```

### Performance vs Quality Trade-offs

**Verification is NOT slow:**
- Full pipeline: <20ms average
- CoVe (5 steps): ~150ms
- HSP (per claim): ~2ms
- RAG (k=5): ~3-5ms

**Cost of skipping verification:**
- Merge rejected due to factual errors: hours of rework
- Security vulnerability shipped: days to patch + incident response
- Documentation errors: weeks of support burden + reputation damage

**Verification is an investment, not overhead.**

## Related Skills

- **wolf-principles**: Evidence-based decision making principle (#5)
- **wolf-governance**: Quality gate requirements (verification is a DoD item)
- **wolf-roles**: Roles call verification at checkpoints
- **wolf-archetypes**: Lenses determine verification requirements
- **wolf-adr**: ADR-043 (Verification Architecture)
- **wolf-scripts-core**: Evidence validation patterns

## Integration with Other Skills

**Primary Chain Position**: Called DURING work (not after)

```
wolf-principles → wolf-archetypes → wolf-governance → wolf-roles
                                         ↓
                        wolf-verification (YOU ARE HERE)
                                         ↓
                Continuous validation throughout implementation
```

**You are a supporting skill that enables quality:**
- Governance depends on you for evidence
- Roles depend on you for confidence
- Archetypes depend on you for lens validation

**DO NOT wait until the end to verify. Verify continuously.**

---

**Total Components**: 4 (CoVe, HSP, RAG, Verification-First)
**Test Coverage**: ≥90% required (achieved: 98%+)
**Production Status**: Active in Phase 50+

**Last Updated**: 2025-11-14
**Phase**: Superpowers Skill-Chaining Enhancement v2.0.0
**Version**: 1.1.0