# Error Handling and Recovery for Subagents

## Common Failure Patterns

Industry research identifies these failure patterns:

**32% of failures**: Subagents don't know what to do.

**Causes**:
- Vague or incomplete role definition
- Missing workflow steps
- Unclear success criteria
- Ambiguous constraints

**Symptoms**: Subagent needs clarifying questions it cannot ask, makes incorrect assumptions, produces partial outputs, or fails to complete the task.

**Prevention**: Explicit role, workflow, success criteria, and constraints sections in the prompt.

**28% of failures**: Coordination breakdowns in multi-agent workflows.

**Causes**:
- Subagents have conflicting objectives
- Handoff points unclear
- No shared context or state
- Assumptions about other agents' outputs

**Symptoms**: Duplicate work, contradictory outputs, infinite loops, tasks falling through cracks.

**Prevention**: Clear orchestration patterns (see [orchestration-patterns.md](orchestration-patterns.md)), explicit handoff protocols.

**24% of failures**: Nobody checks quality.

**Causes**:
- No validation step in workflow
- Missing output format specification
- No error detection logic
- Blind trust in subagent outputs

**Symptoms**: Incorrect results silently propagated, hallucinations undetected, format errors break downstream processes.

**Prevention**: Include verification steps in subagent workflows, validate outputs before use, implement evaluator agents.

**Critical pattern**: Failures in one subagent propagate to others.

**Causes**:
- No error handling in downstream agents
- Assumptions that upstream outputs are valid
- No circuit breakers or fallbacks

**Symptoms**: Single failure causes entire workflow to fail.

**Prevention**: Defensive programming in subagent prompts, graceful degradation strategies, validation at boundaries.

**Inherent challenge**: Same prompt can produce different outputs.

**Causes**:
- LLM sampling and temperature settings
- API latency variations
- Context window ordering effects

**Symptoms**: Inconsistent behavior across invocations, tests pass sometimes and fail other times.

**Mitigation**: Lower temperature for consistency-critical tasks, comprehensive testing to identify variation patterns, robust validation.

## Recovery Patterns

**Pattern**: Workflow produces useful result even when ideal path fails.

```markdown
1. Attempt to fetch latest API documentation from web
2. If fetch fails, use cached documentation (flag as potentially outdated)
3. If no cache available, use local stub documentation (flag as incomplete)
4. Generate code with best available information
5. Add TODO comments indicating what should be verified

- Primary: Live API docs (most accurate)
- Secondary: Cached docs (may be stale, flag date)
- Tertiary: Stub docs (minimal, flag as incomplete)
- Always: Add verification TODOs to generated code
```

**Key principle**: Partial success better than total failure. Always produce something useful.

**Pattern**: Subagent retries failed operations with exponential backoff.

```markdown
When a tool call fails:
1. Attempt operation
2. If fails, wait 1 second and retry
3. If fails again, wait 2 seconds and retry
4. If fails third time, proceed with fallback approach
5. Document the failure in output

Maximum 3 retry attempts before falling back.
```

**Use case**: Transient failures (network issues, temporary file locks, rate limits).

**Anti-pattern**: Infinite retry loops without backoff or max attempts.
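The same policy is a few lines when the retry happens in orchestration code rather than in the prompt. A minimal Python sketch, assuming a hypothetical `TransientError` marks retryable failures and a caller-supplied `fallback` documents the failure in the output:

```python
import time

class TransientError(Exception):
    """Stand-in for a retryable failure (network issue, file lock, rate limit)."""

def with_retries(operation, fallback, max_attempts=3, base_delay=1.0):
    """Run operation, retrying transient failures with exponential backoff.

    After max_attempts failures, hand the last error to fallback so the
    workflow still produces something and the failure is documented.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError as err:
            last_error = err
            if attempt < max_attempts - 1:
                time.sleep(base_delay * (2 ** attempt))  # waits 1s, then 2s, ...
    # Retries exhausted: fall back instead of looping forever.
    return fallback(last_error)

# Hypothetical usage: with_retries(lambda: fetch_docs(url), fallback=use_cached_docs)
```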
**Pattern**: Prevent cascading failures by stopping calls to failing components.

```markdown
If API endpoint has failed 5 consecutive times:
- Stop calling the endpoint (circuit "open")
- Use fallback data source
- After 5 minutes, attempt one call (circuit "half-open")
- If succeeds, resume normal calls (circuit "closed")
- If fails, keep circuit open for another 5 minutes
```

**Application to subagents**: Include in prompt when subagent calls external APIs or services.

**Benefit**: Prevents wasting time/tokens on operations known to be failing.

**Pattern**: Agents going silent shouldn't block workflow indefinitely.

```markdown
For long-running operations:
1. Set reasonable timeout (e.g., 2 minutes for analysis)
2. If operation exceeds timeout:
   - Abort operation
   - Provide partial results if available
   - Clearly flag as incomplete
   - Suggest manual intervention
```

**Note**: Claude Code has built-in timeouts for tool calls. Subagent prompts should include guidance on what to do when operations approach reasonable time limits.

**Pattern**: Different validators catch different error types.

```markdown
After generating code:
1. Syntax check: Parse code to verify valid syntax
2. Type check: Run static type checker (if applicable)
3. Linting: Check for common issues and anti-patterns
4. Security scan: Check for obvious vulnerabilities
5. Test run: Execute tests if available

If any check fails, fix issue and re-run all checks.
Each check catches different error types.
```

**Benefit**: Layered validation catches more issues than a single validation pass.

**Pattern**: Invoke alternative agents or escalate to a human when the primary approach fails.

```markdown
If automated fix fails after 2 attempts:
1. Document what was tried and why it failed
2. Provide diagnosis of the problem
3. Recommend human review with specific questions to investigate
4. DO NOT continue attempting automated fixes that aren't working

Know when to escalate rather than thrashing.
```

**Key insight**: Subagents should recognize their limitations and provide useful handoff information.

## Structured Communication

Multi-agent systems fail when communication is ambiguous. Structured messaging prevents misunderstandings. Every message between agents (or from agent to user) should have an explicit type:

**Request**: Asking for something

```markdown
Type: Request
From: code-reviewer
To: test-writer
Task: Create tests for authentication module
Context: Recent security review found gaps in auth testing
Expected output: Comprehensive test suite covering auth edge cases
```

**Inform**: Providing information

```markdown
Type: Inform
From: debugger
To: Main chat
Status: Investigation complete
Findings: Root cause identified in line 127, race condition in async handler
```

**Commit**: Promising to do something

```markdown
Type: Commit
From: security-reviewer
Task: Review all changes in PR #342 for security issues
Deadline: Before responding to main chat
```

**Reject**: Declining a request with a reason

```markdown
Type: Reject
From: test-writer
Reason: Cannot write tests - no testing framework configured in project
Recommendation: Install Jest or similar framework first
```

**Pattern**: Validate every payload against the expected schema.

```markdown
Expected output format:
{
  "vulnerabilities": [
    {
      "severity": "Critical|High|Medium|Low",
      "location": "file:line",
      "type": "string",
      "description": "string",
      "fix": "string"
    }
  ],
  "summary": "string"
}

Before returning output:
1. Verify JSON is valid
2. Check all required fields present
3. Validate severity values are from allowed list
4. Ensure location follows "file:line" format
```

**Benefit**: Prevents malformed outputs from breaking downstream processes.
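The same checks can also run in orchestration code before the payload is handed downstream. A minimal Python sketch against the example schema above (field names mirror the example and would need adapting to your actual payload):

```python
import json

ALLOWED_SEVERITIES = {"Critical", "High", "Medium", "Low"}
REQUIRED_FIELDS = {"severity", "location", "type", "description", "fix"}

def validate_report(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the payload is usable."""
    try:
        data = json.loads(raw)                              # 1. Verify JSON is valid
    except json.JSONDecodeError as err:
        return [f"Invalid JSON: {err}"]
    problems = []
    for key in ("vulnerabilities", "summary"):              # 2. Required top-level fields
        if key not in data:
            problems.append(f"Missing top-level field: {key}")
    for i, vuln in enumerate(data.get("vulnerabilities", [])):
        if not isinstance(vuln, dict):
            problems.append(f"vulnerabilities[{i}] is not an object")
            continue
        missing = REQUIRED_FIELDS - vuln.keys()             # 2. Required entry fields
        if missing:
            problems.append(f"vulnerabilities[{i}] missing: {sorted(missing)}")
        if vuln.get("severity") not in ALLOWED_SEVERITIES:  # 3. Allowed severity values
            problems.append(f"vulnerabilities[{i}] has invalid severity")
        if ":" not in str(vuln.get("location", "")):        # 4. Rough "file:line" check
            problems.append(f"vulnerabilities[{i}] location is not file:line")
    return problems
```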
## Observability and Logging

"Most agent failures are not model failures, they are context failures."

**What to log**:
- Input prompts and parameters
- Tool calls and their results
- Intermediate reasoning (if visible)
- Final outputs
- Metadata (timestamps, model version, token usage, latency)
- Errors and warnings

**Log structure**:

```markdown
Invocation ID: abc-123-def
Timestamp: 2025-11-15T14:23:01Z
Subagent: security-reviewer
Model: sonnet-4.5
Input: "Review changes in commit a3f2b1c"
Tool calls:
1. git diff a3f2b1c (success, 234 lines)
2. Read src/auth.ts (success, 156 lines)
3. Read src/db.ts (success, 203 lines)
Output: 3 vulnerabilities found (2 High, 1 Medium)
Tokens: 2,341 input, 876 output
Latency: 4.2s
Status: Success
```

**Use case**: Debugging failures, identifying patterns, performance optimization.

**Pattern**: Track every message, plan, and tool call for end-to-end reconstruction.

```markdown
Correlation ID: workflow-20251115-abc123

Main chat [abc123]:
→ Launched code-reviewer [abc123-1]
  → Tool: git diff [abc123-1-t1]
  → Tool: Read auth.ts [abc123-1-t2]
  → Returned: 3 issues found
→ Launched test-writer [abc123-2]
  → Tool: Read auth.ts [abc123-2-t1]
  → Tool: Write auth.test.ts [abc123-2-t2]
  → Returned: Test suite created
→ Presented results to user
```

**Benefit**: Can trace entire workflow execution, identify where failures occurred, understand cascading effects.
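Collecting these traces is straightforward when the orchestration layer writes one structured record per event. A minimal Python sketch, assuming a hypothetical `agent_trace.jsonl` log file; the correlation ID and span format mirror the example above:

```python
import json
import time
import uuid

def new_correlation_id(prefix: str = "workflow") -> str:
    """One ID per workflow run, e.g. 'workflow-20251115-3f2a1c'."""
    return f"{prefix}-{time.strftime('%Y%m%d')}-{uuid.uuid4().hex[:6]}"

def log_event(correlation_id: str, span: str, event: str, **fields) -> None:
    """Append one structured record per subagent launch, tool call, or result."""
    record = {
        "correlation_id": correlation_id,
        "span": span,  # e.g. "abc123-1-t2" identifies a tool call within a subagent
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "event": event,
        **fields,
    }
    with open("agent_trace.jsonl", "a", encoding="utf-8") as log:
        log.write(json.dumps(record) + "\n")

# Example: one correlation ID per run, span suffixes per subagent and tool call.
run_id = new_correlation_id()
log_event(run_id, span=f"{run_id}-1", event="subagent_launched", name="code-reviewer")
log_event(run_id, span=f"{run_id}-1-t1", event="tool_call", tool="git diff", status="success")
```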
**Key metrics to track**:
- Success rate (completed tasks / total invocations)
- Error rate by error type
- Average token usage (spikes indicate prompt issues)
- Latency trends (increases suggest inefficiency)
- Tool call patterns (unusual patterns indicate problems)
- Retry rates (how often users re-invoke after failure)

**Alert thresholds**:
- Success rate drops below 80%
- Error rate exceeds 15%
- Token usage increases >50% without prompt changes
- Latency exceeds 2x baseline
- Same error type occurs >5 times in 24 hours

**Pattern**: Dedicated quality guardrail agents validate outputs.

```markdown
---
name: output-validator
description: Validates subagent outputs against expected schemas and quality criteria. Use after any subagent produces structured output.
tools: Read
model: haiku
---

You are an output validation specialist.

Check subagent outputs for:
- Schema compliance
- Completeness
- Internal consistency
- Format correctness

1. Receive subagent output and expected schema
2. Validate structure matches schema
3. Check for required fields
4. Verify value constraints (enums, formats, ranges)
5. Test internal consistency (references valid, no contradictions)
6. Return validation report: Pass/Fail with specific issues

Pass: All checks succeed
Fail: Any check fails - provide detailed error report
Partial: Minor issues that don't prevent use - flag warnings
```

**Use case**: Critical workflows where output quality is essential, high-risk operations, compliance requirements.

## Anti-Patterns

❌ Subagent fails but doesn't indicate failure in output

**Example**:

```markdown
Task: Review 10 files for security issues
Reality: Only reviewed 3 files due to errors, returned results anyway
Output: "No issues found" (incomplete review, but looks successful)
```

**Fix**: Explicitly state what was reviewed, flag partial completion, include error summary.

❌ When ideal path fails, subagent gives up entirely

**Example**:

```markdown
Task: Generate code from API documentation
Error: API docs unavailable
Output: "Cannot complete task, API docs not accessible"
```

**Better**:

```markdown
Error: API docs unavailable
Fallback: Using cached documentation (last updated: 2025-11-01)
Output: Code generated with note: "Verify against current API docs, using cached version"
```

**Principle**: Provide best possible output given constraints, clearly flag limitations.

❌ Retrying failed operations without backoff or limit

**Risk**: Wastes tokens, time, and may hit rate limits.

**Fix**: Maximum retry count (typically 2-3), exponential backoff, fallback after exhausting retries.

❌ Downstream agents assume upstream outputs are valid

**Example**:

```markdown
Agent 1: Generates code (contains syntax error)
  ↓
Agent 2: Writes tests (assumes code is syntactically valid, tests fail)
  ↓
Agent 3: Runs tests (all tests fail due to syntax error in code)
  ↓
Total workflow failure from single upstream error
```

**Fix**: Each agent validates inputs before processing, includes error handling for invalid inputs.

❌ Error messages without diagnostic context

**Bad**: "Failed to complete task"

**Good**: "Failed to complete task: Unable to access file src/auth.ts (file not found). Attempted to review authentication code but file missing from expected location. Recommendation: Verify file path or check if file was moved/deleted."

**Principle**: Error messages should help diagnose root cause and suggest remediation.

## Error Handling Checklist

Include these patterns in subagent prompts:

**Error detection**:
- [ ] Validate inputs before processing
- [ ] Check tool call results for errors
- [ ] Verify outputs match expected format
- [ ] Test assumptions (file exists, data valid, etc.)

**Recovery mechanisms**:
- [ ] Define fallback approach for primary path failure
- [ ] Include retry logic for transient failures
- [ ] Graceful degradation (partial results better than none)
- [ ] Clear error messages with diagnostic context

**Failure communication**:
- [ ] Explicitly state when task cannot be completed
- [ ] Explain what was attempted and why it failed
- [ ] Provide partial results if available
- [ ] Suggest remediation or next steps

**Quality gates**:
- [ ] Validation steps before returning output
- [ ] Self-checking (does output make sense?)
- [ ] Format compliance verification
- [ ] Completeness check (all required components present?)
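As a closing illustration, several of the failure-communication and quality-gate items can be made concrete as a structured result envelope returned by orchestration code. A minimal sketch; the field names and file paths are illustrative assumptions, not a Claude Code API:

```python
from dataclasses import dataclass, field

@dataclass
class SubagentResult:
    """Illustrative envelope: explicit status, partial results, and diagnostics."""
    status: str                                            # "success" | "partial" | "failed"
    output: str = ""                                       # best available result, even if partial
    reviewed: list[str] = field(default_factory=list)      # what was actually covered
    skipped: list[str] = field(default_factory=list)       # what was not covered, and why
    errors: list[str] = field(default_factory=list)        # diagnostic context, not just "failed"
    next_steps: list[str] = field(default_factory=list)    # suggested remediation

# A partial completion that states its limits instead of failing silently.
result = SubagentResult(
    status="partial",
    output="Reviewed 3 of 10 files; no issues found in the files that were reviewed.",
    reviewed=["src/auth.ts", "src/db.ts", "src/api.ts"],
    skipped=["7 files skipped after repeated read errors"],
    errors=["Read src/session.ts failed: file not found"],
    next_steps=["Verify file paths, then re-run the review on the skipped files"],
)
```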