Context Window Protection

"200k context window is plenty. You're just stuffing a single agent with too much work. Don't force your agent to context switch."

Context window protection is about managing your agent's most precious resource: attention. A focused agent is a performant agent.

The Core Problem

Every engineer hits this wall:

Agent starts:  10k tokens (5% used)
           ↓
After exploration:  80k tokens (40% used)
           ↓
After planning:  120k tokens (60% used)
           ↓
During implementation:  170k tokens (85% used) ⚠️
           ↓
Context explodes:  195k tokens (98% used) ❌
           ↓
Performance degrades; the agent fails or times out

The realization: More context ≠ better performance. Too much context = cognitive overload.
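The progression above can be sketched as a tiny budget tracker. This is a minimal sketch with illustrative thresholds and phase numbers, not part of any real tooling:

```python
# Minimal sketch: classify context health as usage grows toward the window.
# Phase numbers mirror the progression above; thresholds are illustrative.

WINDOW = 200_000

def usage_status(tokens_used: int, window: int = WINDOW) -> str:
    """Classify context health by the fraction of the window consumed."""
    pct = tokens_used / window
    if pct < 0.60:
        return "healthy"
    if pct < 0.90:
        return "warning"
    return "danger"

phases = {
    "start": 10_000,
    "exploration": 80_000,
    "planning": 120_000,
    "implementation": 170_000,
    "explosion": 195_000,
}

for phase, tokens in phases.items():
    print(f"{phase}: {tokens / WINDOW:.0%} -> {usage_status(tokens)}")
```

The point is not the exact cutoffs but that the check is cheap enough to run after every phase.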

The R&D Framework

There are only two ways to manage your context window:

R - REDUCE
└─→ Minimize what enters the context window

D - DELEGATE
└─→ Move work to other agents' context windows

Everything else is a tactic implementing R or D.

The Four Levels of Context Protection

Level 1: Beginner - Reduce Waste

Focus: Stop wasting tokens on unused resources

Tactic 1: Eliminate Default MCP Servers

Problem:

# Default mcp.json
{
  "mcpServers": {
    "firecrawl": {...},   # 6k tokens
    "github": {...},      # 8k tokens
    "postgres": {...},    # 5k tokens
    "redis": {...}        # 5k tokens
  }
}
# Total: 24k tokens always loaded (12% of 200k window!)

Solution:

# Option 1: Delete default mcp.json entirely
rm .claude/mcp.json

# Option 2: Load selectively
claude-mcp-config --strict specialized-configs/firecrawl-only.json
# Result: 4k tokens instead of 24k (83% reduction)

Tactic 2: Minimize CLAUDE.md

Before:

# CLAUDE.md (23,000 tokens = 11.5% of window)
- 500 lines of API documentation
- 300 lines of deployment procedures
- 1,500 lines of coding standards
- Architecture diagrams
- Always loaded, whether relevant or not

After:

# CLAUDE.md (500 tokens = 0.25% of window)
# Only universal essentials

- Fenced code blocks MUST have language
- Use rg instead of grep
- ALWAYS use set -euo pipefail

Rule: Only include what you're 100% sure you want loaded 100% of the time.

Tactic 3: Disable Autocompact Buffer

Problem:

/context

# Output:
autocompact buffer: 22%  ⚠️ (44k tokens gone!)
messages: 51%
system_tools: 8%
---
Total available: 78% (should be 100%)

Solution:

/config
# Set: autocompact = false

# Now:
/context
# Output:
messages: 51%
system_tools: 8%
custom_agents: 2%
---
Total available: 100% ✅ (reclaimed 22%!)

Impact: Reclaims 40k+ tokens immediately.

Level 2: Intermediate - Dynamic Loading

Focus: Load what you need, when you need it

Tactic 4: Context Priming

Replace static CLAUDE.md with task-specific /prime commands

# .claude/commands/prime.md
# General codebase context (2k tokens)
Read README, understand structure, report findings

# .claude/commands/prime-feature.md
# Feature development context (3k tokens)
Read feature requirements, understand dependencies, plan implementation

# .claude/commands/prime-api.md
# API work context (4k tokens)
Read API docs, understand endpoints, review integration patterns

Usage pattern:

# Starting feature work
/prime-feature

# vs. having 23k tokens always loaded

Savings: 20k tokens (87% reduction)

Tactic 5: Sub-Agent Delegation

Problem: Primary agent doing parallel work fills its own context

Primary Agent tries to do:
├── Web scraping (15k tokens)
├── Documentation fetch (12k tokens)
├── Data analysis (10k tokens)
└── Synthesis (5k tokens)
= 42k tokens in one agent

Solution: Delegate to sub-agents with isolated contexts

Primary Agent (9k tokens):
├→ Sub-Agent 1: Web scraping (15k tokens, isolated)
├→ Sub-Agent 2: Docs fetch (12k tokens, isolated)
└→ Sub-Agent 3: Analysis (10k tokens, isolated)

Total work: 46k tokens
Primary agent context: Only 9k tokens ✅

Example:

/load-ai-docs

# Agent spawns 10 sub-agents for web scraping
# Each scrape: ~3k tokens
# Total work: 30k tokens
# Primary agent context: Still only 9k tokens
# Savings: 21k tokens protected

Key insight: Sub-agents use system prompts (not user prompts), keeping their context isolated from primary.
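The delegation idea can be sketched in a few lines, assuming a hypothetical `run_subtask` stand-in for a real sub-agent invocation: each sub-agent accumulates tokens in its own isolated window and hands the primary only a short summary.

```python
# Sketch of sub-agent delegation: work happens in isolated contexts,
# and only small summaries ever reach the primary agent.
# run_subtask is a hypothetical stand-in, not a real agent API.
from concurrent.futures import ThreadPoolExecutor

def run_subtask(name: str, workload_tokens: int) -> dict:
    # A real sub-agent would spend `workload_tokens` in its own window;
    # the primary only ever sees this small summary dict.
    return {"task": name, "summary": f"{name} done", "tokens_spent": workload_tokens}

subtasks = [("web-scraping", 15_000), ("docs-fetch", 12_000), ("analysis", 10_000)]

with ThreadPoolExecutor() as pool:
    results = list(pool.map(lambda t: run_subtask(*t), subtasks))

total_work = sum(r["tokens_spent"] for r in results)          # delegated work
primary_cost = sum(len(r["summary"]) // 4 for r in results)   # only summaries hit the primary
print(total_work, primary_cost)
```

The asymmetry is the whole tactic: tens of thousands of tokens of work, a few dozen tokens of primary context.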

Level 3: Advanced - Multi-Agent Handoff

Focus: Chain agents together without context explosion

Tactic 6: Context Bundles

Problem: Agent 1's context explodes (180k tokens). Need to hand off to fresh Agent 2 without full replay.

Solution: Bundle 60-70% of essential context

# context-bundle-2025-01-05-<session-id>.md

## Context Bundle
Created: 2025-01-05 14:30
Source Agent: agent-abc123

## Initial Setup
/prime-feature

## Read Operations (deduplicated)
- src/api/endpoints.ts
- src/components/Auth.tsx
- config/env.ts

## Key Findings
- Auth system uses JWT
- API has 15 endpoints
- Config needs migration

## User Prompts (summarized)
1. "Implement OAuth2 flow"
2. "Add refresh token logic"

[Excluded: full write operations, detailed read contents, tool execution details]

Usage:

# Agent 1: Context exploding at 180k
# Automatic bundle saved

# Agent 2: Fresh start (10k base)
/loadbundle /path/to/context-bundle-<timestamp>.md
# Agent 2 now has 70% of Agent 1's context in ~15k tokens

# Total: 25k tokens vs. 180k (86% reduction)

Tactic 7: Composable Workflows (Scout-Plan-Build)

Problem: Single agent searching + planning + building = context explosion

Monolithic Agent:
├── Search codebase: 40k tokens
├── Read files: 60k tokens
├── Plan changes: 20k tokens
├── Implement: 30k tokens
├── Test: 15k tokens
└── Total: 165k tokens (83% used!)

Solution: Break into composable steps that delegate

/scout-plan-build workflow:

Step 1: /scout (delegates to 4 parallel sub-agents)
├→ Sub-agents search codebase: 4 × 15k = 60k total
├→ Output: relevant-files.md (5k tokens)
└→ Primary agent context: unchanged

Step 2: /plan-with-docs
├→ Reads relevant-files.md: 5k tokens
├→ Scrapes docs: 8k tokens
├→ Creates plan: 3k tokens
└→ Total added: 16k tokens

Step 3: /build
├→ Reads plan: 3k tokens
├→ Implements: 30k tokens
└→ Total added: 33k tokens

Final primary agent context: 10k + 16k + 33k = 59k tokens
Savings: 106k tokens (64% reduction)

Why this works: Scout step offloads searching from planner (R&D: Reduce + Delegate)
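The accounting in the three steps is easy to verify in a few lines; the numbers mirror the walkthrough above:

```python
# Sketch of the scout-plan-build accounting: scouting is delegated,
# so only its small output file lands in the primary context.

def primary_context_after(base, stage_costs):
    return base + sum(stage_costs)

delegated_scout_work = 4 * 15_000             # burned in sub-agent windows, not the primary
scout_output = 5_000                          # relevant-files.md
plan_stage = scout_output + 8_000 + 3_000     # read output + scrape docs + write plan
build_stage = 3_000 + 30_000                  # read plan + implement

total = primary_context_after(10_000, [plan_stage, build_stage])
print(delegated_scout_work, total)  # 60000 59000
```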

Level 4: Agentic - Out-of-Loop Systems

Focus: Agents working autonomously while you're AFK

Tactic 8: Focused Agents (One Agent, One Task)

Anti-pattern:

Super Agent (trying to do everything):
├── API development
├── UI implementation
├── Database migrations
├── Testing
├── Documentation
├── Deployment
└── Context: 170k tokens (85% used)

Pattern:

Focused Agent Fleet:
├── Agent 1: API only (30k tokens)
├── Agent 2: UI only (35k tokens)
├── Agent 3: DB only (20k tokens)
├── Agent 4: Tests only (25k tokens)
├── Agent 5: Docs only (15k tokens)
└── Each agent: <35k tokens (max 18% per agent)

Principle: "A focused engineer is a performant engineer. A focused agent is a performant agent."

Tactic 9: Deletable Agents

Pattern:

# Create agent for specific task
/create-agent docs-writer "Document frontend components"

# Agent completes task (used 30k tokens)

# DELETE agent immediately
/delete-agent docs-writer

# Result: 30k tokens freed for next agent

Lifecycle:

1. Create agent → Task-specific context loaded
2. Agent works → Context grows to completion
3. Agent completes → Context maxed out
4. DELETE agent → Context freed
5. Create new agent → Fresh start
6. Repeat

Engineering analogy: "The best code is no code at all. The best agent is a deleted agent."

Tactic 10: Background Agent Delegation

Problem: You're in the loop, waiting for agent to finish long task

Solution: Delegate to background agent, continue working

# In-loop (you wait, your context stays open)
/implement-feature "Build auth system"
# Your terminal blocked for 20 minutes
# Context accumulates: 150k tokens

# Out-of-loop (you continue working)
/background "Build auth system" \
  --model opus \
  --report agents/auth-report.md

# Background agent works independently
# Your terminal freed immediately
# Background agent context isolated
# You get notified when complete

Context protection:

  • Primary agent: 10k tokens (just manages job queue)
  • Background agent: 150k tokens (isolated, will be deleted)
  • Your interactive session: 10k tokens (protected)

Tactic 11: Orchestrator Sleep Pattern

Problem: Orchestrator observing all agent work = context explosion

Orchestrator watches everything:
├── Scout 1 work: 15k tokens observed
├── Scout 2 work: 15k tokens observed
├── Scout 3 work: 15k tokens observed
├── Planner work: 25k tokens observed
├── Builder work: 35k tokens observed
└── Orchestrator context: 105k tokens

Solution: Orchestrator sleeps while agents work

Orchestrator pattern:
1. Create scouts → 3k tokens (commands only)
2. SLEEP (not observing)
3. Wake every 15s, check status → 1k tokens
4. Scouts complete, read outputs → 5k tokens
5. Create planner → 2k tokens
6. SLEEP (not observing)
7. Wake every 15s, check status → 1k tokens
8. Planner completes, read output → 3k tokens
9. Create builder → 2k tokens
10. SLEEP (not observing)

Orchestrator final context: 17k tokens ✅
vs. 105k if watching everything (84% reduction)

Key principle: Orchestrator wakes to coordinate, sleeps while agents work.

Monitoring Context Health

The /context Command

/context

# Healthy agent (beginner level):
messages: 8%
system_tools: 5%
custom_agents: 2%
---
Total used: 15%  ✅ (85% free)

# Warning (intermediate):
messages: 45%
mcp_tools: 18%
system_tools: 5%
---
Total used: 68%  ⚠️ (32% free, approaching limits)

# Danger (needs intervention):
messages: 72%
mcp_tools: 24%
system_tools: 5%
---
Total used: 101%  ❌ (context overflow!)

Success Metrics by Level

| Level | Target Context Free | What This Enables |
| --- | --- | --- |
| Beginner | 85-90% | Basic tasks without running out |
| Intermediate | 60-75% | Complex tasks with breathing room |
| Advanced | 40-60% | Multi-step workflows without overflow |
| Agentic | 60-80% per agent | Fleet of focused agents |

Warning Signs

Your context window is in danger when:

Single agent exceeds 150k tokens

  • Solution: Split work across multiple agents

Agent needs to read >20 files

  • Solution: Use scout agents to find relevant subset

/context shows >80% used

  • Solution: Start fresh agent, use context bundles

Agent gets slower/less accurate

  • Solution: Check context usage, delegate to sub-agents

Autocompact buffer active

  • Solution: Disable it, reclaim 20%+ tokens
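The warning signs above fold into one cheap check. Thresholds follow the list; the function and its field names are illustrative:

```python
# Sketch: an automated check over the warning signs listed above.

def context_warnings(tokens_used, files_read, pct_used, autocompact_on):
    warnings = []
    if tokens_used > 150_000:
        warnings.append("split work across multiple agents")
    if files_read > 20:
        warnings.append("use scout agents to find the relevant subset")
    if pct_used > 0.80:
        warnings.append("start a fresh agent with a context bundle")
    if autocompact_on:
        warnings.append("disable autocompact to reclaim tokens")
    return warnings

print(context_warnings(160_000, 25, 0.82, True))
```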

Context Window Hard Limits

"Context window is a hard limit. We have to respect this and work around it."

The Reality

Claude Opus 200k limit:
├── System prompt: ~8k tokens (4%)
├── Available tools: ~5k tokens (2.5%)
├── MCP servers: 0-24k tokens (0-12%)
├── CLAUDE.md: 0-23k tokens (0-11.5%)
├── Custom agents: ~2k tokens (1%)
└── Available for work: 138-185k tokens (69-92.5%)

Best case (optimized): 185k available
Worst case (unoptimized): 138k available
Difference: 47k tokens (nearly a quarter of total capacity!)
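The overhead arithmetic is easy to verify; the defaults mirror the breakdown above:

```python
# Sketch: fixed overheads subtracted from the 200k window to get what
# remains for actual work. Figures mirror the breakdown above.

def available_for_work(window=200_000, system=8_000, tools=5_000,
                       mcp=0, claude_md=0, custom_agents=2_000):
    return window - system - tools - mcp - claude_md - custom_agents

best = available_for_work()                               # optimized setup
worst = available_for_work(mcp=24_000, claude_md=23_000)  # unoptimized setup
print(best, worst, best - worst)  # 185000 138000 47000
```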

Real Example from the Field

"We were 14% away from exploding our context in our scout-plan-build workflow."

Scout-Plan-Build execution:
├── Base context: 15k tokens
├── Scout work (4 sub-agents): +40k tokens
├── Planner work: +35k tokens
├── Builder work: +80k tokens
└── Total: 170k tokens

With autocompact buffer (22%):
170k / 0.78 = 218k tokens
❌ Exceeds 200k limit by 18k (9% overflow)

Without autocompact buffer:
170k / 1.0 = 170k tokens
✅ Within limits with 30k buffer (15% free)

Lesson: Every percentage point matters when approaching limits.

Common Context Explosion Patterns

Pattern 1: The Sponge Agent

Symptoms:

  • Agent reads entire codebase
  • Opens 50+ files
  • Context grows 10k tokens every few minutes

Cause: No filtering strategy

Fix:

# Before: Agent reads everything
Agent: "Analyzing codebase..."
[reads 100 files = 150k tokens]

# After: Scout first
/scout "Find files related to authentication"
# Scout outputs: 5 relevant files
Agent reads only those 5 files = 8k tokens

Pattern 2: The Accumulator

Symptoms:

  • Long conversation
  • Many tool calls
  • Context steadily grows to limit

Cause: Not resetting agent between phases

Fix:

# Phase 1: Exploration
[Agent explores, context hits 120k]

# Phase 2: Implementation
# ❌ Bad: Continue same agent (will overflow)
# ✅ Good: New agent with context bundle

/loadbundle context-from-phase-1.md
# Fresh agent (15k) + bundle (20k) = 35k tokens
# Ready for implementation without overflow

Pattern 3: The Observer

Symptoms:

  • Orchestrator context growing rapidly
  • Watching all sub-agent work
  • Can't coordinate more than 2-3 agents

Cause: Not using sleep pattern

Fix:

# ❌ Bad: Orchestrator watches everything
for agent in agents:
    result = orchestrator.watch_agent_work(agent)  # Observes all work
    orchestrator.context += result  # Context explodes

# ✅ Good: Orchestrator sleeps
for agent in agents:
    orchestrator.create_and_command(agent)
    orchestrator.sleep()  # Not observing

orchestrator.wake_and_check_status()  # Only reads summaries

The "200k is Plenty" Principle

"I'm super excited for larger effective context windows, but 200k context window is plenty. You're just stuffing a single agent with too much work."

The mindset shift:

Beginner thinking:
"I need a bigger context window"
"If only I had 500k tokens..."
"My task is too complex for 200k"

Expert thinking:
"I need better context management"
"I'm overloading a single agent"
"I should split this across focused agents"

The truth: Most context explosions are design problems, not capacity problems.

Why 200k is Sufficient

With proper protection:

Task: Refactor authentication across 50-file codebase

Approach 1 (Single Agent - fails):
├── Agent reads 50 files: 75k tokens
├── Agent plans changes: 20k tokens
├── Agent implements: 80k tokens
├── Agent tests: 30k tokens
└── Total: 205k tokens ❌ (overflow by 5k)

Approach 2 (Multi-Agent - succeeds):
├── Scout finds relevant 10 files: 15k tokens
├── Planner creates strategy: 20k tokens (new agent)
├── Builder 1 (auth logic): 35k tokens (new agent)
├── Builder 2 (UI changes): 30k tokens (new agent)
├── Tester verifies: 25k tokens (new agent)
└── Max per agent: 35k tokens ✅ (all within limits)

Integration with Other Patterns

Context window protection enables:

Progressive Disclosure:

  • Reduces: Minimal static context
  • Enables: Dynamic loading via priming

Core 4 Management:

  • Protects: Context (pillar #1)
  • Enables: Better model/prompt/tools choices

Orchestration:

  • Requires: Context protection (orchestrator sleep)
  • Enables: Fleet management without overflow

Observability:

  • Monitors: Context usage via hooks
  • Prevents: Unnoticed context explosion

Key Principles

  1. Reduce and Delegate - The only two strategies that matter

  2. A focused agent is a performant agent - Single-purpose beats multi-purpose

  3. Agents are deletable - Free context by removing completed agents

  4. 200k is plenty - Context explosions are design problems

  5. Monitor constantly - /context command is your best friend

  6. Orchestrators must sleep - Don't observe all agent work

  7. Context bundles over full replay - 70% of context in 10% of tokens

Source Attribution

Primary sources:

  • Elite Context Engineering (R&D framework, 4 levels, all tactics)
  • Claude 2.0 (autocompact buffer, hard limits, scout-plan-build)

Supporting sources:

  • One Agent to Rule Them All (orchestrator sleep, 200k principle, deletable agents)
  • Sub-Agents (sub-agent delegation, context isolation)

Key quotes:

  • "200k context window is plenty. You're just stuffing a single agent with too much work." (One Agent)
  • "A focused agent is a performant agent." (Elite Context Engineering)
  • "We were 14% away from exploding our context." (Claude 2.0)
  • "There are only two ways to manage your context window: R and D." (Elite Context Engineering)

Remember: Context window management separates beginners from experts. Master it, and you can scale infinitely with focused agents.