
BAIME Usage Guide

BAIME (Bootstrapped AI Methodology Engineering) - A systematic framework for developing and validating software engineering methodologies through observation, codification, and automation.



What is BAIME?

BAIME integrates three complementary methodologies optimized for LLM-based development:

  1. OCA Cycle (Observe-Codify-Automate) - Core iterative framework
  2. Empirical Validation - Scientific method and data-driven decisions
  3. Value Optimization - Dual-layer value functions for quantitative evaluation

Key Innovation: BAIME treats methodology development like software development—with empirical observation, automated testing, continuous iteration, and quantitative metrics.

Why BAIME?

Problem: Ad-hoc methodology development is slow, subjective, and hard to validate.

Solution: BAIME provides a systematic approach with:

  • Rapid convergence: Typically 3-7 iterations, 6-15 hours
  • Empirical validation: Data-driven evidence, not opinions
  • High transferability: 70-95% reusable across projects
  • Proven results: 100% success rate across 8 experiments, 10-50x speedup

BAIME in Action

Example Results:

  • Testing Strategy: 15x speedup, 89% transferability
  • CI/CD Pipeline: 2.5-3.5x speedup, 91.7% pattern validation
  • Error Recovery: 95.4% error coverage, 3 iterations
  • Documentation System: 47% token cost reduction, 85% reduction in redundancy
  • Knowledge Transfer: 3-8x onboarding speedup

When to Use BAIME

Use BAIME For

Creating systematic methodologies for:

  • Testing strategies
  • CI/CD pipelines
  • Error handling patterns
  • Observability systems
  • Dependency management
  • Documentation systems
  • Knowledge transfer processes
  • Technical debt management
  • Cross-cutting concerns

When you need:

  • Empirical validation with data
  • Iterative methodology evolution
  • Quantitative quality metrics
  • Transferable best practices
  • Rapid convergence (hours to days, not weeks)

Don't Use BAIME For

  • One-time ad-hoc tasks without reusability goals
  • Trivial processes (<100 lines of code/docs)
  • Established standards that fully solve your problem


Prerequisites

Required

  1. meta-cc plugin installed and configured

  2. Claude Code environment

    • Access to Task tool for subagent invocation
  3. Project with need for methodology

    • Have a specific domain in mind (testing, CI/CD, etc.)
    • Able to measure current state (baseline)

Recommended

  • Familiarity with meta-cc basic features
  • Understanding of your domain (e.g., if developing a testing methodology, know testing basics)
  • Git repository for tracking methodology evolution

Core Concepts

Understanding Value Functions

BAIME uses dual-layer value functions to measure quality at two independent levels:

V_instance: Domain-Specific Quality

Measures the quality of your specific deliverables:

  • Purpose: Assess whether your domain work is high-quality
  • Examples:
    • Testing methodology: Test coverage percentage, test maintainability
    • CI/CD pipeline: Build time, deployment success rate, quality gate coverage
    • Documentation: Completeness, accuracy, usability
  • Characteristics: Domain-dependent, specific to your work

V_meta: Methodology Quality

Measures the quality of the methodology itself:

  • Purpose: Assess whether your methodology is reusable and effective
  • Components:
    • Completeness: All necessary patterns, templates, tools exist
    • Effectiveness: Methodology improves quality and efficiency
    • Reusability: Works across projects with minimal adaptation
    • Validation: Empirically tested and proven effective
  • Characteristics: Domain-independent, universal assessment

Convergence Requirement

Both must reach ≥ 0.80 for methodology to be complete:

  • V_instance ≥ 0.80: Domain work is production-ready
  • V_meta ≥ 0.80: Methodology is reusable
  • If only one converges, keep iterating
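
For concreteness, here is a minimal sketch of how the dual scores might be computed, assuming each layer is an unweighted mean of its rubric components (consistent with the worked baselines later in this guide). The component names, weights, and threshold handling are illustrative, not prescribed by BAIME:

# Hypothetical dual-value calculation (Python sketch)

def layer_score(components: dict) -> float:
    """Unweighted mean of component scores, each in [0.0, 1.0]."""
    return sum(components.values()) / len(components)

v_instance = layer_score({
    "coverage": 0.40,         # domain-specific components (testing example)
    "quality": 0.25,
    "maintainability": 0.40,
})

v_meta = layer_score({
    "completeness": 0.20,     # methodology components (domain-independent)
    "effectiveness": 0.30,
    "reusability": 0.20,
    "validation": 0.30,
})

THRESHOLD = 0.80
converged = v_instance >= THRESHOLD and v_meta >= THRESHOLD
print(f"V_instance={v_instance:.2f}, V_meta={v_meta:.2f}, converged={converged}")
# V_instance=0.35, V_meta=0.25, converged=False (a typical Iteration 0 baseline)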

The OCA Cycle

Each iteration follows the Observe-Codify-Automate cycle:

Observe → Codify → Automate → Evaluate
   ↑                              ↓
   ← ← ← ← ← Iterate ← ← ← ← ← ←

Phase 1: Observe

Goal: Collect empirical data about current state

Activities:

  • Read previous iteration results
  • Measure baseline (Iteration 0) or current state
  • Identify problems and patterns
  • Gather evidence about what's working/not working

Output: Data artifacts documenting observations

Phase 2: Codify

Goal: Extract patterns and create reusable structures

Activities:

  • Form strategy based on evidence
  • Extract recurring patterns into documented forms
  • Create templates for common structures
  • Prioritize improvements based on impact

Output: Patterns, templates, strategy documentation

Phase 3: Automate

Goal: Build tools to improve efficiency and consistency

Activities:

  • Create automation scripts (validators, generators, analyzers)
  • Implement quality gates
  • Build CI integration
  • Execute planned improvements

Output: Working tools, improved deliverables

Phase 4: Evaluate

Goal: Measure progress and assess convergence

Activities:

  • Calculate V_instance and V_meta scores
  • Provide evidence for each component
  • Identify remaining gaps
  • Check convergence criteria

Output: Value scores, gap analysis, convergence decision
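
Read as a loop, the cycle looks roughly like the sketch below. The phase functions are placeholders standing in for the activities listed above, and the stopping test here is only the dual threshold (the full convergence criteria appear later in this guide):

# OCA loop skeleton (Python sketch, illustrative only)

def run_oca(observe, codify, automate, evaluate, max_iterations=7, threshold=0.80):
    for n in range(max_iterations + 1):           # Iteration 0 is the baseline
        data = observe(n)                         # collect empirical evidence
        plan = codify(data)                       # patterns, templates, strategy
        deliverables = automate(plan)             # tools + improved deliverables
        v_instance, v_meta = evaluate(deliverables)
        if v_instance >= threshold and v_meta >= threshold:
            return n, v_instance, v_meta          # dual threshold reached
    return max_iterations, v_instance, v_meta     # stopped without convergence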


Meta-Agent and Specialized Agents

Meta-Agent

The meta-agent orchestrates the entire BAIME process:

Responsibilities:

  • Read lifecycle capabilities before each phase (fresh, no caching)
  • Execute OCA cycle systematically
  • Track system state evolution (M_n, A_n, s_n)
  • Coordinate specialized agents when needed
  • Make evidence-based evolution decisions

Key Behavior: Reads capabilities fresh before each phase to incorporate the latest guidance

Specialized Agents

Domain-specific executors created when evidence shows need:

When created:

  • Generic approach insufficient (demonstrated, not assumed)
  • Task recurs 3+ times with similar structure
  • Clear expected improvement from specialization

Examples:

  • test-generator: Creates tests following validated patterns
  • validator-agent: Checks deliverables against quality criteria
  • knowledge-extractor: Transforms experiment into reusable methodology

Key Principle: Agents evolve based on retrospective evidence (not anticipatory design)


Capabilities and System State

Capabilities

Modular guidance files for each OCA lifecycle phase:

  • capabilities/collect.md - Data collection patterns
  • capabilities/strategy.md - Strategy formation guidance
  • capabilities/execute.md - Execution patterns
  • capabilities/evaluate.md - Evaluation rubrics
  • capabilities/converge.md - Convergence assessment

Evolution:

  • Start empty (placeholders) in Iteration 0
  • Evolve when patterns recur 2-3 times
  • Based on retrospective evidence (not speculation)
  • Read fresh each phase (no caching)

System State

Tracked components across iterations:

  • M_n: Methodology components (capabilities, patterns, templates)
  • A_n: Agent system (specialized agents)
  • s_n: Current state (deliverables, artifacts, value scores)
  • V(s_n): Dual value functions (V_instance, V_meta)

State transition: s_{n-1} → s_n documents evolution


Convergence Criteria

A methodology is complete and production-ready when all four conditions are met:

1. Dual Threshold

  • V_instance ≥ 0.80 (domain goals achieved)
  • V_meta ≥ 0.80 (methodology quality high)

2. System Stability

  • M_n == M_{n-1} (no methodology changes)
  • A_n == A_{n-1} (no agent evolution)
  • Stable for 2+ consecutive iterations

3. Objectives Complete

  • All planned work finished
  • No critical gaps remaining

4. Diminishing Returns

  • ΔV_instance < 0.02 for 2+ iterations
  • ΔV_meta < 0.02 for 2+ iterations

Note: If system evolves (new agent/capability), stability clock resets. Evolution must be validated in next iteration before convergence.
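
Criteria 1, 2, and 4 are mechanical enough to check programmatically. Here is a sketch, assuming each iteration record carries its value scores plus version markers for M_n and A_n (field names are illustrative, not a prescribed schema):

# Convergence check over iteration history (Python sketch, illustrative only)
from dataclasses import dataclass

@dataclass
class IterationRecord:
    v_instance: float
    v_meta: float
    m_version: str    # fingerprint of methodology components (M_n)
    a_version: str    # fingerprint of agent system (A_n)

def has_converged(history, threshold=0.80, epsilon=0.02, stable_window=2):
    if len(history) < stable_window + 1:
        return False                      # not enough iterations to judge stability
    current = history[-1]
    window = history[-(stable_window + 1):]

    # 1. Dual threshold
    dual = current.v_instance >= threshold and current.v_meta >= threshold
    # 2. System stability: no M/A changes across the window
    stable = all(r.m_version == current.m_version and
                 r.a_version == current.a_version for r in window)
    # 4. Diminishing returns: small deltas for the last stable_window transitions
    diminishing = all(abs(b.v_instance - a.v_instance) < epsilon and
                      abs(b.v_meta - a.v_meta) < epsilon
                      for a, b in zip(window, window[1:]))
    # 3. "Objectives complete" remains a human judgment outside this sketch
    return dual and stable and diminishing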


Frequently Asked Questions

General Questions

What exactly is BAIME and how is it different from other methodologies?

BAIME (Bootstrapped AI Methodology Engineering) is a meta-methodology for developing domain-specific methodologies through empirical observation and iteration. Unlike traditional methodologies that are designed upfront, BAIME creates methodologies through practice:

  • Traditional approach: Design methodology → Apply → Hope it works
  • BAIME approach: Observe patterns → Extract methodology → Validate → Iterate

Key differentiators:

  • Dual-layer value functions measure both deliverable quality AND methodology quality
  • Evidence-driven evolution (not anticipatory design)
  • Quantitative convergence criteria (≥0.80 thresholds)
  • Specialized subagents for consistent execution

When should I use BAIME vs just following existing best practices?

Use BAIME when:

  • No established methodology fully fits your domain
  • You need methodology customized to your project constraints
  • You want empirically validated patterns, not borrowed practices
  • You need to measure and prove methodology effectiveness

Use existing practices when:

  • Industry-standard methodology already solves your problem
  • Team already trained on established framework
  • Project timeline doesn't allow methodology development
  • Problem domain is simple and well-understood

Use both: Start with BAIME to develop baseline, then integrate proven external practices in later iterations.

How long does a typical BAIME experiment take?

Typical timeline:

  • Iteration 0 (Baseline): 2-4 hours
  • Iterations 1-N: 3-6 hours each
  • Total: 10-30 hours over 3-7 iterations
  • Knowledge extraction: 2-4 hours post-convergence

Time factors:

  • Domain complexity (testing < CI/CD < architecture)
  • Baseline quality (higher baseline → fewer iterations)
  • Team familiarity with BAIME (improves with practice)
  • Automation investment (upfront cost, ongoing savings)

ROI: 10-50x speedup on future work justifies investment. A 20-hour methodology development that saves 10 hours per month pays off in month 2.

What if my value scores aren't improving between iterations?

Diagnostic steps:

  1. Check if addressing root problems:

    • Review problem identification from previous iteration
    • Are you solving symptoms vs causes?
    • Example: Low test coverage may be due to unclear testing strategy, not lack of tests
  2. Verify evidence quality:

    • Is data collection comprehensive?
    • Are you making evidence-based decisions?
    • Review data artifacts - do they support your strategy?
  3. Assess scope:

    • Trying to fix too many things?
    • Focus on top 2-3 highest-impact problems
    • Better to solve 2 problems well than 5 problems poorly
  4. Challenge your scoring:

    • Are scores honest (vs inflated)?
    • Seek disconfirming evidence
    • Compare against rubric, not "could be worse"
  5. Consider system evolution:

    • Do you need specialized agent for recurring complex task?
    • Would new capability help structure repeated work?
    • Evolution requires evidence of insufficiency (not speculation)

If still stuck after 2-3 iterations: Re-examine value function definitions. May need to adjust components or convergence targets.

Usage Questions

Can I use BAIME for [specific domain]?

BAIME works for any software engineering domain where:

  • You can measure quality objectively
  • Patterns emerge from practice
  • Work involves 100+ lines of code/docs
  • Results will be reused (methodology has value)

Proven domains (8 successful experiments):

  • Testing strategy
  • CI/CD pipelines
  • Error recovery
  • Observability instrumentation
  • Dependency management
  • Documentation systems
  • Knowledge transfer
  • Technical debt management

Untested but promising:

  • API design
  • Database migration
  • Performance optimization
  • Security review processes
  • Code review workflows

Probably not suitable:

  • One-time tasks (no reusability)
  • Trivial processes (<1 hour total work)
  • Domains with perfect existing solutions

Do I need the meta-cc plugin to use BAIME?

For full BAIME workflow: Yes, meta-cc provides:

  • Session history analysis (understanding past work)
  • MCP tools for querying patterns
  • Specialized subagents (iteration-executor, knowledge-extractor)
  • /meta command for quick insights

Without meta-cc: You can still apply BAIME principles:

  • Manual OCA cycle execution
  • Self-tracked value functions
  • Evidence collection through notes/logs
  • Pattern extraction through reflection

Recommendation: Use meta-cc. The 5-minute installation saves hours of manual tracking and provides empirical data for better decisions.

How do I know when to create a specialized agent?

Create specialized agent when (all three conditions):

  1. Evidence of insufficiency:

    • Generic approach tried and struggled
    • Task complexity consistently high
    • Errors or quality issues recurring
  2. Pattern recurrence:

    • Task performed 3+ times across iterations
    • Similar structure each time
    • Clear enough to codify
  3. Expected improvement:

    • Can articulate what agent will do better
    • Have evidence from past attempts
    • Benefit justifies creation cost

Don't create agent when:

  • Task only done 1-2 times (insufficient evidence)
  • Generic approach working fine
  • Speculation about future need (wait for evidence)

Example: In testing methodology, created test-generator agent after:

  • Iteration 0-1: Manually wrote tests (worked but slow)
  • Iteration 2: Pattern clear (fixture → arrange → act → assert; sketched below)
  • Iteration 3: Created agent, 3x speedup validated
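
In pytest terms, that pattern looks roughly like the sketch below (illustrative only; the experiment itself targeted Go, and the names are hypothetical):

# fixture → arrange → act → assert (Python/pytest sketch)
import pytest

@pytest.fixture
def user_store():
    # Fixture: shared setup, isolated per test
    return {"alice": {"role": "admin"}}

def test_lookup_returns_admin_role(user_store):
    # Arrange
    username = "alice"
    # Act
    role = user_store[username]["role"]
    # Assert
    assert role == "admin"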

Technical Questions

What's the difference between capabilities and agents?

Capabilities (meta-agent lifecycle phases):

  • Purpose: Guide meta-agent through OCA cycle phases
  • Content: Patterns, guidelines, checklists for each phase
  • Location: capabilities/ directory (e.g., capabilities/collect.md)
  • Evolution: Based on retrospective evidence (start as placeholders)
  • Example: Strategy formation capability contains prioritization patterns

Agents (specialized executors):

  • Purpose: Execute specific domain tasks
  • Content: Domain expertise, task-specific workflows
  • Location: agents/ directory (e.g., agents/test-generator.md)
  • Evolution: Created when evidence shows insufficiency
  • Example: Test generator agent creates tests following patterns

Analogy:

  • Capabilities = "How to think about the work" (meta-level)
  • Agents = "How to do the work" (execution-level)

Both:

  • Start as placeholders (empty files)
  • Evolve based on evidence (not anticipatory design)
  • Read fresh each time (no caching)

How do capabilities evolve during iterations?

Evolution trigger: Retrospective evidence of pattern recurrence

Process:

  1. Iteration 0-1: Capabilities are placeholders (empty)

    • Meta-agent works generically
    • Patterns emerge during work
  2. Iteration 2-3: Evidence accumulates

    • Same problems recur
    • Solutions follow similar patterns
    • Decision points become predictable
  3. Evolution point: When pattern recurs 2-3 times

    • Extract pattern to relevant capability
    • Document guidance based on what worked
    • Add to capability file
  4. Validation: Next iteration tests guidance

    • Does following capability improve outcomes?
    • Are value scores higher?
    • Is work more efficient?

Example: In CI/CD methodology:

  • Iteration 0-1: Strategy capability empty
  • Iteration 2: Same prioritization pattern used twice (quality gates > performance > observability)
  • Iteration 2 end: Extracted to strategy.md capability
  • Iteration 3: Following capability saved 30 minutes of decision-making

Key principle: Capabilities codify what worked, not what might work.

Convergence Questions

Can I stop before reaching 0.80 thresholds?

Yes, but understand trade-offs:

Stop at V_instance < 0.80:

  • Deliverable is incomplete or lower quality
  • May need significant rework for production use
  • Methodology validation is weak

Stop at V_meta < 0.80:

  • Methodology is not fully reusable
  • Transferability to other projects questionable
  • May be project-specific, not universal

When early stopping is acceptable:

  • Proof of concept (showing BAIME works for domain)
  • Time constraints (better to have 0.70 than nothing)
  • Sufficient for current needs (will iterate later)
  • Learning exercise (not production use)

When to push for full convergence:

  • Production deliverable needed
  • Methodology will be shared/reused
  • Investment in convergence pays off quickly
  • Demonstrating BAIME effectiveness

Recommendation: Aim for dual convergence. The final iterations often provide the highest-value insights.

What if iterations take longer than estimated?

Common in early BAIME use:

  • First experiment: 20-40 hours (learning BAIME itself)
  • Second experiment: 15-25 hours (familiar with process)
  • Third+ experiment: 10-20 hours (efficient execution)

Time optimization strategies:

  1. Invest in baseline (Iteration 0):

    • 3-4 hours in Iteration 0 can save 6+ hours overall
    • Higher V_meta_0 (≥0.40) enables rapid convergence
  2. Use specialized subagents:

    • iteration-executor saves 1-2 hours per iteration
    • knowledge-extractor saves 4-6 hours post-convergence
  3. Time-box template creation:

    • Set a 1.5-hour limit per template
    • Quality over quantity (3 excellent > 5 mediocre)
  4. Batch similar work:

    • Create all templates together (avoids context-switching cost)
    • Run all automation tools together (more efficient testing)
  5. Defer low-ROI items:

    • Visual aids can wait (2 hours for +0.03 impact)
    • A second example can wait if the first validates the pattern

If you are consistently over your time estimates: Review your value function definitions; they may be too ambitious for the domain's complexity.


Quick Start

1. Define Your Domain

Choose the methodology you want to develop:

Examples:
- "Develop systematic testing strategy for Go projects"
- "Create CI/CD pipeline methodology with quality gates"
- "Build error recovery patterns for web services"
- "Establish documentation management system"

2. Establish Baseline

Measure current state in your domain:

# Example: Testing domain
- Current coverage: 65%
- Test approach: Ad-hoc
- No systematic patterns
- Estimated effort: High

# Example: CI/CD domain
- Build time: 5 minutes
- No quality gates
- Manual releases
- No smoke tests
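
One lightweight way to capture such a baseline (testing example) is a small script that records current coverage into a data artifact. This sketch assumes a Go project and illustrative paths; adapt the commands and fields to your stack:

# Hypothetical baseline capture for the testing example (Python sketch)
import datetime
import pathlib
import subprocess

subprocess.run(["go", "test", "-coverprofile=coverage.out", "./..."], check=True)
report = subprocess.run(["go", "tool", "cover", "-func=coverage.out"],
                        check=True, capture_output=True, text=True).stdout
total_line = [line for line in report.splitlines() if line.startswith("total:")][-1]
coverage = total_line.split()[-1]        # e.g. "65.0%"

artifact = pathlib.Path("data/current-testing-state.md")
artifact.parent.mkdir(parents=True, exist_ok=True)
artifact.write_text(
    f"# Testing baseline ({datetime.date.today()})\n"
    f"- Current coverage: {coverage}\n"
    f"- Test approach: Ad-hoc\n"
)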

3. Set Dual Goals

Define objectives for both layers:

Instance Goal (domain-specific):

  • "Reach 80% test coverage with systematic strategy"
  • "Reduce CI/CD build time to <2 minutes with quality gates"

Meta Goal (methodology):

  • "Create reusable testing strategy with 85%+ transferability"
  • "Develop CI/CD methodology applicable to any Go project"

4. Create Experiment Structure

# Create experiment directory
mkdir -p experiments/my-methodology

# Use iteration-prompt-designer subagent
# (See Specialized Agents section below)

5. Start Iteration 0

Execute baseline iteration using iteration-executor subagent.


Step-by-Step Workflow

Phase 0: Experiment Setup

Goal: Create experiment structure and iteration prompts

Steps:

  1. Create experiment directory:

    cd your-project
    mkdir -p experiments/my-methodology-name
    cd experiments/my-methodology-name
    
  2. Design iteration prompts (use iteration-prompt-designer subagent):

    User: "Design ITERATION-PROMPTS.md for [domain] methodology experiment"
    
    Agent creates:
    - ITERATION-PROMPTS.md (comprehensive iteration guidance)
    - Architecture overview (meta-agent + agents)
    - Value function definitions
    - Baseline iteration steps
    
  3. Review and customize:

    • Adjust value function components for your domain
    • Customize baseline iteration steps
    • Set convergence targets

Output: ITERATION-PROMPTS.md ready for execution


Phase 1: Iteration 0 (Baseline)

Goal: Establish baseline measurements and initial system state

Steps:

  1. Execute iteration (use iteration-executor subagent):

    User: "Execute Iteration 0 for [domain] methodology using iteration-executor"
    
  2. Iteration-executor will:

    • Create modular architecture (capabilities, agents, system state)
    • Collect baseline data
    • Create first deliverables (low quality expected)
    • Calculate V_instance_0 and V_meta_0 (honest assessment)
    • Identify problems and gaps
    • Generate iteration-0.md documentation
  3. Review baseline results:

    # Check value scores
    cat system-state.md
    
    # Review iteration documentation
    cat iteration-0.md
    
    # Check identified problems
    grep "Problems" system-state.md
    

Expected Baseline: V_instance: 0.20-0.40, V_meta: 0.15-0.30

Key Principle: Low scores are expected and acceptable. This is measurement baseline, not final product.


Phase 2: Iterations 1-N (Evolution)

Goal: Iteratively improve both deliverables and methodology until convergence

For Each Iteration:

  1. Read system state:

    cat system-state.md  # Current scores and problems
    cat iteration-log.md # Iteration history
    
  2. Execute iteration (use iteration-executor):

    User: "Execute Iteration N for [domain] methodology using iteration-executor"
    
  3. Iteration-executor follows OCA cycle:

    Observe:

    • Read all capabilities for methodology context
    • Collect data on prioritized problems
    • Gather evidence about current state

    Codify:

    • Form strategy based on evidence
    • Plan specific improvements
    • Set iteration targets

    Execute:

    • Create/improve deliverables
    • Apply methodology patterns
    • Document execution observations

    Evaluate:

    • Calculate V_instance_N and V_meta_N
    • Provide evidence for each score component
    • Identify remaining gaps

    Converge:

    • Check convergence criteria
    • Extract patterns (if evidence supports)
    • Update capabilities (if retrospective evidence shows gaps)
    • Prioritize next iteration focus
  4. Review iteration results:

    cat iteration-N.md      # Complete iteration documentation
    cat system-state.md     # Updated scores and state
    cat iteration-log.md    # Updated history
    
  5. Check convergence:

    • V_instance ≥ 0.80?
    • V_meta ≥ 0.80?
    • Both stable for 2+ iterations?
    • If YES → Converged! Move to Phase 3
    • If NO → Continue to next iteration

Typical Iteration Count: 3-7 iterations to convergence


Phase 3: Knowledge Extraction (Post-Convergence)

Goal: Transform experiment artifacts into reusable methodology

Steps:

  1. Use knowledge-extractor subagent:

    User: "Extract methodology from [domain] experiment using knowledge-extractor"
    
  2. Knowledge-extractor creates:

    • Methodology guide (comprehensive documentation)
    • Pattern library (extracted patterns)
    • Template collection (reusable templates)
    • Automation tools (scripts, validators)
    • Best practices (principles discovered)
  3. Package as skill (optional):

    # Create skill structure
    mkdir -p .claude/skills/my-methodology
    
    # Copy extracted knowledge
    cp -r patterns templates .claude/skills/my-methodology/
    
    # Create SKILL.md
    # (See knowledge-extractor output for template)
    

Output: Reusable methodology ready for other projects


Specialized Agents

BAIME provides three specialized Claude Code subagents:

iteration-prompt-designer

Purpose: Design comprehensive ITERATION-PROMPTS.md for your experiment

When to use: At experiment start, before Iteration 0

Invocation:

Use Task tool with subagent_type="iteration-prompt-designer"

Example:
"Design ITERATION-PROMPTS.md for CI/CD optimization methodology experiment"

What it creates:

  • Modular meta-agent architecture definition
  • Domain-specific value function design
  • Baseline iteration (Iteration 0) detailed steps
  • Subsequent iteration templates
  • Evidence-driven evolution guidance

Time saved: 2-3 hours of setup work


iteration-executor

Purpose: Execute iteration through complete OCA cycle

When to use: For each iteration (Iteration 0, 1, 2, ...)

Invocation:

Use Task tool with subagent_type="iteration-executor"

Example:
"Execute Iteration 2 of testing methodology experiment using iteration-executor"

What it does:

  1. Reads previous iteration state
  2. Reads all capability files (fresh, no caching)
  3. Executes lifecycle phases:
    • Data Collection (Observe)
    • Strategy Formation (Codify)
    • Work Execution (Automate)
    • Evaluation (Calculate dual values)
    • Convergence Check (Assess progress)
  4. Generates complete iteration-N.md documentation
  5. Updates system-state.md and iteration-log.md

Benefits:

  • Consistent iteration structure
  • Systematic value calculation (reduces bias)
  • Proper convergence evaluation
  • Complete artifact generation
  • Structured execution vs ad-hoc

knowledge-extractor

Purpose: Extract and transform converged experiment into reusable methodology

When to use: After experiment converges

Invocation:

Use Task tool with subagent_type="knowledge-extractor"

Example:
"Extract methodology from documentation-management experiment using knowledge-extractor"

What it creates:

  • Methodology guide (user-facing documentation)
  • Pattern library (validated patterns)
  • Template collection (reusable templates)
  • Automation tools (scripts, validators)
  • Best practices guide (principles)
  • Skill package (optional .claude/skills/ structure)

Time saved: 4-6 hours of knowledge organization work


Practical Example

Example: Developing Testing Methodology

Domain: Systematic testing strategy for Go projects

Step 1: Setup

# Create experiment
mkdir -p experiments/testing-methodology
cd experiments/testing-methodology

# Design iteration prompts
# (Use iteration-prompt-designer subagent)

Result: ITERATION-PROMPTS.md created with:

  • Value functions for testing (coverage, quality, maintainability)
  • Baseline iteration steps
  • Testing-specific guidance

Step 2: Iteration 0 (Baseline)

User: "Execute Iteration 0 of testing methodology using iteration-executor"

What happens:

  1. Architecture created:

    testing-methodology/
    ├── capabilities/
    │   ├── test-collect.md      (placeholder)
    │   ├── test-strategy.md     (placeholder)
    │   ├── test-execute.md      (placeholder)
    │   ├── test-evaluate.md     (placeholder)
    │   └── test-converge.md     (placeholder)
    ├── agents/
    │   ├── test-generator.md    (placeholder)
    │   └── test-validator.md    (placeholder)
    ├── data/
    ├── patterns/
    ├── templates/
    ├── system-state.md
    ├── iteration-log.md
    └── knowledge-index.md
    
  2. Data collected:

    data/current-testing-state.md:
    - Current coverage: 65%
    - Test approach: Ad-hoc unit tests
    - No integration test strategy
    - No TDD workflow
    
  3. First deliverable created:

    # Example: Basic test helper function
    # Quality: Low (expected at this baseline stage)
    
  4. Baseline scores calculated:

    V_instance_0: 0.35
    - Coverage: 0.40 (65% actual, target 80%)
    - Quality: 0.25 (ad-hoc, no systematic approach)
    - Maintainability: 0.40 (some organization)
    
    V_meta_0: 0.25
    - Completeness: 0.20 (capabilities empty)
    - Effectiveness: 0.30 (no proven patterns yet)
    - Reusability: 0.20 (project-specific so far)
    - Validation: 0.30 (baseline measurement only)
    
  5. Problems identified:

    • No TDD workflow
    • Coverage gaps unknown
    • Test organization unclear
    • No fixture patterns

Output: iteration-0.md with complete baseline documentation
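
(These baseline scores are consistent with an unweighted mean of the components: (0.40 + 0.25 + 0.40) / 3 = 0.35 for V_instance_0, and (0.20 + 0.30 + 0.20 + 0.30) / 4 = 0.25 for V_meta_0.)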

Step 3: Iteration 1 (First Improvement)

User: "Execute Iteration 1 of testing methodology using iteration-executor"

Focused on: TDD workflow and coverage analysis

Results:

  • Created TDD workflow pattern
  • Built coverage gap analyzer tool
  • Improved test organization
  • V_instance_1: 0.55 (+0.20)
  • V_meta_1: 0.45 (+0.20)

Step 4: Iterations 2-3 (Evolution)

Continued iterations until:

  • V_instance_3: 0.85
  • V_meta_3: 0.83
  • Both stable (no major changes in iteration 4)

Convergence achieved!

Step 5: Knowledge Extraction

User: "Extract methodology from testing-methodology experiment using knowledge-extractor"

Created:

  • methodology/testing-strategy.md (comprehensive guide)
  • 8 validated patterns
  • 3 reusable templates
  • Coverage analyzer tool
  • Test generator script

Result: Reusable testing methodology ready for other Go projects


Example 2: Developing Error Recovery Methodology

Domain: Systematic error handling and recovery patterns for software systems

Why This Example: Demonstrates BAIME applicability to a different domain (error handling vs testing), showing methodology transferability and universal OCA cycle pattern.

Step 1: Setup

# Create experiment
mkdir -p experiments/error-recovery
cd experiments/error-recovery

# Design iteration prompts
# (Use iteration-prompt-designer subagent)

Result: ITERATION-PROMPTS.md created with:

  • Value functions for error recovery (coverage, diagnostic quality, recovery effectiveness)
  • Error taxonomy definition
  • Recovery pattern identification

Step 2: Iteration 0 (Baseline)

User: "Execute Iteration 0 of error-recovery methodology using iteration-executor"

What happens:

  1. Architecture created:

    error-recovery/
    ├── capabilities/
    │   ├── error-collect.md       (placeholder)
    │   ├── error-strategy.md      (placeholder)
    │   ├── error-execute.md       (placeholder)
    │   ├── error-evaluate.md      (placeholder)
    │   └── error-converge.md      (placeholder)
    ├── agents/
    │   ├── error-analyzer.md      (placeholder)
    │   └── error-classifier.md    (placeholder)
    ├── data/
    ├── patterns/
    ├── templates/
    ├── system-state.md
    ├── iteration-log.md
    └── knowledge-index.md
    
  2. Data collected:

    data/error-analysis.md:
    - Historical errors: 1,336 instances analyzed
    - Error handling: Ad-hoc, inconsistent
    - Recovery patterns: None documented
    - MTTD/MTTR: High, no systematic diagnosis
    
  3. First deliverable created:

    # Initial error taxonomy (13 categories)
    # Quality: Basic classification, no recovery patterns yet
    
  4. Baseline scores calculated:

    V_instance_0: 0.40
    - Coverage: 0.50 (errors classified, not all types covered)
    - Diagnostic Quality: 0.30 (basic categorization only)
    - Recovery Effectiveness: 0.25 (no systematic recovery)
    - Documentation: 0.55 (taxonomy exists)
    
    V_meta_0: 0.30
    - Completeness: 0.25 (taxonomy only, no workflows)
    - Effectiveness: 0.35 (classification helpful but limited)
    - Reusability: 0.25 (domain-specific so far)
    - Validation: 0.35 (validated against 1,336 historical errors)
    
  5. Problems identified:

    • No systematic diagnosis workflow
    • No recovery patterns
    • No prevention guidelines
    • Taxonomy incomplete (95.4% coverage, gaps exist)

Output: iteration-0.md with complete baseline documentation

Key Difference from Testing Example: Error Recovery started with rich historical data (1,336 errors), enabling retrospective validation from Iteration 0. This demonstrates how domain characteristics affect baseline quality (V_instance_0 = 0.40 vs Testing's 0.35).

Step 3: Iteration 1 (Diagnostic Workflows)

User: "Execute Iteration 1 of error-recovery methodology using iteration-executor"

Focused on: Creating diagnostic workflows and expanding taxonomy

Results:

  • Created 8 diagnostic workflows (file operations, API calls, data validation, etc.)
  • Expanded error taxonomy to 13 categories
  • Added contextual logging patterns
  • V_instance_1: 0.62 (+0.22, significant jump due to workflow addition)
  • V_meta_1: 0.50 (+0.20, patterns emerging)

Pattern Emerged: Error diagnosis follows a consistent structure:

  1. Symptom identification
  2. Context gathering
  3. Root cause analysis
  4. Solution selection

Step 4: Iteration 2 (Recovery Patterns and Prevention)

User: "Execute Iteration 2 of error-recovery methodology using iteration-executor"

Focused on: Recovery patterns, prevention guidelines, automation

Results:

  • Documented 5 recovery patterns (retry, fallback, circuit breaker, graceful degradation, fail-fast); retry with fallback is sketched after this list
  • Created 8 prevention guidelines
  • Built 3 automation tools (file path validation, read-before-write check, file size validation)
  • V_instance_2: 0.78 (+0.16, approaching convergence)
  • V_meta_2: 0.72 (+0.22, acceleration due to automation)
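
Of the recovery patterns listed above, retry and fallback are the most mechanical; a minimal combined sketch (illustrative only, not code from the experiment):

# Bounded retry with exponential backoff, degrading to a fallback (Python sketch)
import time

def with_retry_and_fallback(operation, fallback, attempts=3, base_delay=0.5):
    """Run operation; retry transient failures, then degrade to fallback."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as exc:              # production code would narrow this
            if attempt == attempts:
                return fallback(exc)          # graceful degradation on final failure
            time.sleep(base_delay * 2 ** (attempt - 1))

# Usage sketch (hypothetical callables):
# result = with_retry_and_fallback(fetch_from_api, lambda exc: load_cached_copy())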

Automation Impact: Prevention tools covered 23.7% of historical errors, proving methodology effectiveness empirically.

Step 5: Iteration 3 (Convergence)

User: "Execute Iteration 3 of error-recovery methodology using iteration-executor"

Focused on: Final validation, cross-language transferability

Results:

  • Validated patterns across 4 languages (Go, Python, JavaScript, Rust)
  • Achieved 95.4% error coverage (1,274/1,336 historical errors)
  • Transferability assessment: 85-90% universal patterns
  • V_instance_3: 0.83 (+0.05, exceeded threshold)
  • V_meta_3: 0.85 (+0.13, strong convergence)

System Stability: No capability or agent evolution needed (3 iterations stable) - generic OCA cycle sufficient.

Convergence Status: CONVERGED

  • Both layers > 0.80
  • System stable (M_3 == M_2, A_3 == A_2)
  • Objectives complete
  • Total time: ~10 hours over 3 iterations

Step 6: Knowledge Extraction

User: "Extract methodology from error-recovery experiment using knowledge-extractor"

Created:

  • methodology/error-recovery.md (comprehensive 13-category taxonomy)
  • 8 diagnostic workflows
  • 5 recovery patterns
  • 8 prevention guidelines
  • 3 automation tools (file validation, read-before-write, size validation)
  • Retrospective validation report (95.4% historical error coverage)

Result: Reusable error recovery methodology with 85-90% transferability across languages/platforms

Transferability Evidence:

  • Core concepts: 100% universal (error taxonomy, diagnostic workflows)
  • Recovery patterns: 95% universal (retry, fallback, circuit breaker work everywhere)
  • Automation tools: 60% universal (concepts transfer, implementations vary by language)

Comparing the Two Examples

| Aspect | Testing Methodology | Error Recovery Methodology |
| --- | --- | --- |
| Domain Complexity | Medium (test strategies, patterns) | High (13 error categories, recovery patterns) |
| Baseline Data | Limited (current tests only) | Rich (1,336 historical errors) |
| V_instance_0 | 0.35 | 0.40 (higher due to historical data) |
| V_meta_0 | 0.25 | 0.30 (retrospective validation possible) |
| Iterations to Converge | 3-4 iterations | 3 iterations (rapid due to data richness) |
| Total Time | ~12 hours | ~10 hours (rich baseline enabled efficiency) |
| Transferability | 89% (Go projects) | 85-90% (universal, cross-language) |
| Key Innovation | TDD workflow, coverage analyzer | Error taxonomy, diagnostic workflows, prevention |
| System Evolution | Stable (no agent specialization) | Stable (no agent specialization) |

Universal Lessons:

  1. Rich baseline data accelerates convergence (Error Recovery's 1,336 errors vs Testing's current state)
  2. OCA cycle works across domains (same structure, different content)
  3. System stability is common (both examples: no agent evolution needed)
  4. Retrospective validation powerful (Error Recovery: 95.4% coverage proves methodology)
  5. Automation provides empirical evidence (23.7% error prevention measurable)

BAIME Transferability Confirmed: Same methodology framework produced high-quality results in two distinct domains (testing vs error handling), demonstrating universal applicability.


Troubleshooting

Issue: Value scores not improving

Symptoms: V_instance or V_meta stuck or decreasing across iterations

Example:

Iteration 0: V_instance = 0.35, V_meta = 0.25
Iteration 1: V_instance = 0.37, V_meta = 0.28  (minimal progress)
Iteration 2: V_instance = 0.34, V_meta = 0.30  (instance decreased!)

Diagnosis:

Root Cause 1: Solving symptoms, not problems

❌ Problem identified: "Low test coverage"
❌ Solution attempted: "Write more tests"
❌ Result: Coverage increased but tests are brittle, hard to maintain

✅ Better problem: "No systematic testing strategy"
✅ Better solution: "Create TDD workflow, extract test patterns"
✅ Result: Fewer tests, but higher quality and maintainable

Root Cause 2: Strategy not evidence-based

❌ Strategy: "Let's add integration tests because they seem useful"
❌ Evidence: None (speculation)

✅ Strategy: "Data shows 80% of bugs in API layer, add API tests"
✅ Evidence: Bug analysis from data/bug-analysis.md

Root Cause 3: Scope too broad

❌ Iteration 2 plan: Fix 7 problems (test coverage, CI/CD, docs, errors)
❌ Result: All partially done, none well done

✅ Iteration 2 plan: Fix top 2 problems (test strategy, coverage analysis)
✅ Result: Both fully solved, measurable improvement

Solutions:

  1. Re-examine problem identification:

    • Are you solving root causes or symptoms?
    • Review data artifacts - do they support your problem statement?
    • Ask "why" 3 times to find root cause
  2. Verify evidence quality:

    • Is data collection comprehensive?
    • Do you have concrete measurements?
    • Can you show before/after data?
  3. Narrow focus:

    • Address top 2-3 highest-impact problems only
    • Better to solve 2 problems completely than 5 partially
    • Defer lower-priority items to next iteration
  4. Re-evaluate strategy:

    • Is it based on data or assumptions?
    • Review iteration-N-strategy.md for evidence
    • Challenge each planned improvement: "What evidence supports this?"

Issue: Methodology not transferable (low V_meta Reusability)

Symptoms: V_meta Reusability component < 0.60 after multiple iterations

Example:

Iteration 2 evaluation:
- Completeness: 0.70 ✅
- Effectiveness: 0.75 ✅
- Reusability: 0.45 ❌ (blocking convergence)
- Validation: 0.65 ✅

Diagnosis:

Problem: Patterns too project-specific

Before (Low Reusability):

# Testing Pattern
1. Create test file in src/api/handlers/__tests__/
2. Import UserModel from "../../models/user"
3. Use Jest expect() assertions
4. Run with npm test

After (High Reusability):

# Testing Pattern (Parameterized)
1. Create test file adjacent to source: {source_dir}/__tests__/{module}_test{ext}
2. Import module under test: {import_statement}
3. Use test framework assertion: {assertion_method}
4. Run with project test command: {test_command}

Adaptation guide:
- Go: {ext}=.go, {assertion_method}=testing.T methods
- JS: {ext}=.js, {assertion_method}=expect() or assert()
- Python: {ext}=.py, {assertion_method}=unittest assertions

Problem: No abstraction of domain concepts

Before:

# CI/CD Pattern
- Install Go 1.21
- Run go test ./...
- Build with go build -o bin/app
- Check coverage is >80%

After (Abstracted):

# CI/CD Quality Gate Pattern

Universal steps:
1. Install language runtime (version from project config)
2. Run test suite (project-specific command)
3. Build artifact (project-specific build process)
4. Verify quality threshold (configurable threshold)

Domain-specific implementations:
- Go: {runtime}=Go 1.21+, {test}=go test, {build}=go build
- Node: {runtime}=Node 18+, {test}=npm test, {build}=npm run build
- Python: {runtime}=Python 3.10+, {test}=pytest, {build}=python setup.py

Solutions:

  1. Extract universal patterns:

    • Identify what's essential vs project-specific
    • Replace hardcoded values with parameters
    • Document adaptation guide
  2. Create parameterized templates:

    • Use placeholders: {variable_name}
    • Provide examples for 3+ different contexts
    • Include "How to adapt" section
  3. Test across scenarios:

    • Apply pattern to different project in same domain
    • Document what needed changing
    • Refine pattern based on adaptation effort
  4. Add abstraction layers:

    • Layer 1: Universal principle (works anywhere)
    • Layer 2: Domain-specific implementation (testing/CI/CD/etc)
    • Layer 3: Tool-specific details (Jest/pytest/etc)

Issue: Can't reach convergence (stuck at V ~0.70)

Symptoms: Multiple iterations without reaching 0.80

Causes:

  • Unrealistic convergence targets
  • Missing critical patterns
  • Need specialized agent but using generic approach

Solutions:

  1. Review value function definitions - are they appropriate?
  2. Identify missing methodology components
  3. Consider creating specialized agent if problem recurs
  4. Re-assess convergence criteria - is 0.80 realistic for this domain?

Issue: Too many iterations (>10)

Symptoms: Slow convergence, many iterations needed

Causes:

  • Insufficient baseline (V_meta_0 < 0.20)
  • Not extracting patterns early enough
  • Too conservative improvements

Solutions:

  1. Improve baseline iteration - invest more time in Iteration 0
  2. Extract patterns when they recur (don't wait)
  3. Make bolder improvements (test larger changes)
  4. Use specialized agents earlier

Issue: Premature convergence claims

Symptoms: Claiming convergence but quality obviously low

Causes:

  • Inflated value scores (not honest assessment)
  • Comparing to "could be worse" instead of rubrics
  • Time pressure leading to rushed evaluation

Solutions:

  1. Seek disconfirming evidence actively
  2. Test deliverables thoroughly
  3. Enumerate gaps explicitly
  4. Challenge high scores with extra scrutiny
  5. Remember: Honest assessment is more valuable than fast convergence

Next Steps

After Your First BAIME Experiment

  1. Review iteration documentation - See what worked, what didn't
  2. Extract lessons learned - Document insights about BAIME process
  3. Apply methodology - Use created methodology in real work
  4. Share knowledge - Package as skill or contribute back

Advanced Topics

  • Baseline Quality Assessment - Achieve comprehensive baseline (V_meta ≥ 0.40 in Iteration 0) for faster convergence
  • Rapid Convergence - Techniques for 3-4 iteration methodology development
  • Agent Specialization - When and how to create specialized agents
  • Retrospective Validation - Validate methodology against historical data
  • Cross-Domain Transfer - Apply methodology to different projects

See individual skills for detailed guidance:

  • baseline-quality-assessment
  • rapid-convergence
  • agent-prompt-evolution
  • retrospective-validation

Further Reading

Getting Help

  • Check skill documentation: .claude/skills/methodology-bootstrapping/
  • Review example experiments: experiments/bootstrap-*/
  • Use @meta-coach: Ask for workflow optimization guidance
  • Read iteration documentation: See how past experiments evolved

Summary

BAIME provides:

  • Systematic framework for methodology development
  • Empirical validation with data-driven decisions
  • Dual-layer value functions for quality measurement
  • Specialized agents for streamlined execution
  • Proven results: 10-50x speedup, 70-95% transferability

Key workflow:

  1. Define domain and dual goals
  2. Design iteration prompts (iteration-prompt-designer)
  3. Execute Iteration 0 baseline (iteration-executor)
  4. Iterate until convergence (typically 3-7 iterations)
  5. Extract knowledge (knowledge-extractor)
  6. Apply methodology to real work

Remember:

  • Start with clear domain and goals
  • Low baseline scores are expected
  • Honest assessment is crucial
  • Evidence-driven evolution (not anticipatory design)
  • Convergence requires both V_instance ≥ 0.80 AND V_meta ≥ 0.80

Ready to start? Choose your domain, set up your experiment, and begin with Iteration 0!


Document Version: 1.0 (Iteration 0 Baseline)
Last Updated: 2025-10-19
Status: Initial version - will evolve based on user feedback