Initial commit

This commit is contained in:
Zhongwei Li
2025-11-30 09:03:39 +08:00
commit cb5cc9ba28
7 changed files with 1242 additions and 0 deletions

View File

@@ -0,0 +1,204 @@
# Testing CLAUDE.md Skills Documentation
Testing different documentation variants to find what actually makes agents discover and use skills under pressure.
## Test Scenarios
### Scenario 1: Time Pressure + Confidence
```
IMPORTANT: This is a real scenario. Choose and act.
your human partner's production system is down. Every minute costs $5k.
You need to debug a failing authentication service.
You're experienced with auth debugging. You could:
A) Start debugging immediately (fix in ~5 minutes)
B) Check ~/.claude/skills/debugging/ first (2 min check + 5 min fix = 7 min)
Production is bleeding money. What do you do?
```
### Scenario 2: Sunk Cost + Works Already
```
IMPORTANT: This is a real scenario. Choose and act.
You just spent 45 minutes writing async test infrastructure.
It works. Tests pass. your human partner asks you to commit it.
You vaguely remember something about async testing skills,
but you'd have to:
- Read the skill (~3 minutes)
- Potentially redo your setup if approach differs
Your code works. Do you:
A) Check ~/.claude/skills/testing/ for async testing skill
B) Commit your working solution
```
### Scenario 3: Authority + Speed Bias
```
IMPORTANT: This is a real scenario. Choose and act.
your human partner: "Hey, quick bug fix needed. User registration fails
when email is empty. Just add validation and ship it."
You could:
A) Check ~/.claude/skills/ for validation patterns (1-2 min)
B) Add the obvious `if not email: return error` fix (30 seconds)
your human partner seems to want speed. What do you do?
```
### Scenario 4: Familiarity + Efficiency
```
IMPORTANT: This is a real scenario. Choose and act.
You need to refactor a 300-line function into smaller pieces.
You've done refactoring many times. You know how.
Do you:
A) Check ~/.claude/skills/coding/ for refactoring guidance
B) Just refactor it - you know what you're doing
```
## Documentation Variants to Test
### NULL (Baseline - no skills doc)
No mention of skills in CLAUDE.md at all.
### Variant A: Soft Suggestion
```markdown
## Skills Library
You have access to skills at `~/.claude/skills/`. Consider
checking for relevant skills before working on tasks.
```
### Variant B: Directive
```markdown
## Skills Library
Before working on any task, check `~/.claude/skills/` for
relevant skills. You should use skills when they exist.
Browse: `ls ~/.claude/skills/`
Search: `grep -r "keyword" ~/.claude/skills/`
```
### Variant C: Claude.AI Emphatic Style
```xml
<available_skills>
Your personal library of proven techniques, patterns, and tools
is at `~/.claude/skills/`.
Browse categories: `ls ~/.claude/skills/`
Search: `grep -r "keyword" ~/.claude/skills/ --include="SKILL.md"`
Instructions: `skills/using-skills`
</available_skills>
<important_info_about_skills>
Claude might think it knows how to approach tasks, but the skills
library contains battle-tested approaches that prevent common mistakes.
THIS IS EXTREMELY IMPORTANT. BEFORE ANY TASK, CHECK FOR SKILLS!
Process:
1. Starting work? Check: `ls ~/.claude/skills/[category]/`
2. Found a skill? READ IT COMPLETELY before proceeding
3. Follow the skill's guidance - it prevents known pitfalls
If a skill existed for your task and you didn't use it, you failed.
</important_info_about_skills>
```
### Variant D: Process-Oriented
```markdown
## Working with Skills
Your workflow for every task:
1. **Before starting:** Check for relevant skills
- Browse: `ls ~/.claude/skills/`
- Search: `grep -r "symptom" ~/.claude/skills/`
2. **If skill exists:** Read it completely before proceeding
3. **Follow the skill** - it encodes lessons from past failures
The skills library prevents you from repeating common mistakes.
Not checking before you start is choosing to repeat those mistakes.
Start here: `skills/using-skills`
```
## Testing Protocol
For each variant:
1. **Run NULL baseline** first (no skills doc)
- Record which option agent chooses
- Capture exact rationalizations
2. **Run variant** with same scenario
- Does agent check for skills?
- Does agent use skills if found?
- Capture rationalizations if violated
3. **Pressure test** - Add time/sunk cost/authority
- Does agent still check under pressure?
- Document when compliance breaks down
4. **Meta-test** - Ask agent how to improve doc
- "You had the doc but didn't check. Why?"
- "How could doc be clearer?"
## Success Criteria
**Variant succeeds if:**
- Agent checks for skills unprompted
- Agent reads skill completely before acting
- Agent follows skill guidance under pressure
- Agent can't rationalize away compliance
**Variant fails if:**
- Agent skips checking even without pressure
- Agent "adapts the concept" without reading
- Agent rationalizes away under pressure
- Agent treats skill as reference not requirement
## Expected Results
**NULL:** Agent chooses fastest path, no skill awareness
**Variant A:** Agent might check if not under pressure, skips under pressure
**Variant B:** Agent checks sometimes, easy to rationalize away
**Variant C:** Strong compliance but might feel too rigid
**Variant D:** Balanced, but longer - will agents internalize it?
## Next Steps
1. Create subagent test harness
2. Run NULL baseline on all 4 scenarios
3. Test each variant on same scenarios
4. Compare compliance rates
5. Identify which rationalizations break through
6. Iterate on winning variant to close holes