# Agent Performance Benchmarks

Quick reference for agent capabilities and performance metrics.

## Benchmark Scores

### Claude (Anthropic)
- **SWE-bench Verified:** 72.7%
- **Context Window:** 1,000,000 tokens (~750K words)
- **Speed:** Medium (slower than Codex, faster than research-oriented agents)
- **Cost:** Higher (premium quality)
- **Security Tasks:** 44% faster and 25% more accurate than competing agents

### Codex (OpenAI)
- **HumanEval:** 90.2%
- **SWE-bench:** 69.1%
- **Context Window:** ~128K tokens
- **Speed:** Fastest
- **Cost:** Medium

### Gemini (Google)
- **Context Window:** 2,000,000 tokens (largest available)
- **Speed:** Medium
- **Cost:** Most affordable
- **Specialization:** Web search, automation, content generation

## Capability Matrix
Relative scores per capability; higher is better.

| Capability | Claude | Codex | Gemini |
|------------|--------|-------|--------|
| Architecture | 95 | 60 | 65 |
| Code Generation | 75 | 95 | 70 |
| Refactoring | 90 | 65 | 70 |
| Security | 92 | 60 | 55 |
| Speed | 60 | 95 | 70 |
| Web Research | 50 | 45 | 95 |
| Automation | 60 | 70 | 95 |
| Cost Efficiency | 40 | 60 | 95 |
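For delegation logic that consumes this matrix programmatically, the scores can be mirrored as a plain lookup table. The Python sketch below is illustrative only; `CAPABILITY_SCORES` and `best_agent_for` are hypothetical names, not part of any existing Maestro tooling.

```python
# Hypothetical sketch: the capability matrix above as a lookup table.
# CAPABILITY_SCORES and best_agent_for are illustrative names, not an existing API.

CAPABILITY_SCORES = {
    "architecture":    {"claude": 95, "codex": 60, "gemini": 65},
    "code_generation": {"claude": 75, "codex": 95, "gemini": 70},
    "refactoring":     {"claude": 90, "codex": 65, "gemini": 70},
    "security":        {"claude": 92, "codex": 60, "gemini": 55},
    "speed":           {"claude": 60, "codex": 95, "gemini": 70},
    "web_research":    {"claude": 50, "codex": 45, "gemini": 95},
    "automation":      {"claude": 60, "codex": 70, "gemini": 95},
    "cost_efficiency": {"claude": 40, "codex": 60, "gemini": 95},
}


def best_agent_for(*capabilities: str) -> str:
    """Return the agent with the highest total score across the requested capabilities."""
    totals = {"claude": 0, "codex": 0, "gemini": 0}
    for cap in capabilities:
        for agent, score in CAPABILITY_SCORES[cap].items():
            totals[agent] += score
    return max(totals, key=totals.get)


# Example: a security-sensitive refactor scores highest for Claude (92 + 90).
print(best_agent_for("security", "refactoring"))  # -> claude
```

Weighting capabilities differently (for example, doubling `cost_efficiency` when the budget is tight) is a straightforward extension of the same table.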
## When to Use Each Agent

### Use Claude when:
- Task complexity is HIGH
- Security is critical
- Deep codebase analysis is needed
- Architecture decisions are required
- The budget allows for premium quality

### Use Codex when:
- Speed is important
- Code generation is the primary task
- Task complexity is LOW-MEDIUM
- Implementing from clear specifications
- Bug fixes are needed quickly

### Use Gemini when:
- Web research is required
- Browser automation is needed
- Workflows need to be automated
- Content generation is the goal
- The budget is constrained
- The task requires the largest context window
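As a rough sketch of how these rules of thumb could be encoded, assuming Python and the hypothetical names `TaskProfile` and `pick_agent` (they only restate the bullets above and are not an existing implementation):

```python
# Hypothetical sketch encoding the heuristics above; not an existing implementation.
from dataclasses import dataclass


@dataclass
class TaskProfile:
    complexity: str = "medium"         # "low" | "medium" | "high"
    security_critical: bool = False
    needs_web_research: bool = False
    needs_automation: bool = False
    budget_constrained: bool = False


def pick_agent(task: TaskProfile) -> str:
    # Claude: high-complexity, security-critical, or architecture-level work.
    if task.security_critical or task.complexity == "high":
        return "claude"
    # Gemini: web research, automation, or tight budgets.
    if task.needs_web_research or task.needs_automation or task.budget_constrained:
        return "gemini"
    # Codex: fast implementation for low-to-medium complexity tasks.
    return "codex"


# Example: a quick bug fix from a clear spec goes to Codex.
print(pick_agent(TaskProfile(complexity="low")))  # -> codex
```

In practice the choice also depends on context-window needs and cost limits, as summarized in the sections above.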
## Sources

- SWE-bench Verified: https://render.com/blog/ai-coding-agents-benchmark
- Claude Capabilities: https://www.anthropic.com/engineering/claude-code-best-practices
- Codex Performance: https://openai.com/index/introducing-codex/
- Comparison: https://www.codeant.ai/blogs/claude-code-cli-vs-codex-cli-vs-gemini-cli