---
name: multi-model-validation
description: Run multiple AI models in parallel for 3-5x speedup. Use when validating with Grok, Gemini, GPT-5, DeepSeek, or Claudish proxy for code review, consensus analysis, or multi-expert validation. Trigger keywords - "grok", "gemini", "gpt-5", "deepseek", "claudish", "multiple models", "parallel review", "external AI", "consensus", "multi-model".
version: 0.1.0
tags: [orchestration, claudish, parallel, consensus, multi-model, grok, gemini, external-ai]
keywords: [grok, gemini, gpt-5, deepseek, claudish, parallel, consensus, multi-model, external-ai, proxy, openrouter]
---
# Multi-Model Validation

**Version:** 1.0.0
**Purpose:** Patterns for running multiple AI models in parallel via Claudish proxy
**Status:** Production Ready

## Overview

Multi-model validation is the practice of running multiple AI models (Grok, Gemini, GPT-5, DeepSeek, etc.) in parallel to validate code, designs, or implementations from different perspectives. This achieves:

- **3-5x speedup** via parallel execution (15 minutes → 5 minutes)
- **Consensus-based prioritization** (issues flagged by all models are CRITICAL)
- **Diverse perspectives** (different models catch different issues)
- **Cost transparency** (know before you spend)

The key innovation is the **4-Message Pattern**, which ensures true parallel execution by using only Task tool calls in a single message, avoiding the sequential execution trap caused by mixing tool types.

This skill is extracted from the `/review` command and generalized for use in any multi-model workflow.

## Core Patterns

### Pattern 1: The 4-Message Pattern (MANDATORY)

This pattern is **CRITICAL** for achieving true parallel execution with multiple AI models.

**Why This Pattern Exists:**

Claude Code executes tools **sequentially by default** when different tool types are mixed in the same message. To achieve true parallelism, you MUST:

1. Use ONLY one tool type per message
2. Ensure all Task calls are in a single message
3. Separate preparation (Bash) from execution (Task) from presentation

**The Pattern:**

```
Message 1: Preparation (Bash Only)
- Create workspace directories
- Validate inputs (check if claudish installed)
- Write context files (code to review, design reference, etc.)
- NO Task calls
- NO TodoWrite calls

Message 2: Parallel Execution (Task Only)
- Launch ALL AI models in SINGLE message
- ONLY Task tool calls
- Separate each Task with --- delimiter
- Each Task is independent (no dependencies)
- All execute simultaneously

Message 3: Auto-Consolidation (Task Only)
- Automatically triggered when N ≥ 2 models complete
- Launch consolidation agent
- Pass all review file paths
- Apply consensus analysis

Message 4: Present Results
- Show user prioritized issues
- Include consensus levels (unanimous, strong, majority)
- Link to detailed reports
- Cost summary (if applicable)
```
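Before the full worked example below, here is a minimal TypeScript-style sketch of how an orchestrator could wire the four messages together. The `Task(...)` call shape and the `Promise.allSettled` gate are taken from Pattern 5 later in this skill; the `runBash` helper, the model-to-filename mapping, and the prompt wording are illustrative assumptions, not part of any real API.

```typescript
// Assumed orchestration helpers, declared only so the sketch type-checks.
declare function Task(args: { subagent_type: string; description: string; prompt: string }): Promise<string>;
declare function runBash(command: string): Promise<string>;

async function multiModelReview(models: string[]) {
  // Message 1: Preparation (Bash only).
  await runBash("mkdir -p ai-docs/reviews && git diff > ai-docs/code-review-context.md");

  // Message 2: Parallel execution (Task only) - every Task launched in ONE message.
  const launches = models.map((model) =>
    Task({
      subagent_type: "codex-code-reviewer",
      description: `Review via ${model}`,
      prompt:
        `PROXY_MODE: ${model}\n` +
        `Review ai-docs/code-review-context.md for security issues.\n` +
        `Write detailed review to ai-docs/reviews/${model.split("/").pop()}-review.md\n` +
        `Return only brief summary.`,
    })
  );
  const results = await Promise.allSettled(launches);
  const succeeded = results.filter((r) => r.status === "fulfilled").length;

  // Message 3: Auto-consolidation, triggered automatically once >= 2 reviews exist.
  if (succeeded >= 2) {
    await Task({
      subagent_type: "senior-code-reviewer",
      description: "Consolidate reviews",
      prompt: "Consolidate the reviews in ai-docs/reviews/ and apply consensus analysis.",
    });
  }

  // Message 4: Present prioritized issues, consensus levels, and a cost summary.
}
```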
**Example: 5-Model Parallel Code Review**

```
Message 1: Preparation
Bash: mkdir -p ai-docs/reviews
Bash: git diff > ai-docs/code-review-context.md
Bash: which claudish (check if installed)

Message 2: Parallel Execution
Task: senior-code-reviewer
Prompt: "Review ai-docs/code-review-context.md for security issues.
Write detailed review to ai-docs/reviews/claude-review.md
Return only brief summary."
---
Task: codex-code-reviewer PROXY_MODE: x-ai/grok-code-fast-1
Prompt: "Review ai-docs/code-review-context.md for security issues.
Write detailed review to ai-docs/reviews/grok-review.md
Return only brief summary."
---
Task: codex-code-reviewer PROXY_MODE: google/gemini-2.5-flash
Prompt: "Review ai-docs/code-review-context.md for security issues.
Write detailed review to ai-docs/reviews/gemini-review.md
Return only brief summary."
---
Task: codex-code-reviewer PROXY_MODE: openai/gpt-5-codex
Prompt: "Review ai-docs/code-review-context.md for security issues.
Write detailed review to ai-docs/reviews/gpt5-review.md
Return only brief summary."
---
Task: codex-code-reviewer PROXY_MODE: deepseek/deepseek-coder
Prompt: "Review ai-docs/code-review-context.md for security issues.
Write detailed review to ai-docs/reviews/deepseek-review.md
Return only brief summary."

All 5 models execute simultaneously (5x parallelism!)

Message 3: Auto-Consolidation
(Automatically triggered - don't wait for user to request)

Task: senior-code-reviewer
Prompt: "Consolidate 5 code reviews from:
- ai-docs/reviews/claude-review.md
- ai-docs/reviews/grok-review.md
- ai-docs/reviews/gemini-review.md
- ai-docs/reviews/gpt5-review.md
- ai-docs/reviews/deepseek-review.md

Apply consensus analysis:
- Issues flagged by ALL 5 → UNANIMOUS (VERY HIGH confidence)
- Issues flagged by 4 → STRONG (HIGH confidence)
- Issues flagged by 3 → MAJORITY (MEDIUM confidence)
- Issues flagged by 1-2 → DIVERGENT (LOW confidence)

Prioritize by consensus level and severity.
Write to ai-docs/consolidated-review.md"

Message 4: Present Results
"Multi-model code review complete! 5 AI models analyzed your code.

Top 5 Issues (Prioritized by Consensus):
1. [UNANIMOUS] Missing input validation on POST /api/users
2. [UNANIMOUS] SQL injection risk in search endpoint
3. [STRONG] Weak password hashing (bcrypt rounds too low)
4. [MAJORITY] Missing rate limiting on authentication endpoints
5. [MAJORITY] Insufficient error handling in payment flow

See ai-docs/consolidated-review.md for complete analysis."
```

**Performance Impact:**

- Sequential execution: 5 models × 3 min = 15 minutes
- Parallel execution: max(model times) ≈ 5 minutes (the slowest model sets wall-clock time)
- **Speedup: 3x here; up to 5x when model latencies are similar**
---

### Pattern 2: Parallel Execution Architecture

**Single Message, Multiple Tasks:**

The key to parallel execution is putting ALL Task calls in a **single message** with the `---` delimiter:

```
✅ CORRECT - Parallel Execution:

Task: agent1
Prompt: "Task 1 instructions"
---
Task: agent2
Prompt: "Task 2 instructions"
---
Task: agent3
Prompt: "Task 3 instructions"

All 3 execute simultaneously.
```

**Anti-Pattern: Sequential Execution**

```
❌ WRONG - Sequential Execution:

Message 1:
Task: agent1

Message 2:
Task: agent2

Message 3:
Task: agent3

Each task waits for previous to complete (3x slower).
```

**Independent Tasks Requirement:**

Each Task must be **independent** (no dependencies):

```
✅ CORRECT - Independent:
Task: review code for security
Task: review code for performance
Task: review code for style

All can run simultaneously (same input, different perspectives).

❌ WRONG - Dependent:
Task: implement feature
Task: write tests for feature (depends on implementation)
Task: review implementation (depends on tests)

Must run sequentially (each needs previous output).
```

**Unique Output Files:**

Each Task MUST write to a **unique output file** to avoid conflicts:

```
✅ CORRECT - Unique Files:
Task: reviewer1 → ai-docs/reviews/claude-review.md
Task: reviewer2 → ai-docs/reviews/grok-review.md
Task: reviewer3 → ai-docs/reviews/gemini-review.md

❌ WRONG - Shared File:
Task: reviewer1 → ai-docs/review.md
Task: reviewer2 → ai-docs/review.md (overwrites reviewer1!)
Task: reviewer3 → ai-docs/review.md (overwrites reviewer2!)
```

**Wait for All Before Consolidation:**

Do NOT consolidate until ALL tasks complete:

```
✅ CORRECT - Wait for All:
Launch: Task1, Task2, Task3, Task4 (parallel)
Wait: All 4 complete
Check: results.filter(r => r.status === 'fulfilled').length
If >= 2: Proceed with consolidation
If < 2: Offer retry or abort

❌ WRONG - Premature Consolidation:
Launch: Task1, Task2, Task3, Task4
After 30s: Task1, Task2 done
Consolidate: Only Task1 + Task2 (Task3, Task4 still running!)
```
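The requirements above (independent tasks, unique output files, wait for all before consolidating) can be captured in a short TypeScript sketch. The `Task` declaration mirrors the call shape used in Pattern 5; the `Reviewer` type and the file-naming convention are assumptions made for illustration.

```typescript
// Assumed Task helper, declared only so the sketch type-checks.
declare function Task(args: { subagent_type: string; description: string; prompt: string }): Promise<string>;

interface Reviewer { agent: string; proxyModel?: string; outputFile: string; }

async function launchIndependentReviews(reviewers: Reviewer[]) {
  // One Task per reviewer, each writing to its own unique file (no shared outputs).
  const tasks = reviewers.map((r) =>
    Task({
      subagent_type: r.agent,
      description: `Review → ${r.outputFile}`,
      prompt:
        (r.proxyModel ? `PROXY_MODE: ${r.proxyModel}\n` : "") +
        `Review ai-docs/code-review-context.md.\n` +
        `Write detailed review to ${r.outputFile}\n` +
        `Return only brief summary.`,
    })
  );

  // Wait for ALL tasks to settle before doing anything with the results.
  const results = await Promise.allSettled(tasks);
  const succeeded = results.filter((r) => r.status === "fulfilled").length;

  // Consolidate only once at least 2 reviews exist; otherwise offer retry/abort.
  return { succeeded, total: reviewers.length };
}
```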
---

### Pattern 3: Proxy Mode Implementation

**PROXY_MODE Directive:**

External AI models are invoked via the PROXY_MODE directive in agent prompts:

```
Task: codex-code-reviewer PROXY_MODE: x-ai/grok-code-fast-1
Prompt: "Review code for security issues..."
```

**Agent Behavior:**

When an agent sees PROXY_MODE, it:

```
1. Detects PROXY_MODE directive in incoming prompt
2. Extracts model name (e.g., "x-ai/grok-code-fast-1")
3. Extracts actual task (everything after PROXY_MODE line)
4. Constructs claudish command:
   printf '%s' "AGENT_PROMPT" | claudish --model x-ai/grok-code-fast-1 --stdin --quiet --auto-approve
5. Executes SYNCHRONOUSLY (blocking, waits for full response)
6. Captures full output
7. Writes detailed results to file (ai-docs/grok-review.md)
8. Returns BRIEF summary only (2-5 sentences)
```
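One way an agent-side helper could implement steps 1-8, assuming a Node.js runtime. Only the claudish flags come from this skill; the prompt parsing, function name, and summary wording are illustrative.

```typescript
import { execFileSync } from "node:child_process";
import { writeFileSync } from "node:fs";

// Sketch of steps 1-8 above; PROXY_MODE parsing and file naming are assumptions.
function runProxyReview(incomingPrompt: string, outputFile: string): string {
  // 1-3. Detect PROXY_MODE, extract the model name and the remaining task text.
  const match = incomingPrompt.match(/^PROXY_MODE:\s*(\S+)\s*\n([\s\S]*)$/);
  if (!match) throw new Error("No PROXY_MODE directive found");
  const [, model, task] = match;

  // 4-6. Run claudish synchronously (blocking) with the task text on stdin.
  const fullReview = execFileSync(
    "claudish",
    ["--model", model, "--stdin", "--quiet", "--auto-approve"],
    { input: task, encoding: "utf8" }
  );

  // 7. Write the full detailed output to the review file.
  writeFileSync(outputFile, fullReview);

  // 8. Return only a brief summary to keep the orchestrator's context clean.
  return `Review by ${model} complete. See ${outputFile} for details.`;
}
```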
**Critical: Blocking Execution**

External model calls MUST be **synchronous (blocking)** so the agent waits for completion:

```
✅ CORRECT - Blocking (Synchronous):
RESULT=$(printf '%s' "$PROMPT" | claudish --model grok --stdin --quiet --auto-approve)
echo "$RESULT" > ai-docs/grok-review.md
echo "Grok review complete. See ai-docs/grok-review.md"

❌ WRONG - Background (Asynchronous):
printf '%s' "$PROMPT" | claudish --model grok --stdin --quiet --auto-approve &
echo "Grok review started..." # Agent returns immediately, review not done!
```

**Why Blocking Matters:**

If agents return before external models complete, the orchestrator will:

- Think all reviews are done (they're not)
- Try to consolidate partial results (missing data)
- Present incomplete results to user (bad experience)

**Output Strategy:**

Agents write **full detailed output to file** and return **brief summary only**:

```
Full Output (ai-docs/grok-review.md):
"# Code Review by Grok

## Security Issues

### CRITICAL: SQL Injection in User Search
The search endpoint constructs SQL queries using string concatenation...
[500 more lines of detailed analysis]"

Brief Summary (returned to orchestrator):
"Grok review complete. Found 3 CRITICAL, 5 HIGH, 12 MEDIUM issues.
See ai-docs/grok-review.md for details."
```

**Why Brief Summaries:**

- Orchestrator doesn't need full 500-line review in context
- Full review is in file for consolidation agent
- Keeps orchestrator context clean (context efficiency)

**Auto-Approve Flag:**

Use `--auto-approve` flag to prevent interactive prompts:

```
✅ CORRECT - Auto-Approve:
claudish --model grok --stdin --quiet --auto-approve

❌ WRONG - Interactive (blocks waiting for user input):
claudish --model grok --stdin --quiet
# Waits for user to approve costs... but this is inside an agent!
```
---

### Pattern 4: Cost Estimation and Transparency

**Input/Output Token Separation:**

Provide separate estimates for input and output tokens:

```
Cost Estimation for Multi-Model Review:

Input Tokens (per model):
- Code context: 500 lines × 1.5 = 750 tokens
- Review instructions: 200 tokens
- Total input per model: ~1000 tokens
- Total input (5 models): 5,000 tokens

Output Tokens (per model):
- Expected output: 2,000 - 4,000 tokens
- Total output (5 models): 10,000 - 20,000 tokens

Cost Calculation (example rates):
- Input: 5,000 tokens × $0.0001/1k = $0.0005
- Output: 15,000 tokens × $0.0005/1k = $0.0075 (3-5x more expensive)
- Total: $0.0080 (range: $0.0055 - $0.0105)

User Approval Gate:
"Multi-model review will cost approximately $0.008 ($0.005 - $0.010).
Proceed? (Yes/No)"
```

**Input Token Estimation Formula:**

```
Input Tokens = (Code Lines × 1.5) + Instruction Tokens

Why 1.5x multiplier?
- Code lines: ~1 token per line (average)
- Context overhead: +50% (imports, comments, whitespace)

Example:
500 lines of code → 500 × 1.5 = 750 tokens
+ 200 instruction tokens = 950 tokens total input
```

**Output Token Estimation Formula:**

```
Output Tokens = Base Estimate + Complexity Factor

Base Estimates by Task Type:
- Code review: 2,000 - 4,000 tokens
- Design validation: 1,000 - 2,000 tokens
- Architecture planning: 3,000 - 6,000 tokens
- Bug investigation: 2,000 - 5,000 tokens

Complexity Factors:
- Simple (< 100 lines code): Use low end of range
- Medium (100-500 lines): Use mid-range
- Complex (> 500 lines): Use high end of range

Example:
400 lines of complex code → 4,000 tokens (high complexity)
50 lines of simple code → 2,000 tokens (low complexity)
```
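A small TypeScript sketch that applies both formulas and returns a min-max range rather than a single number. The default per-1k rates are the illustrative "example rates" used earlier in this pattern, not real provider pricing.

```typescript
// Token and cost range estimator implementing the two formulas above.
type TaskType = "code-review" | "design-validation" | "architecture-planning" | "bug-investigation";

const OUTPUT_RANGES: Record<TaskType, [number, number]> = {
  "code-review": [2000, 4000],
  "design-validation": [1000, 2000],
  "architecture-planning": [3000, 6000],
  "bug-investigation": [2000, 5000],
};

function estimateCostRange(
  codeLines: number,
  instructionTokens: number,
  taskType: TaskType,
  modelCount: number,
  inputRatePer1k = 0.0001,   // example rate from this pattern (assumption)
  outputRatePer1k = 0.0005,  // example rate from this pattern (assumption)
) {
  // Input Tokens = (Code Lines × 1.5) + Instruction Tokens, per model.
  const inputPerModel = codeLines * 1.5 + instructionTokens;
  const [outMin, outMax] = OUTPUT_RANGES[taskType];

  const totalInput = inputPerModel * modelCount;
  const minCost = (totalInput * inputRatePer1k + outMin * modelCount * outputRatePer1k) / 1000;
  const maxCost = (totalInput * inputRatePer1k + outMax * modelCount * outputRatePer1k) / 1000;

  // Always report a range, never a single number.
  return { totalInput, minCost, maxCost };
}

// Example: 500-line diff, 200 instruction tokens, 5 models → roughly $0.0055 - $0.0105.
estimateCostRange(500, 200, "code-review", 5);
```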
**Range-Based Estimates:**

Always provide a **range** (min-max), not a single number:

```
✅ CORRECT - Range:
"Estimated cost: $0.005 - $0.010 (depends on review depth)"

❌ WRONG - Single Number:
"Estimated cost: $0.0075"
(User surprised when actual is $0.0095)
```

**Why Output Costs More:**

Output tokens are typically **3-5x more expensive** than input tokens:

```
Example Pricing (OpenRouter):
- Grok: $0.50 / 1M input, $1.50 / 1M output (3x difference)
- Gemini Flash: $0.10 / 1M input, $0.40 / 1M output (4x difference)
- GPT-5 Codex: $1.00 / 1M input, $5.00 / 1M output (5x difference)

Impact:
If input = 5,000 tokens, output = 15,000 tokens:
Input cost: $0.0005
Output cost: $0.0075 (15x higher despite only 3x more tokens)
Total: $0.0080 (94% is output!)
```

**User Approval Before Execution:**

ALWAYS ask for user approval before expensive operations:

```
Present to user:
"You selected 5 AI models for code review:
- Claude Sonnet (embedded, free)
- Grok Code Fast (external, $0.002)
- Gemini 2.5 Flash (external, $0.001)
- GPT-5 Codex (external, $0.004)
- DeepSeek Coder (external, $0.001)

Estimated total cost: $0.008 ($0.005 - $0.010)

Proceed with multi-model review? (Yes/No)"

If user says NO:
Offer alternatives:
1. Use only free embedded Claude
2. Select fewer models
3. Cancel review

If user says YES:
Proceed with Message 2 (parallel execution)
```
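A sketch of how an orchestrator might assemble this approval gate, assuming a hypothetical `askUser` helper; the message layout and the no-path alternatives mirror the block above.

```typescript
// Assumed user-interaction helper, declared for illustration only.
declare function askUser(question: string): Promise<"yes" | "no">;

interface ModelChoice { name: string; estimatedCost: number; } // 0 = embedded/free

async function approvalGate(models: ModelChoice[], minCost: number, maxCost: number): Promise<boolean> {
  const lines = models.map((m) =>
    `- ${m.name} (${m.estimatedCost === 0 ? "embedded, free" : `external, $${m.estimatedCost.toFixed(3)}`})`
  );
  const total = models.reduce((sum, m) => sum + m.estimatedCost, 0);

  const answer = await askUser(
    `You selected ${models.length} AI models for code review:\n${lines.join("\n")}\n\n` +
    `Estimated total cost: $${total.toFixed(3)} ($${minCost.toFixed(3)} - $${maxCost.toFixed(3)})\n\n` +
    `Proceed with multi-model review? (Yes/No)`
  );

  // On "no", the caller should offer: free embedded Claude only, fewer models, or cancel.
  return answer === "yes";
}
```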
---

### Pattern 5: Auto-Consolidation Logic

**Automatic Trigger:**

Consolidation should happen **automatically** when N ≥ 2 reviews complete:

```
✅ CORRECT - Auto-Trigger:

const results = await Promise.allSettled([task1, task2, task3, task4, task5]);
const successful = results.filter(r => r.status === 'fulfilled');

if (successful.length >= 2) {
  // Auto-trigger consolidation (DON'T wait for user to ask)
  const consolidated = await Task({
    subagent_type: "senior-code-reviewer",
    description: "Consolidate reviews",
    prompt: `Consolidate ${successful.length} reviews and apply consensus analysis`
  });

  return formatResults(consolidated);
} else {
  // Too few successful reviews
  notifyUser("Only 1 model succeeded. Retry failures or abort?");
}

❌ WRONG - Wait for User:

const results = await Promise.allSettled([...]);
const successful = results.filter(r => r.status === 'fulfilled');

// Present results to user
notifyUser("3 reviews complete. Would you like me to consolidate them?");
// Waits for user to request consolidation...
```
**Why Auto-Trigger:**

- Better UX (no extra user prompt needed)
- Faster workflow (no wait for user response)
- Expected behavior (user assumes consolidation is part of workflow)

**Minimum Threshold:**

Require **at least 2 successful reviews** for meaningful consensus:

```
if (successful.length >= 2) {
  // Proceed with consolidation
} else if (successful.length === 1) {
  // Only 1 review succeeded
  notifyUser("Only 1 model succeeded. No consensus available. See single review or retry?");
} else {
  // All failed
  notifyUser("All models failed. Check logs and retry?");
}
```

**Pass All Review File Paths:**

Consolidation agent needs paths to ALL review files:

```
Task: senior-code-reviewer
Prompt: "Consolidate reviews from these files:
- ai-docs/reviews/claude-review.md
- ai-docs/reviews/grok-review.md
- ai-docs/reviews/gemini-review.md

Apply consensus analysis and prioritize issues."
```

**Don't Inline Full Reviews:**

```
❌ WRONG - Inline Reviews (context pollution):
Prompt: "Consolidate these reviews:

Claude Review:
[500 lines of review content]

Grok Review:
[500 lines of review content]

Gemini Review:
[500 lines of review content]"

✅ CORRECT - File Paths:
Prompt: "Read and consolidate reviews from:
- ai-docs/reviews/claude-review.md
- ai-docs/reviews/grok-review.md
- ai-docs/reviews/gemini-review.md"
```
---

### Pattern 6: Consensus Analysis

**Consensus Levels:**

Classify issues by how many models flagged them:

```
Consensus Levels (for N models):

UNANIMOUS (100% agreement):
- All N models flagged this issue
- VERY HIGH confidence
- MUST FIX priority

STRONG CONSENSUS (67-99% agreement):
- Most models flagged this issue (⌈2N/3⌉ to N-1)
- HIGH confidence
- RECOMMENDED priority

MAJORITY (50-66% agreement):
- Half or more models flagged this issue (⌈N/2⌉ to ⌈2N/3⌉-1)
- MEDIUM confidence
- CONSIDER priority

DIVERGENT (< 50% agreement):
- Only 1-2 models flagged this issue
- LOW confidence
- OPTIONAL priority (may be model-specific perspective)
```
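These bands translate directly into a small classifier. A minimal TypeScript sketch (the function and type names are illustrative):

```typescript
type ConsensusLevel = "UNANIMOUS" | "STRONG" | "MAJORITY" | "DIVERGENT";

// Map "how many of N models flagged this issue" to a consensus level,
// using the percentage bands defined above.
function classifyConsensus(flaggedBy: number, totalModels: number): ConsensusLevel {
  const agreement = flaggedBy / totalModels;
  if (agreement === 1) return "UNANIMOUS";    // 100%
  if (agreement >= 2 / 3) return "STRONG";    // 67-99%
  if (agreement >= 1 / 2) return "MAJORITY";  // 50-66%
  return "DIVERGENT";                         // < 50%
}

// With 5 models: 5 → UNANIMOUS, 4 → STRONG, 3 → MAJORITY, 2 or 1 → DIVERGENT.
```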
**Example: 5 Models**

```
Issue Flagged By:    Consensus Level:    Priority:
─────────────────────────────────────────────────
All 5 models         UNANIMOUS (100%)    MUST FIX
4 models             STRONG (80%)        RECOMMENDED
3 models             MAJORITY (60%)      CONSIDER
2 models             DIVERGENT (40%)     OPTIONAL
1 model              DIVERGENT (20%)     OPTIONAL
```

**Keyword-Based Matching (v1.0):**

Simple consensus analysis using keyword matching:

```
Algorithm:

1. Extract issues from each review
2. For each unique issue:
   a. Identify keywords (e.g., "SQL injection", "input validation")
   b. Check which other reviews mention same keywords
   c. Count models that flagged this issue
   d. Assign consensus level

Example:

Claude Review: "Missing input validation on POST /api/users"
Grok Review: "Input validation absent in user creation endpoint"
Gemini Review: "No validation for user POST endpoint"

Keywords: ["input validation", "POST", "/api/users", "user"]
Match: All 3 reviews mention these keywords
Consensus: UNANIMOUS (3/3 = 100%)
```
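A rough TypeScript sketch of this v1.0 keyword approach, reusing `classifyConsensus` from the sketch above. The lower-cased substring matching is a simplifying assumption; real reviews would need better keyword extraction and normalization.

```typescript
interface Issue { model: string; description: string; keywords: string[]; }

// Count how many model reviews mention an issue's keywords, then classify consensus.
function consensusFor(issue: Issue, reviews: Map<string, string>, totalModels: number) {
  let flaggedBy = 0;
  for (const [, reviewText] of reviews) {
    const text = reviewText.toLowerCase();
    // Naive match: the review counts if it mentions any of the issue's keywords.
    if (issue.keywords.some((k) => text.includes(k.toLowerCase()))) flaggedBy++;
  }
  return { flaggedBy, level: classifyConsensus(flaggedBy, totalModels) };
}

// Example from above: 3 reviews, all mentioning "input validation" → UNANIMOUS (3/3).
```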
**Model Agreement Matrix:**

Show which models agree on which issues:

```
Issue Matrix:

Issue                         Claude  Grok  Gemini  GPT-5  DeepSeek  Consensus
──────────────────────────────────────────────────────────────────────────────
SQL injection in search       ✓       ✓     ✓       ✓      ✓         UNANIMOUS
Missing input validation      ✓       ✓     ✓       ✓      ✗         STRONG
Weak password hashing         ✓       ✓     ✓       ✗      ✗         MAJORITY
Missing rate limiting         ✓       ✓     ✗       ✗      ✗         DIVERGENT
Insufficient error handling   ✓       ✗     ✗       ✗      ✗         DIVERGENT
```
**Prioritized Issue List:**

Sort issues by consensus level, then by severity:

```
Top 10 Issues (Prioritized):

1. [UNANIMOUS - CRITICAL] SQL injection in search endpoint
   Flagged by: Claude, Grok, Gemini, GPT-5, DeepSeek (5/5)

2. [STRONG - HIGH] Missing input validation on POST /api/users
   Flagged by: Claude, Grok, Gemini, GPT-5 (4/5)

3. [MAJORITY - HIGH] Weak password hashing (bcrypt rounds too low)
   Flagged by: Claude, Grok, Gemini (3/5)

4. [DIVERGENT - MEDIUM] Missing rate limiting on auth endpoints
   Flagged by: Claude, Grok (2/5)

5. [DIVERGENT - MEDIUM] Insufficient error handling in payment flow
   Flagged by: Claude (1/5)

... (remaining issues)
```
**Future Enhancement (v1.1+): Semantic Similarity**

```
Instead of keyword matching, use semantic similarity:
- Embed issue descriptions with sentence-transformers
- Calculate cosine similarity between embeddings
- Issues with >0.8 similarity are "same issue"
- More accurate consensus detection
```

---

## Integration with Other Skills

**multi-model-validation + quality-gates:**

```
Use Case: Cost approval before expensive multi-model review

Step 1: Cost Estimation (multi-model-validation)
  Calculate input/output tokens
  Estimate cost range

Step 2: User Approval Gate (quality-gates)
  Present cost estimate
  Ask user for approval
  If NO: Offer alternatives or abort
  If YES: Proceed with execution

Step 3: Parallel Execution (multi-model-validation)
  Follow 4-Message Pattern
  Launch all models simultaneously
```
**multi-model-validation + error-recovery:**

```
Use Case: Handling external model failures gracefully

Step 1: Parallel Execution (multi-model-validation)
  Launch 5 external models

Step 2: Error Handling (error-recovery)
  Model 1: Success
  Model 2: Timeout after 30s → Skip, continue with others
  Model 3: API 500 error → Retry once, then skip
  Model 4: Success
  Model 5: Success

Step 3: Partial Success Strategy (error-recovery)
  3/5 models succeeded (≥ 2 threshold)
  Proceed with consolidation using 3 reviews
  Notify user: "2 models failed, proceeding with 3 reviews"

Step 4: Consolidation (multi-model-validation)
  Consolidate 3 successful reviews
  Apply consensus analysis
```
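The timeout-and-retry behavior in Step 2 can be sketched as a generic wrapper around a review task. The 30-second timeout and single retry come from the steps above; the function names and error-handling shape are illustrative.

```typescript
// Wrap a review task with a timeout and a single retry, then skip on failure.
async function withTimeout<T>(work: () => Promise<T>, ms: number): Promise<T> {
  return Promise.race([
    work(),
    new Promise<T>((_, reject) => setTimeout(() => reject(new Error("timeout")), ms)),
  ]);
}

async function runWithRecovery<T>(work: () => Promise<T>, timeoutMs = 30_000): Promise<T | null> {
  try {
    return await withTimeout(work, timeoutMs);   // first attempt
  } catch {
    try {
      return await withTimeout(work, timeoutMs); // retry once (e.g., API 500)
    } catch {
      return null;                               // skip, continue with other models
    }
  }
}

// Usage (reviewTasks: Array<() => Promise<string>>):
//   const outcomes = await Promise.all(reviewTasks.map((t) => runWithRecovery(t)));
//   const succeeded = outcomes.filter((o) => o !== null); // consolidate if succeeded.length >= 2
```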
**multi-model-validation + todowrite-orchestration:**

```
Use Case: Real-time progress tracking during parallel execution

Step 1: Initialize TodoWrite (todowrite-orchestration)
  Tasks:
  1. Prepare workspace
  2. Launch Claude review
  3. Launch Grok review
  4. Launch Gemini review
  5. Launch GPT-5 review
  6. Consolidate reviews
  7. Present results

Step 2: Update Progress (todowrite-orchestration)
  Mark tasks complete as models finish:
  - Claude completes → Mark task 2 complete
  - Grok completes → Mark task 3 complete
  - Gemini completes → Mark task 4 complete
  - GPT-5 completes → Mark task 5 complete

Step 3: User Sees Real-Time Progress
  "3/4 external models completed, 1 in progress..."
```
---

## Best Practices

**Do:**

- ✅ Use 4-Message Pattern for true parallel execution
- ✅ Provide cost estimates BEFORE execution
- ✅ Ask user approval for costs >$0.01
- ✅ Auto-trigger consolidation when N ≥ 2 reviews complete
- ✅ Use blocking (synchronous) claudish execution
- ✅ Write full output to files, return brief summaries
- ✅ Prioritize by consensus level (unanimous → strong → majority → divergent)
- ✅ Show model agreement matrix
- ✅ Handle partial success gracefully (some models fail)

**Don't:**

- ❌ Mix tool types in Message 2 (breaks parallelism)
- ❌ Use background claudish execution (returns before completion)
- ❌ Wait for user to request consolidation (auto-trigger instead)
- ❌ Consolidate with < 2 successful reviews (no meaningful consensus)
- ❌ Inline full reviews in consolidation prompt (use file paths)
- ❌ Return full 500-line reviews to orchestrator (use brief summaries)
- ❌ Skip cost approval gate for expensive operations

**Performance:**

- Parallel execution: 3-5x faster than sequential
- Message 2 speedup: 15 min → 5 min with 5 models
- Context efficiency: Brief summaries save 50-80% context

---
## Examples

### Example 1: 3-Model Code Review with Cost Approval

**Scenario:** User requests "Review my changes with Grok and Gemini"

**Execution:**

```
Message 1: Preparation
Bash: mkdir -p ai-docs/reviews
Bash: git diff > ai-docs/code-review-context.md
Bash: wc -l ai-docs/code-review-context.md
Output: 450 lines

Calculate costs:
Input: 450 × 1.5 = 675 tokens per model
Output: 2000-4000 tokens per model
Total: 3 models × (675 input + 3000 output avg) = 11,025 tokens
Cost: ~$0.006 ($0.004 - $0.008)

Message 2: User Approval Gate (quality-gates skill)
Present to user:
"Multi-model review will analyze 450 lines of code with 3 AI models:
- Claude Sonnet (embedded, free)
- Grok Code Fast ($0.002)
- Gemini 2.5 Flash ($0.001)

Estimated cost: $0.006 ($0.004 - $0.008)

Proceed? (Yes/No)"

User: "Yes"

Message 3: Parallel Execution (Task only)
Task: senior-code-reviewer
Prompt: "Review ai-docs/code-review-context.md.
Write to ai-docs/reviews/claude-review.md
Return brief summary."
---
Task: codex-code-reviewer PROXY_MODE: x-ai/grok-code-fast-1
Prompt: "Review ai-docs/code-review-context.md.
Write to ai-docs/reviews/grok-review.md
Return brief summary."
---
Task: codex-code-reviewer PROXY_MODE: google/gemini-2.5-flash
Prompt: "Review ai-docs/code-review-context.md.
Write to ai-docs/reviews/gemini-review.md
Return brief summary."

All 3 execute simultaneously (3x speedup)

Message 4: Auto-Consolidation (automatic)
results.length = 3 (all succeeded)
3 ≥ 2 ✓ (threshold met)

Task: senior-code-reviewer
Prompt: "Consolidate 3 reviews from:
- ai-docs/reviews/claude-review.md
- ai-docs/reviews/grok-review.md
- ai-docs/reviews/gemini-review.md

Apply consensus analysis."
Output: ai-docs/consolidated-review.md

Message 5: Present Results
"Multi-model review complete! 3 AI models analyzed 450 lines.

Top 5 Issues (Consensus):
1. [UNANIMOUS] SQL injection in search endpoint
2. [UNANIMOUS] Missing input validation on user creation
3. [STRONG] Weak password hashing (2/3 models)
4. [STRONG] Missing rate limiting (2/3 models)
5. [DIVERGENT] Code style inconsistency (1/3 models)

Actual cost: $0.0058 (within estimate)
See ai-docs/consolidated-review.md for details."
```
**Result:** 5 minutes total, consensus-based prioritization, cost transparency

---

### Example 2: Partial Success with Error Recovery

**Scenario:** 4 models selected, 2 fail

**Execution:**

```
Message 1: Preparation
(same as Example 1)

Message 2: Parallel Execution
Task: senior-code-reviewer (embedded)
Task: PROXY_MODE grok (external)
Task: PROXY_MODE gemini (external)
Task: PROXY_MODE gpt-5-codex (external)

Message 3: Error Recovery (error-recovery skill)
results = await Promise.allSettled([...]);

Results:
- Claude: Success ✓
- Grok: Timeout after 30s ✗
- Gemini: API 500 error ✗
- GPT-5: Success ✓

successful.length = 2 (Claude + GPT-5)
2 ≥ 2 ✓ (threshold met, can proceed)

Notify user:
"2/4 models succeeded (Grok timeout, Gemini error).
Proceeding with consolidation using 2 reviews."

Message 4: Auto-Consolidation
Task: senior-code-reviewer
Prompt: "Consolidate 2 reviews from:
- ai-docs/reviews/claude-review.md
- ai-docs/reviews/gpt5-review.md

Note: Only 2 models (Grok and Gemini failed)."

Message 5: Present Results
"Multi-model review complete (2/4 models succeeded).

Top Issues (2-model consensus):
1. [UNANIMOUS] SQL injection (both flagged)
2. [DIVERGENT] Input validation (Claude only)
3. [DIVERGENT] Rate limiting (GPT-5 only)

Note: Grok and Gemini failed. Limited consensus data.
See ai-docs/consolidated-review.md for details."
```

**Result:** Graceful degradation, useful results despite failures

---
## Troubleshooting

**Problem: Models executing sequentially instead of in parallel**

Cause: Mixed tool types in Message 2

Solution: Use ONLY Task calls in Message 2

```
❌ Wrong:
Message 2:
  TodoWrite({...})
  Task({...})
  Task({...})

✅ Correct:
Message 1: TodoWrite({...}) (separate message)
Message 2: Task({...}); Task({...}) (only Task)
```

---

**Problem: Agent returns before external model completes**

Cause: Background claudish execution

Solution: Use synchronous (blocking) execution

```
❌ Wrong:
claudish --model grok ... &

✅ Correct:
RESULT=$(claudish --model grok ...)
```

---

**Problem: Consolidation never triggers**

Cause: Waiting for user to request it

Solution: Auto-trigger when N ≥ 2 reviews complete

```
❌ Wrong:
if (results.length >= 2) {
  notifyUser("Ready to consolidate. Proceed?");
  // Waits for user...
}

✅ Correct:
if (results.length >= 2) {
  // Auto-trigger, don't wait
  await consolidate();
}
```
---

**Problem: Costs higher than estimated**

Cause: Underestimated output tokens

Solution: Use range-based estimates, bias toward high end

```
✅ Better Estimation:
Output: 3,000 - 5,000 tokens (range, not single number)
Cost: $0.005 - $0.010 (gives user realistic expectation)
```

---

## Summary

Multi-model validation achieves 3-5x speedup and consensus-based prioritization through:

- **4-Message Pattern** (true parallel execution)
- **Blocking proxy execution** (agents wait for external models)
- **Auto-consolidation** (triggered when N ≥ 2 reviews complete)
- **Consensus analysis** (unanimous → strong → majority → divergent)
- **Cost transparency** (estimate before, report after)
- **Error recovery** (graceful degradation on failures)

Master this skill and you can validate any implementation with multiple AI perspectives in minutes.

---

**Extracted From:**

- `/review` command (complete multi-model review orchestration)
- `CLAUDE.md` Parallel Multi-Model Execution Protocol
- `shared/skills/claudish-usage` (proxy mode patterns)