
---
description: Evaluate GitHub issue quality and completeness for agent implementation
argument-hint: issue number or URL
---

# EVI - Evaluate Issue Quality

Evaluates a GitHub issue to ensure it contains everything needed for the agent implementation framework to succeed.

## Input

**Issue Number or URL:**

$ARGUMENTS

## Agent Framework Requirements

The issue must support all agents in the pipeline:

**issue-implementer** needs:

- Clear requirements and end state
- Files/components to create or modify
- Edge cases and error handling specs
- Testing expectations

**quality-checker** needs:

- Testable acceptance criteria matching automated checks
- Performance benchmarks (if applicable)
- Expected test outcomes

**security-checker** needs:

- Security considerations and requirements
- Authentication/authorization specs
- Data handling requirements

**doc-checker** needs:

- Documentation update requirements
- What needs README/wiki updates
- API documentation expectations

**review-orchestrator** needs:

- Clear pass/fail criteria
- Non-negotiable vs nice-to-have requirements

**issue-merger** needs:

- Unambiguous "done" definition
- Integration requirements

## Evaluation Criteria

### ✅ CRITICAL (Must Have)

**1. Clear Requirements**

- Exact end state specified
- Technical details: names, types, constraints, behavior
- Files/components to create or modify
- Validation rules and error handling
- Edge cases covered

**2. Testable Acceptance Criteria**

- Specific, measurable outcomes
- Align with automated checks (pytest, ESLint, TypeScript, build)
- Include edge cases
- Reference quality checks: "All tests pass", "ESLint passes", "Build succeeds"

**3. Affected Components**

- Which files/modules to modify
- Which APIs/endpoints involved
- Which database tables affected
- Which UI components changed

**4. Testing Expectations**

- What tests need to be written
- What tests need to pass
- Performance benchmarks (if applicable)
- Integration test requirements

**5. Context & Why**

- Business value
- User impact
- Current limitation
- Why this matters now

### ⚠️ IMPORTANT (Should Have)

**6. Security Requirements**

- Authentication/authorization needs
- Data privacy considerations
- Input validation requirements
- Security vulnerabilities to avoid

**7. Documentation Requirements**

- What needs README updates
- What needs wiki/API docs
- Inline code comments expected
- FEATURE-LIST.md updates

**8. Error Handling**

- Expected error messages
- Error codes to return
- Fallback behavior
- User-facing error text

**9. Scope Boundaries**

- What IS included
- What is NOT included
- Out of scope items
- Future work references

### 💡 HELPFUL (Nice to Have)

**10. Performance Requirements** (OPTIONAL - only if specific concern)

- Response time limits
- Query performance expectations
- Scale requirements
- Load handling
- Note: Most issues don't need this. Build first, measure, optimize later.

**11. Related Issues**

- Dependencies (blocked by, depends on)
- Related work
- Follow-up issues planned

**12. Implementation Guidance**

- Problems agent needs to solve
- Existing patterns to follow
- Challenges to consider
- No prescriptive solutions (guides, doesn't prescribe)

### ❌ RED FLAGS (Must NOT Have)

**13. Prescriptive Implementation**

- Complete function/component implementations (full solutions)
- Large blocks of working code (> 10-15 lines)
- Complete SQL migration scripts
- Step-by-step implementation guides
- "Add this code to line X" with specific file/line fixes

OK to have:

- Short code examples (< 10 lines): `{ error: 'message', code: 400 }`
- Type definitions: `{ email: string, category?: string }`
- Example API responses: `{ id: 5, status: 'active' }`
- Error message formats: `"Invalid email format"`
- Small syntax examples showing the shape/format

## Evaluation Process

### STEP 1: Fetch the Issue

```bash
ISSUE_NUM=$ARGUMENTS

# Fetch issue content (gh accepts an issue number or a URL)
gh issue view "$ISSUE_NUM" --json title,body,labels > issue-data.json

TITLE=$(jq -r '.title' issue-data.json)
BODY=$(jq -r '.body' issue-data.json)
LABELS=$(jq -r '.labels[].name' issue-data.json)

echo "Evaluating Issue #${ISSUE_NUM}: ${TITLE}"
```

### STEP 2: Check Agent Framework Compatibility

Evaluate for each agent in the pipeline:

**For issue-implementer:**

```bash
# Check: Clear requirements present?
echo "$BODY" | grep -iE '(requirements|end state|target state)' || echo "⚠️ No clear requirements section"

# Check: Files/components specified?
echo "$BODY" | grep -iE '(files? (to )?(create|modify)|components? (to )?(create|modify|affected)|tables? (to )?(create|modify))' || echo "⚠️ No affected files/components specified"

# Check: Edge cases covered?
echo "$BODY" | grep -iE '(edge cases?|error handling|fallback|validation)' || echo "⚠️ No edge cases or error handling mentioned"
```

**For quality-checker:**

```bash
# Check: Acceptance criteria reference automated checks?
echo "$BODY" | grep -iE '(tests? pass|eslint|flake8|mypy|pytest|typescript|build succeeds|linting passes)' || echo "⚠️ Acceptance criteria don't reference automated quality checks"

# Note: Performance requirements are optional - don't check for them
```

**For security-checker:**

```bash
# Check: Security requirements present if handling sensitive data?
if echo "$BODY" | grep -iqE '(password|token|secret|api.?key|auth|credential|email|private)'; then
  echo "$BODY" | grep -iE '(security|authentication|authorization|encrypt|sanitize|validate|sql.?injection|xss)' || echo "⚠️ Handles sensitive data but no security requirements"
fi
```

**For doc-checker:**

```bash
# Check: Documentation requirements specified?
echo "$BODY" | grep -iE '(documentation|readme|wiki|api.?docs?|comments?|feature.?list)' || echo "⚠️ No documentation requirements specified"
```

**For review-orchestrator:**

```bash
# Check: Clear pass/fail criteria?
# Note: grep -c prints 0 itself when nothing matches, so no "|| echo 0"
# fallback is needed (that fallback would produce a two-line value and
# break the numeric comparison below).
CRITERIA_COUNT=$(echo "$BODY" | grep -c '- \[.\]')
if [ "$CRITERIA_COUNT" -lt 3 ]; then
  echo "⚠️ Only $CRITERIA_COUNT acceptance criteria (need at least 3-5)"
fi
```
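
**For issue-merger:**

A sketch in the same style for issue-merger's needs listed above; the grep patterns here are illustrative assumptions, not established project conventions:

```bash
# Check: Unambiguous "done" definition?
echo "$BODY" | grep -iE '(definition of done|done when|complete when)' || echo "⚠️ No unambiguous 'done' definition for issue-merger"

# Check: Integration requirements mentioned?
echo "$BODY" | grep -iE '(integration|depends on|blocked by|merge requirements?)' || echo "⚠️ No integration requirements mentioned"
```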

### STEP 3: Evaluate Testable Acceptance Criteria

Check whether the criteria align with automated checks.

**Good criteria (match quality-checker):**

- "All pytest tests pass" → quality-checker runs pytest
- "ESLint passes with no errors" → quality-checker runs ESLint
- "TypeScript compilation succeeds" → quality-checker runs tsc
- "Query completes in < 100ms" → measurable, testable
- "Can insert 5 accounts per user" → specific, verifiable

**Bad criteria (vague/unmeasurable):**

- "Works correctly" → subjective
- "Code looks good" → subjective
- "Function created" → process-oriented, not an outcome
- "Bug is fixed" → not specific enough

### STEP 4: Check for Affected Components

Does issue specify:

- Which files to create?
- Which files to modify?
- Which APIs/endpoints involved?
- Which database tables affected?
- Which UI components changed?
- Which tests need updating?
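
A heuristic sketch of this checklist (assumes `$BODY` from STEP 1 and bash 4+ for associative arrays; the patterns are rough assumptions, and a silent category may simply not apply to the issue):

```bash
declare -A PATTERNS=(
  ["files/modules"]='files? (to )?(create|modify)|\.(ts|tsx|js|py|sql)'
  ["APIs/endpoints"]='endpoint|/api/|route'
  ["database tables"]='table|schema|migration|column'
  ["UI components"]='component|page|view|form'
  ["tests"]='tests?|spec|coverage'
)
for category in "${!PATTERNS[@]}"; do
  echo "$BODY" | grep -iqE "${PATTERNS[$category]}" \
    || echo "ℹ️ No mention of $category"
done
```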

### STEP 5: Evaluate Testing Expectations

Does issue specify:

- What tests to write (unit, integration, e2e)?
- What existing tests need to pass?
- What test coverage is expected?
- Performance test requirements?
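
A small sketch for this step (assumes `$BODY` from STEP 1; the coverage warning is informational since coverage targets are often legitimately omitted):

```bash
# Check: test types named?
echo "$BODY" | grep -iqE '(unit|integration|e2e|end.to.end).{0,10}tests?' \
  || echo "⚠️ No test types specified (unit/integration/e2e)"

# Check: existing tests referenced?
echo "$BODY" | grep -iqE '(existing tests|all tests pass|pytest|jest)' \
  || echo "⚠️ No statement about which tests must pass"

# Check: coverage expectation? (optional)
echo "$BODY" | grep -iqE 'coverage' || echo "ℹ️ No coverage expectation stated"
```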

### STEP 6: Check Security Requirements

Security (if applicable):

- Authentication/authorization requirements?
- Input validation specs?
- Data privacy considerations?
- Known vulnerabilities to avoid?

**Note on Performance:** Performance requirements are OPTIONAL and NOT evaluated as a critical check. Most issues don't need explicit performance requirements - build first, measure, optimize later. Only flag as missing if:

- Issue specifically mentions performance as a concern
- Feature handles large-scale data (millions of records)
- There's a user-facing latency requirement

Otherwise, absence of performance requirements is fine.
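
This policy can be encoded directly. A minimal sketch that only warns when the issue itself raises performance or scale (assumes `$BODY` from STEP 1; the target-pattern regex is a rough assumption):

```bash
if echo "$BODY" | grep -iqE '(performance|latency|slow|scale|millions of)'; then
  # The issue raises performance as a concern - expect a measurable target
  echo "$BODY" | grep -iqE '(< ?[0-9]+ ?(ms|s)|requests? per second|p(95|99))' \
    || echo "⚠️ Performance is a stated concern but no measurable target is given"
fi
# Otherwise: say nothing - absence of performance requirements is fine
```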

### STEP 7: Check Documentation Requirements

Does issue specify:

- README updates needed?
- Wiki/API docs updates?
- Inline comments expected?
- FEATURE-LIST.md updates?
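
A per-document sketch of this checklist (assumes `$BODY` from STEP 1; FEATURE-LIST.md is this project's convention - substitute your own doc targets):

```bash
for doc in "README" "wiki" "API doc" "FEATURE-LIST"; do
  echo "$BODY" | grep -iqF "$doc" || echo "ℹ️ No mention of: $doc"
done
```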

### STEP 8: Scan for Red Flags

```bash
# RED FLAG: Large code blocks (> 15 lines)
CODE_BLOCKS=$(grep -E '```(typescript|javascript|python|sql|java|go|rust)' <<< "$BODY")
if [ -n "$CODE_BLOCKS" ]; then
  # This is a simplified check - see the fuller sketch below for
  # counting lines per block.
  echo "⚠️ Found code blocks - checking size..."
  # Manual review needed: Are these short examples or full implementations?
fi

# RED FLAG: Complete function implementations
# (matches brace-style bodies and Python-style "def foo():" headers)
grep -iE '(function|def|class).*(\{|:)' <<< "$BODY" && echo "🚨 Contains complete function/class implementations"

# RED FLAG: Prescriptive instructions with specific fixes
grep -iE '(add (this|the following) code to line [0-9]+|here is the implementation|use this exact code)' <<< "$BODY" && echo "🚨 Contains prescriptive code placement instructions"

# RED FLAG: Specific file/line references for bug fixes
grep -E '(fix|change|modify|add).*(in|at|on) line [0-9]+' <<< "$BODY" && echo "🚨 Contains specific file/line fix locations"

# RED FLAG: Step-by-step implementation guide (not just planning)
grep -iE '(step [0-9]|first,.*second,.*third)' <<< "$BODY" | grep -iE '(write this code|add this function|implement as follows)' && echo "🚨 Contains step-by-step code implementation guide"

# OK: Short examples are fine. These are NOT red flags:
# - Type definitions: { email: string }
# - Example responses: { id: 5, status: 'active' }
# - Error formats: "Invalid email"
# - Small syntax examples (< 10 lines)
```
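
The fuller size check the comment above refers to - a sketch that counts lines per fenced block and reports the largest (assumes `$BODY` from STEP 1 and standard triple-backtick fences):

```bash
MAX_BLOCK=$(echo "$BODY" | awk '
  /^```/ {
    if (in_block) { if (n > max) max = n; in_block = 0 }
    else          { in_block = 1; n = 0 }
    next
  }
  in_block { n++ }
  END { print max + 0 }
')

if [ "$MAX_BLOCK" -gt 15 ]; then
  echo "🚨 Largest code block is $MAX_BLOCK lines (> 15) - likely a full implementation"
fi
```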

## Output Format

# Issue #${ISSUE_NUM} Evaluation Report

**Title:** ${TITLE}

## Agent Framework Compatibility: X/9 Checks Passed

**Ready for implementation?** [YES / NEEDS_WORK / NO]

**Note:** Performance, Related Issues, and Implementation Guidance are optional - not counted in score.

### ✅ Strengths (What's Good)
- [Specific strength 1]
- [Specific strength 2]
- [Specific strength 3]

### ⚠️ Needs Improvement (Fix Before Implementation)
- [Missing element with specific impact on agent]
- [Missing element with specific impact on agent]
- [Missing element with specific impact on agent]

### ❌ Critical Issues (Blocks Agent Success)
- [Critical missing element / red flag]
- [Critical missing element / red flag]

### 🚨 Red Flags Found
- [Code snippet / prescriptive instruction found]
- [Specific file/line reference found]

## Agent-by-Agent Analysis

### issue-implementer Readiness: [READY / NEEDS_WORK / BLOCKED]
**Can the implementer succeed with this issue?**

✅ **Has:**
- Clear requirements and end state
- Files/components to modify specified
- Edge cases covered

❌ **Missing:**
- Error handling specifications
- Validation rules not detailed

**Impact:** [Description of how missing elements affect implementer]

### quality-checker Readiness: [READY / NEEDS_WORK / BLOCKED]
**Can the quality checker validate this?**

✅ **Has:**
- Acceptance criteria reference "pytest passes"
- Performance benchmark: "< 100ms"

❌ **Missing:**
- No reference to ESLint/TypeScript checks
- Test coverage expectations not specified

**Impact:** [Description of how missing elements affect quality validation]

### security-checker Readiness: [READY / NEEDS_WORK / N/A]
**Are security requirements clear?**

✅ **Has:**
- Authentication requirements specified
- Input validation rules defined

❌ **Missing:**
- No SQL injection prevention mentioned
- Data encryption requirements unclear

**Impact:** [Description of security gaps]

### doc-checker Readiness: [READY / NEEDS_WORK / BLOCKED]
**Are documentation expectations clear?**

✅ **Has:**
- README update requirement specified

❌ **Missing:**
- No API documentation mentioned
- FEATURE-LIST.md updates not specified

**Impact:** [Description of doc gaps]

### review-orchestrator Readiness: [READY / NEEDS_WORK / BLOCKED]
**Are pass/fail criteria clear?**

✅ **Has:**
- 5 specific acceptance criteria
- Clear success conditions

❌ **Missing:**
- Distinction between blocking vs non-blocking issues
- Performance criteria not measurable

**Impact:** [Description of review ambiguity]

## Recommendations

### High Priority (Fix Before Implementation)
1. [Specific actionable fix]
2. [Specific actionable fix]
3. [Specific actionable fix]

### Medium Priority (Improve Clarity)
1. [Specific suggestion]
2. [Specific suggestion]

### Low Priority (Nice to Have)
1. [Optional improvement]
2. [Optional improvement]

## Example Improvements

### Before (Current):

```markdown
[Quote problematic section from issue]
```

### After (Suggested):

```markdown
[Show how it should be written]
```

## Agent Framework Compatibility

**Will this issue work well with the agent framework?**

- ✅ YES - Issue is well-structured for agents
- ⚠️ MAYBE - Needs improvements but workable
- ❌ NO - Critical issues will confuse agents

**Specific concerns:**

- [Agent compatibility issue 1]
- [Agent compatibility issue 2]

## Quick Fixes

If you want to improve this issue, run:

```bash
# Add these sections
gh issue comment $ISSUE_NUM --body "## Implementation Guidance

To implement this, you will need to:
- [List problems to solve]
"

# Remove code snippets
# Edit the issue and remove code blocks showing implementation
gh issue edit $ISSUE_NUM

# Add acceptance criteria
gh issue comment $ISSUE_NUM --body "## Additional Acceptance Criteria
- [ ] [Specific testable criterion]
"
```

## Summary

**Ready for agent implementation?** [YES/NO/NEEDS_WORK]

**Confidence level:** [HIGH/MEDIUM/LOW]

**Estimated time to fix issues:** [X minutes]


## Agent Framework Compatibility Score

**13 Checks (Pass/Fail):**

### ✅ CRITICAL (Must Pass) - 5 checks
1. **Clear Requirements** - End state specified with technical details
2. **Testable Acceptance Criteria** - Aligns with automated checks (pytest, ESLint, etc.)
3. **Affected Components** - Files/modules to create or modify listed
4. **Testing Expectations** - What tests to write/pass specified
5. **Context** - Why, who, business value explained

### ⚠️ IMPORTANT (Should Pass) - 4 checks
6. **Security Requirements** - If handling sensitive data/auth
7. **Documentation Requirements** - README/wiki/API docs expectations
8. **Error Handling** - Error messages, codes, fallback behavior
9. **Scope Boundaries** - What IS and ISN'T included

### 💡 HELPFUL (Nice to Pass) - 3 checks
10. **Performance Requirements** - OPTIONAL, only if specific concern mentioned
11. **Related Issues** - Dependencies, related work
12. **Implementation Guidance** - Problems to solve (without prescribing HOW)

### ❌ RED FLAG (Must NOT Have) - 1 check
13. **No Prescriptive Implementation** - No full code solutions, prescribed algorithms, or step-by-step implementation guides (short format examples are OK)

**Scoring (based on 9 critical/important checks):**
- **9/9 + No Red Flags**: Perfect - Ready for agents
- **7-8/9 + No Red Flags**: Excellent - Minor improvements
- **5-6/9 + No Red Flags**: Good - Some gaps but workable
- **3-4/9 + No Red Flags**: Needs work - Significant gaps
- **< 3/9 OR Red Flags Present**: Blocked - Must fix before implementation

**Note:** Performance, Related Issues, and Implementation Guidance are HELPFUL but not required.

**Ready for Implementation?**
- **YES**: 7+ checks passed, no red flags
- **NEEDS_WORK**: 5-6 checks passed OR minor red flags
- **NO**: < 5 checks passed OR major red flags (code snippets, step-by-step)

## Examples

### Example 1: Excellent Issue (9/9 Checks Passed)

Issue #42: "Support multiple Google accounts per user"

✅ Agent Framework Compatibility: 9/9
Ready for implementation? YES

Strengths:

- Clear database requirements (tables, columns, constraints)
- Acceptance criteria: "pytest passes", "Build succeeds"
- Files specified: user_google_accounts table, modify user_google_tokens
- Testing: "Write tests for multi-account insertion"
- Security: "Validate email uniqueness per user"
- Documentation: "Update README with multi-account setup"
- Error handling: "Return 409 if duplicate email"
- Scope: "Account switching UI is out of scope (Issue #43)"
- Context: Explains why users need personal + work accounts

Agent-by-Agent:

- ✅ implementer: Has everything needed
- ✅ quality-checker: Criteria match automated checks
- ✅ security-checker: Security requirements clear
- ✅ doc-checker: Documentation expectations specified
- ✅ review-orchestrator: Clear pass/fail criteria

Note: No performance requirements specified - this is fine. Build first, optimize later if needed.


### Example 2: Needs Work (6/9 Checks Passed)

Issue #55: "Add email search functionality"

⚠️ Agent Framework Compatibility: 6/9
Ready for implementation? NEEDS_WORK

Strengths:

- Clear requirements: search endpoint, query parameters
- Context explains user need
- Affected components specified

Issues:

- Missing testing expectations (what tests to write?)
- Missing documentation requirements
- Missing error handling specs
- Vague acceptance criteria: "Search works correctly"

Agent-by-Agent:

- ✅ implementer: Has requirements
- ⚠️ quality-checker: Criteria too vague ("works correctly")
- ⚠️ security-checker: No SQL injection prevention mentioned
- ⚠️ doc-checker: No docs requirements
- ⚠️ review-orchestrator: Criteria not specific enough

Recommendations:

- Add: "pytest tests pass for search functionality"
- Add: "Update API documentation"
- Add: "Error handling: Return 400 if invalid query, 500 if database error"
- Replace "works correctly" with specific outcomes: "Returns matching emails", "Handles empty results"

Note: Performance requirements are optional - no need to add them unless there's a specific concern.


### Example 3: Blocked (2/9, Red Flags Present)

Issue #67: "Fix email parsing bug"

❌ Agent Framework Compatibility: 2/9
Ready for implementation? NO

Critical Issues:

- 🚨 Contains 30-line complete function implementation
- 🚨 Prescriptive: "Add this exact function to inbox.ts line 45"
- 🚨 Step-by-step: "First, create parseEmail(). Second, add validation. Third, handle errors."
- No clear requirements (what should parsing do?)
- No acceptance criteria
- No testing expectations

Agent-by-Agent:

- ❌ implementer: Doesn't need to think - solution provided
- ❌ quality-checker: Can't validate (no criteria)
- ❌ security-checker: Missing validation specs
- ❌ doc-checker: No docs requirements
- ❌ review-orchestrator: No pass/fail criteria

Must fix:

1. Remove complete function - describe WHAT parsing should do instead
   - OK to keep: "Output format: { local: string, domain: string }"
   - OK to keep: "Error message: 'Invalid email format'"
   - Remove: 30-line function implementation
2. Add requirements: input format, output format, error handling
3. Add acceptance criteria: "Parses valid emails", "Rejects invalid formats"
4. Specify testing: "Add unit tests for edge cases"
5. Remove line 45 reference - describe behavior instead
6. Convert step-by-step to "Implementation Guidance" section

### Example 4: Good Use of Code Examples (9/9)

Issue #78: "Add user profile endpoint"

✅ Agent Framework Compatibility: 9/9
Ready for implementation? YES

Requirements:

- Endpoint: POST /api/profile
- Request body: `{ name: string, bio?: string }` ← Short example (OK!)
- Response: `{ id: number, name: string, bio: string | null }` ← Type def (OK!)
- Error: `{ error: "Name required", code: 400 }` ← Error format (OK!)

Good use of code examples:

- Shows shape/format without providing implementation
- Type definitions help clarify requirements
- Error format specifies exact user-facing text
- No function implementations

Agent-by-Agent:

- ✅ implementer: Clear requirements, no prescribed solution
- ✅ quality-checker: Testable criteria
- ✅ security-checker: Input validation specified
- ✅ doc-checker: API docs requirement present


## Integration with GHI Command

This command complements GHI:
- **GHI**: Creates new issues following best practices
- **EVI**: Evaluates existing issues against same standards

Use EVI to:
- Review issues written by other agents
- Audit issues before sending to implementation
- Train team on issue quality standards
- Ensure agent framework compatibility

## Notes

- Run EVI before implementing any issue
- Use it to review issues written by other agents or teammates
- Consider making EVI a required check in your workflow
- **7+ checks passed**: Agent-ready, implement immediately
- **5-6 checks passed**: Needs minor improvements, but workable
- **< 5 checks passed**: Must revise before implementation
- Look for red flags even in high-scoring issues

### What About Code Snippets?

**✅ Short examples are GOOD:**
- Type definitions: `interface User { id: number, name: string }`
- API responses: `{ success: true, data: {...} }`
- Error formats: `{ error: "message", code: 400 }`
- Small syntax examples (< 10 lines)

**❌ Large implementations are BAD:**
- Complete functions (15+ lines)
- Full component implementations
- Complete SQL migration scripts
- Working solutions that agents can copy-paste

**The key distinction:** Are you showing the shape/format (good) or providing the solution (bad)?