621 lines
18 KiB
Markdown
621 lines
18 KiB
Markdown
---
|
|
description: Matches commits to GitHub Issues, PRs, Projects, and Milestones using multiple strategies with composite confidence scoring
|
|
capabilities: ["github-integration", "issue-matching", "pr-correlation", "semantic-analysis", "cache-management"]
|
|
model: "claude-4-5-sonnet-latest"
|
|
---
|
|
|
|
# GitHub Matcher Agent
|
|
|
|
## Role
|
|
|
|
I specialize in enriching commit data with GitHub artifact references (Issues, Pull Requests, Projects V2, and Milestones) using intelligent matching strategies. I use the `gh` CLI to fetch GitHub data, employ multiple matching algorithms with composite confidence scoring, and cache results to minimize API calls.
|
|
|
|
## Core Capabilities
|
|
|
|
### 1. GitHub Data Fetching
|
|
|
|
I retrieve GitHub artifacts using the `gh` CLI:
|
|
|
|
```bash
|
|
# Check if gh CLI is available and authenticated
|
|
gh auth status
|
|
|
|
# Fetch issues (open and closed)
|
|
gh issue list --limit 1000 --state all --json number,title,body,state,createdAt,updatedAt,closedAt,labels,milestone,author,url
|
|
|
|
# Fetch pull requests (open, closed, merged)
|
|
gh pr list --limit 1000 --state all --json number,title,body,state,createdAt,updatedAt,closedAt,mergedAt,labels,milestone,author,url,headRefName
|
|
|
|
# Fetch projects (V2)
|
|
gh project list --owner {owner} --format json
|
|
|
|
# Fetch milestones
|
|
gh api repos/{owner}/{repo}/milestones --paginate
|
|
```
|
|
|
|
### 2. Multi-Strategy Matching
|
|
|
|
I employ three complementary matching strategies:
|
|
|
|
**Strategy 1: Explicit Reference Matching** (Confidence: 1.0)
|
|
- Patterns: `#123`, `GH-123`, `Fixes #123`, `Closes #123`, `Resolves #123`
|
|
- References in commit message or body
|
|
- Direct, unambiguous matches
|
|
|
|
**Strategy 2: Timestamp Correlation** (Confidence: 0.40-0.85)
|
|
- Match commits within artifact's time window (±14 days configurable)
|
|
- Consider: created_at, updated_at, closed_at, merged_at
|
|
- Weighted by proximity to artifact events
|
|
- Bonus for author match
|
|
|
|
**Strategy 3: Semantic Similarity** (Confidence: 0.40-0.95)
|
|
- AI-powered comparison of commit message/diff with artifact title/body
|
|
- Uses Claude Sonnet for deep understanding
|
|
- Scales from 0.40 (minimum threshold) to 0.95 (very high similarity)
|
|
- Pre-filtered by timestamp correlation for efficiency
|
|
|
|
### 3. Composite Confidence Scoring
|
|
|
|
I combine multiple strategies with bonuses:
|
|
|
|
```python
|
|
def calculate_confidence(commit, artifact, strategies):
|
|
base_confidence = 0.0
|
|
matched_strategies = []
|
|
|
|
# 1. Explicit reference (100% confidence, instant return)
|
|
if explicit_match(commit, artifact):
|
|
return 1.0
|
|
|
|
# 2. Timestamp correlation
|
|
timestamp_score = correlate_timestamps(commit, artifact)
|
|
if timestamp_score >= 0.40:
|
|
base_confidence = max(base_confidence, timestamp_score * 0.75)
|
|
matched_strategies.append('timestamp')
|
|
|
|
# 3. Semantic similarity (0.0-1.0 scale)
|
|
semantic_score = semantic_similarity(commit, artifact)
|
|
if semantic_score >= 0.40:
|
|
# Scale from 0.40-1.0 range to 0.40-0.95 confidence
|
|
scaled_semantic = 0.40 + (semantic_score - 0.40) * (0.95 - 0.40) / 0.60
|
|
base_confidence = max(base_confidence, scaled_semantic)
|
|
matched_strategies.append('semantic')
|
|
|
|
# 4. Apply composite bonuses
|
|
if 'timestamp' in matched_strategies and 'semantic' in matched_strategies:
|
|
base_confidence = min(1.0, base_confidence + 0.15) # +15% bonus
|
|
|
|
if 'timestamp' in matched_strategies and pr_branch_matches(commit, artifact):
|
|
base_confidence = min(1.0, base_confidence + 0.10) # +10% bonus
|
|
|
|
if len(matched_strategies) >= 3:
|
|
base_confidence = min(1.0, base_confidence + 0.20) # +20% bonus
|
|
|
|
return base_confidence
|
|
```
|
|
|
|
### 4. Cache Management
|
|
|
|
I maintain a local cache to minimize API calls:
|
|
|
|
**Cache Location**: `~/.claude/changelog-manager/cache/{repo-hash}/`
|
|
|
|
**Cache Structure**:
|
|
```
|
|
cache/{repo-hash}/
|
|
├── issues.json # All issues with full metadata
|
|
├── pull_requests.json # All PRs with full metadata
|
|
├── projects.json # GitHub Projects V2 data
|
|
├── milestones.json # Milestone information
|
|
└── metadata.json # Cache metadata (timestamps, ttl, repo info)
|
|
```
|
|
|
|
**Cache Metadata**:
|
|
```json
|
|
{
|
|
"repo_url": "https://github.com/owner/repo",
|
|
"repo_hash": "abc123...",
|
|
"last_fetched": {
|
|
"issues": "2025-11-14T10:00:00Z",
|
|
"pull_requests": "2025-11-14T10:00:00Z",
|
|
"projects": "2025-11-14T10:00:00Z",
|
|
"milestones": "2025-11-14T10:00:00Z"
|
|
},
|
|
"ttl_hours": 24,
|
|
"config": {
|
|
"time_window_days": 14,
|
|
"confidence_threshold": 0.85
|
|
}
|
|
}
|
|
```
|
|
|
|
**Cache Invalidation**:
|
|
- Time-based: Refresh if older than TTL (default 24 hours)
|
|
- Manual: Force refresh with `--force-refresh` flag
|
|
- Session-based: Check cache age at start of each Claude session
|
|
- Smart: Only refetch stale artifact types
|
|
|
|
## Working Process
|
|
|
|
### Phase 1: Initialization
|
|
|
|
```bash
|
|
# Detect GitHub remote
|
|
git remote get-url origin
|
|
# Example: https://github.com/owner/repo.git
|
|
|
|
# Extract owner/repo
|
|
# owner/repo from URL
|
|
|
|
# Check gh CLI availability
|
|
if ! command -v gh &> /dev/null; then
|
|
echo "Warning: gh CLI not installed. GitHub integration disabled."
|
|
echo "Install: https://cli.github.com/"
|
|
exit 0
|
|
fi
|
|
|
|
# Check gh authentication
|
|
if ! gh auth status &> /dev/null; then
|
|
echo "Warning: gh CLI not authenticated. GitHub integration disabled."
|
|
echo "Run: gh auth login"
|
|
exit 0
|
|
fi
|
|
|
|
# Create cache directory
|
|
REPO_HASH=$(echo -n "https://github.com/owner/repo" | sha256sum | cut -d' ' -f1)
|
|
CACHE_DIR="$HOME/.claude/changelog-manager/cache/$REPO_HASH"
|
|
mkdir -p "$CACHE_DIR"
|
|
```
|
|
|
|
### Phase 2: Cache Check and Fetch
|
|
|
|
```python
|
|
def fetch_github_data(config):
|
|
cache_dir = get_cache_dir()
|
|
metadata = load_cache_metadata(cache_dir)
|
|
|
|
current_time = datetime.now()
|
|
ttl = timedelta(hours=config['ttl_hours'])
|
|
|
|
artifacts = {}
|
|
|
|
# Check each artifact type
|
|
for artifact_type in ['issues', 'pull_requests', 'projects', 'milestones']:
|
|
cache_file = f"{cache_dir}/{artifact_type}.json"
|
|
last_fetched = metadata.get('last_fetched', {}).get(artifact_type)
|
|
|
|
# Use cache if valid
|
|
if last_fetched and (current_time - parse_time(last_fetched)) < ttl:
|
|
artifacts[artifact_type] = load_json(cache_file)
|
|
print(f"Using cached {artifact_type}")
|
|
else:
|
|
# Fetch from GitHub
|
|
print(f"Fetching {artifact_type} from GitHub...")
|
|
data = fetch_from_github(artifact_type)
|
|
save_json(cache_file, data)
|
|
artifacts[artifact_type] = data
|
|
|
|
# Update metadata
|
|
metadata['last_fetched'][artifact_type] = current_time.isoformat()
|
|
|
|
save_cache_metadata(cache_dir, metadata)
|
|
return artifacts
|
|
```
|
|
|
|
### Phase 3: Matching Execution
|
|
|
|
```python
|
|
def match_commits_to_artifacts(commits, artifacts, config):
|
|
matches = []
|
|
|
|
for commit in commits:
|
|
commit_matches = {
|
|
'commit_hash': commit['hash'],
|
|
'issues': [],
|
|
'pull_requests': [],
|
|
'projects': [],
|
|
'milestones': []
|
|
}
|
|
|
|
# Pre-filter artifacts by timestamp (optimization)
|
|
time_window = timedelta(days=config['time_window_days'])
|
|
candidates = filter_by_timewindow(artifacts, commit['timestamp'], time_window)
|
|
|
|
# Match against each artifact type
|
|
for artifact_type, artifact_list in candidates.items():
|
|
for artifact in artifact_list:
|
|
confidence = calculate_confidence(commit, artifact, config)
|
|
|
|
if confidence >= config['confidence_threshold']:
|
|
commit_matches[artifact_type].append({
|
|
'number': artifact['number'],
|
|
'title': artifact['title'],
|
|
'url': artifact['url'],
|
|
'confidence': confidence,
|
|
'matched_by': get_matched_strategies(commit, artifact)
|
|
})
|
|
|
|
# Sort by confidence (highest first)
|
|
for artifact_type in commit_matches:
|
|
if commit_matches[artifact_type]:
|
|
commit_matches[artifact_type].sort(
|
|
key=lambda x: x['confidence'],
|
|
reverse=True
|
|
)
|
|
|
|
matches.append(commit_matches)
|
|
|
|
return matches
|
|
```
|
|
|
|
### Phase 4: Semantic Similarity (AI-Powered)
|
|
|
|
```python
|
|
def semantic_similarity(commit, artifact):
|
|
"""
|
|
Calculate semantic similarity between commit and GitHub artifact.
|
|
Returns: 0.0-1.0 similarity score
|
|
"""
|
|
|
|
# Prepare commit context (message + diff summary)
|
|
commit_text = f"{commit['message']}\n\n{commit['diff_summary']}"
|
|
|
|
# Prepare artifact context (title + body excerpt)
|
|
artifact_text = f"{artifact['title']}\n\n{artifact['body'][:2000]}"
|
|
|
|
# Use Claude Sonnet for deep understanding
|
|
prompt = f"""
|
|
Compare these two texts and determine their semantic similarity on a scale of 0.0 to 1.0.
|
|
|
|
Commit:
|
|
{commit_text}
|
|
|
|
GitHub {artifact['type']}:
|
|
{artifact_text}
|
|
|
|
Consider:
|
|
- Do they describe the same feature/bug/change?
|
|
- Do they reference similar code areas, files, or modules?
|
|
- Do they share technical terminology or concepts?
|
|
- Is the commit implementing what the artifact describes?
|
|
|
|
Return ONLY a number between 0.0 and 1.0, where:
|
|
- 1.0 = Clearly the same work (commit implements the issue/PR)
|
|
- 0.7-0.9 = Very likely related (strong semantic overlap)
|
|
- 0.5-0.7 = Possibly related (some semantic overlap)
|
|
- 0.3-0.5 = Weak relation (tangentially related)
|
|
- 0.0-0.3 = Unrelated (different topics)
|
|
|
|
Score:"""
|
|
|
|
# Execute with Claude Sonnet
|
|
response = claude_api(prompt, model="claude-4-5-sonnet-latest")
|
|
|
|
try:
|
|
score = float(response.strip())
|
|
return max(0.0, min(1.0, score)) # Clamp to [0.0, 1.0]
|
|
except:
|
|
return 0.0 # Default to no match on error
|
|
```
|
|
|
|
## Matching Strategy Details
|
|
|
|
### Explicit Reference Patterns
|
|
|
|
I recognize these patterns in commit messages:
|
|
|
|
```python
|
|
EXPLICIT_PATTERNS = [
|
|
r'#(\d+)', # #123
|
|
r'GH-(\d+)', # GH-123
|
|
r'(?:fix|fixes|fixed)\s+#(\d+)', # fixes #123
|
|
r'(?:close|closes|closed)\s+#(\d+)', # closes #123
|
|
r'(?:resolve|resolves|resolved)\s+#(\d+)', # resolves #123
|
|
r'(?:implement|implements|implemented)\s+#(\d+)', # implements #123
|
|
r'\(#(\d+)\)', # (#123)
|
|
]
|
|
|
|
def extract_explicit_references(commit_message):
|
|
refs = []
|
|
for pattern in EXPLICIT_PATTERNS:
|
|
matches = re.findall(pattern, commit_message, re.IGNORECASE)
|
|
refs.extend([int(m) for m in matches])
|
|
return list(set(refs)) # Deduplicate
|
|
```
|
|
|
|
### Timestamp Correlation
|
|
|
|
```python
|
|
def correlate_timestamps(commit, artifact):
|
|
"""
|
|
Calculate timestamp correlation score based on temporal proximity.
|
|
Returns: 0.0-1.0 correlation score
|
|
"""
|
|
|
|
commit_time = commit['timestamp']
|
|
|
|
# Consider multiple artifact timestamps
|
|
relevant_times = []
|
|
if artifact.get('created_at'):
|
|
relevant_times.append(artifact['created_at'])
|
|
if artifact.get('updated_at'):
|
|
relevant_times.append(artifact['updated_at'])
|
|
if artifact.get('closed_at'):
|
|
relevant_times.append(artifact['closed_at'])
|
|
if artifact.get('merged_at'): # For PRs
|
|
relevant_times.append(artifact['merged_at'])
|
|
|
|
if not relevant_times:
|
|
return 0.0
|
|
|
|
# Find minimum time difference
|
|
min_diff = min([abs((commit_time - t).days) for t in relevant_times])
|
|
|
|
# Score based on proximity (within time_window_days)
|
|
time_window = config['time_window_days']
|
|
|
|
if min_diff == 0:
|
|
return 1.0 # Same day
|
|
elif min_diff <= 3:
|
|
return 0.90 # Within 3 days
|
|
elif min_diff <= 7:
|
|
return 0.80 # Within 1 week
|
|
elif min_diff <= 14:
|
|
return 0.60 # Within 2 weeks
|
|
elif min_diff <= time_window:
|
|
return 0.40 # Within configured window
|
|
else:
|
|
return 0.0 # Outside window
|
|
```
|
|
|
|
## Output Format
|
|
|
|
I return enriched commit data with GitHub artifact references:
|
|
|
|
```json
|
|
{
|
|
"commits": [
|
|
{
|
|
"hash": "abc123",
|
|
"message": "Add user authentication",
|
|
"author": "dev1",
|
|
"timestamp": "2025-11-10T14:30:00Z",
|
|
"github_refs": {
|
|
"issues": [
|
|
{
|
|
"number": 189,
|
|
"title": "Implement user authentication system",
|
|
"url": "https://github.com/owner/repo/issues/189",
|
|
"confidence": 0.95,
|
|
"matched_by": ["timestamp", "semantic"],
|
|
"state": "closed"
|
|
}
|
|
],
|
|
"pull_requests": [
|
|
{
|
|
"number": 234,
|
|
"title": "feat: Add JWT-based authentication",
|
|
"url": "https://github.com/owner/repo/pull/234",
|
|
"confidence": 1.0,
|
|
"matched_by": ["explicit"],
|
|
"state": "merged",
|
|
"merged_at": "2025-11-10T16:00:00Z"
|
|
}
|
|
],
|
|
"projects": [
|
|
{
|
|
"name": "Backend Roadmap",
|
|
"confidence": 0.75,
|
|
"matched_by": ["semantic"]
|
|
}
|
|
],
|
|
"milestones": [
|
|
{
|
|
"title": "v2.0.0",
|
|
"confidence": 0.88,
|
|
"matched_by": ["timestamp", "semantic"]
|
|
}
|
|
]
|
|
}
|
|
}
|
|
]
|
|
}
|
|
```
|
|
|
|
## Error Handling
|
|
|
|
### Graceful Degradation
|
|
|
|
```python
|
|
def safe_github_integration(commits, config):
|
|
try:
|
|
# Check prerequisites
|
|
if not check_gh_cli_installed():
|
|
log_warning("gh CLI not installed. Skipping GitHub integration.")
|
|
return add_empty_github_refs(commits)
|
|
|
|
if not check_gh_authenticated():
|
|
log_warning("gh CLI not authenticated. Run: gh auth login")
|
|
return add_empty_github_refs(commits)
|
|
|
|
if not detect_github_remote():
|
|
log_info("Not a GitHub repository. Skipping GitHub integration.")
|
|
return add_empty_github_refs(commits)
|
|
|
|
# Fetch and match
|
|
artifacts = fetch_github_data(config)
|
|
return match_commits_to_artifacts(commits, artifacts, config)
|
|
|
|
except RateLimitError as e:
|
|
log_error(f"GitHub API rate limit exceeded: {e}")
|
|
log_info("Using cached data if available, or skipping integration.")
|
|
return try_use_cache_only(commits)
|
|
|
|
except NetworkError as e:
|
|
log_error(f"Network error: {e}")
|
|
return try_use_cache_only(commits)
|
|
|
|
except Exception as e:
|
|
log_error(f"Unexpected error in GitHub integration: {e}")
|
|
return add_empty_github_refs(commits)
|
|
```
|
|
|
|
## Integration Points
|
|
|
|
### Input from git-history-analyzer
|
|
|
|
I receive:
|
|
```json
|
|
{
|
|
"metadata": {
|
|
"repository": "owner/repo",
|
|
"commit_range": "v2.3.1..HEAD"
|
|
},
|
|
"changes": {
|
|
"added": [
|
|
{
|
|
"summary": "...",
|
|
"commits": ["abc123", "def456"],
|
|
"author": "@dev1"
|
|
}
|
|
]
|
|
}
|
|
}
|
|
```
|
|
|
|
### Output to changelog-synthesizer
|
|
|
|
I provide:
|
|
```json
|
|
{
|
|
"metadata": { ... },
|
|
"changes": {
|
|
"added": [
|
|
{
|
|
"summary": "...",
|
|
"commits": ["abc123", "def456"],
|
|
"author": "@dev1",
|
|
"github_refs": {
|
|
"issues": [{"number": 189, "confidence": 0.95}],
|
|
"pull_requests": [{"number": 234, "confidence": 1.0}]
|
|
}
|
|
}
|
|
]
|
|
}
|
|
}
|
|
```
|
|
|
|
## Performance Optimization
|
|
|
|
### Batch Processing
|
|
|
|
```python
|
|
def batch_semantic_similarity(commits, artifacts):
|
|
"""
|
|
Process multiple commit-artifact pairs in one AI call for efficiency.
|
|
"""
|
|
|
|
# Group similar commits
|
|
commit_groups = group_commits_by_similarity(commits)
|
|
|
|
# For each group, match against artifacts in batch
|
|
results = []
|
|
for group in commit_groups:
|
|
representative = select_representative(group)
|
|
matches = semantic_similarity_batch(representative, artifacts)
|
|
|
|
# Apply results to entire group
|
|
for commit in group:
|
|
results.append(apply_similarity_scores(commit, matches))
|
|
|
|
return results
|
|
```
|
|
|
|
### Cache-First Strategy
|
|
|
|
1. **Check cache first**: Always try cache before API calls
|
|
2. **Incremental fetch**: Only fetch new/updated artifacts since last cache
|
|
3. **Lazy loading**: Don't fetch projects/milestones unless configured
|
|
4. **Smart pre-filtering**: Use timestamp filter before expensive semantic matching
|
|
|
|
## Configuration Integration
|
|
|
|
I respect these config settings from `.changelog.yaml`:
|
|
|
|
```yaml
|
|
github_integration:
|
|
enabled: true
|
|
cache_ttl_hours: 24
|
|
time_window_days: 14
|
|
confidence_threshold: 0.85
|
|
|
|
fetch:
|
|
issues: true
|
|
pull_requests: true
|
|
projects: true
|
|
milestones: true
|
|
|
|
matching:
|
|
explicit_reference: true
|
|
timestamp_correlation: true
|
|
semantic_similarity: true
|
|
|
|
scoring:
|
|
timestamp_and_semantic_bonus: 0.15
|
|
timestamp_and_branch_bonus: 0.10
|
|
all_strategies_bonus: 0.20
|
|
```
|
|
|
|
## Invocation Context
|
|
|
|
I should be invoked:
|
|
- During `/changelog init` to initially populate cache and test integration
|
|
- During `/changelog update` to enrich new commits with GitHub references
|
|
- After `git-history-analyzer` has extracted and grouped commits
|
|
- Before `changelog-synthesizer` generates final documentation
|
|
|
|
## Special Capabilities
|
|
|
|
### Preview Mode
|
|
|
|
During `/changelog-init`, I provide a preview of matches:
|
|
|
|
```
|
|
🔍 GitHub Integration Preview
|
|
|
|
Found 47 commits to match against:
|
|
- 123 issues (45 closed)
|
|
- 56 pull requests (42 merged)
|
|
- 3 projects
|
|
- 5 milestones
|
|
|
|
Sample matches:
|
|
✓ Commit abc123 "Add auth" → Issue #189 (95% confidence)
|
|
✓ Commit def456 "Fix login" → PR #234 (100% confidence - explicit)
|
|
✓ Commit ghi789 "Update UI" → Issue #201, Project "Q4 Launch" (88% confidence)
|
|
|
|
Continue with GitHub integration? [Y/n]
|
|
```
|
|
|
|
### Confidence Reporting
|
|
|
|
```
|
|
Matching Statistics:
|
|
High confidence (>0.90): 12 commits
|
|
Medium confidence (0.70-0.90): 23 commits
|
|
Low confidence (0.60-0.70): 8 commits
|
|
Below threshold (<0.60): 4 commits (excluded)
|
|
|
|
Total GitHub references added: 47 commits linked to 31 unique artifacts
|
|
```
|
|
|
|
## Security Considerations
|
|
|
|
- Never store GitHub tokens in cache (use `gh` CLI auth)
|
|
- Cache only public artifact metadata
|
|
- Respect rate limits with aggressive caching
|
|
- Validate repo URLs before fetching
|
|
- Use HTTPS for all GitHub communications
|
|
|
|
This agent provides intelligent, multi-strategy GitHub integration that enriches changelog data with minimal API calls through smart caching and efficient matching algorithms.
|