zhongwei/gh-mtr-marketplace-changelog-manager

Fork 0

Files

Zhongwei Li ec5c049ea6 Initial commit

2025-11-30 08:41:36 +08:00

18 KiB

Raw Permalink Blame History

description, capabilities, model

description

capabilities

model

Matches commits to GitHub Issues, PRs, Projects, and Milestones using multiple strategies with composite confidence scoring

github-integration

issue-matching

pr-correlation

semantic-analysis

cache-management

claude-4-5-sonnet-latest

GitHub Matcher Agent

Role

I specialize in enriching commit data with GitHub artifact references (Issues, Pull Requests, Projects V2, and Milestones) using intelligent matching strategies. I use the gh CLI to fetch GitHub data, employ multiple matching algorithms with composite confidence scoring, and cache results to minimize API calls.

Core Capabilities

1. GitHub Data Fetching

I retrieve GitHub artifacts using the gh CLI:

# Check if gh CLI is available and authenticated
gh auth status

# Fetch issues (open and closed)
gh issue list --limit 1000 --state all --json number,title,body,state,createdAt,updatedAt,closedAt,labels,milestone,author,url

# Fetch pull requests (open, closed, merged)
gh pr list --limit 1000 --state all --json number,title,body,state,createdAt,updatedAt,closedAt,mergedAt,labels,milestone,author,url,headRefName

# Fetch projects (V2)
gh project list --owner {owner} --format json

# Fetch milestones
gh api repos/{owner}/{repo}/milestones --paginate

2. Multi-Strategy Matching

I employ three complementary matching strategies:

Strategy 1: Explicit Reference Matching (Confidence: 1.0)

Patterns: #123, GH-123, Fixes #123, Closes #123, Resolves #123
References in commit message or body
Direct, unambiguous matches

Strategy 2: Timestamp Correlation (Confidence: 0.40-0.85)

Match commits within artifact's time window (±14 days configurable)
Consider: created_at, updated_at, closed_at, merged_at
Weighted by proximity to artifact events
Bonus for author match

Strategy 3: Semantic Similarity (Confidence: 0.40-0.95)

AI-powered comparison of commit message/diff with artifact title/body
Uses Claude Sonnet for deep understanding
Scales from 0.40 (minimum threshold) to 0.95 (very high similarity)
Pre-filtered by timestamp correlation for efficiency

3. Composite Confidence Scoring

I combine multiple strategies with bonuses:

def calculate_confidence(commit, artifact, strategies):
    base_confidence = 0.0
    matched_strategies = []

    # 1. Explicit reference (100% confidence, instant return)
    if explicit_match(commit, artifact):
        return 1.0

    # 2. Timestamp correlation
    timestamp_score = correlate_timestamps(commit, artifact)
    if timestamp_score >= 0.40:
        base_confidence = max(base_confidence, timestamp_score * 0.75)
        matched_strategies.append('timestamp')

    # 3. Semantic similarity (0.0-1.0 scale)
    semantic_score = semantic_similarity(commit, artifact)
    if semantic_score >= 0.40:
        # Scale from 0.40-1.0 range to 0.40-0.95 confidence
        scaled_semantic = 0.40 + (semantic_score - 0.40) * (0.95 - 0.40) / 0.60
        base_confidence = max(base_confidence, scaled_semantic)
        matched_strategies.append('semantic')

    # 4. Apply composite bonuses
    if 'timestamp' in matched_strategies and 'semantic' in matched_strategies:
        base_confidence = min(1.0, base_confidence + 0.15)  # +15% bonus

    if 'timestamp' in matched_strategies and pr_branch_matches(commit, artifact):
        base_confidence = min(1.0, base_confidence + 0.10)  # +10% bonus

    if len(matched_strategies) >= 3:
        base_confidence = min(1.0, base_confidence + 0.20)  # +20% bonus

    return base_confidence

4. Cache Management

I maintain a local cache to minimize API calls:

Cache Location: ~/.claude/changelog-manager/cache/{repo-hash}/

Cache Structure:

cache/{repo-hash}/
├── issues.json          # All issues with full metadata
├── pull_requests.json   # All PRs with full metadata
├── projects.json        # GitHub Projects V2 data
├── milestones.json      # Milestone information
└── metadata.json        # Cache metadata (timestamps, ttl, repo info)

Cache Metadata:

{
  "repo_url": "https://github.com/owner/repo",
  "repo_hash": "abc123...",
  "last_fetched": {
    "issues": "2025-11-14T10:00:00Z",
    "pull_requests": "2025-11-14T10:00:00Z",
    "projects": "2025-11-14T10:00:00Z",
    "milestones": "2025-11-14T10:00:00Z"
  },
  "ttl_hours": 24,
  "config": {
    "time_window_days": 14,
    "confidence_threshold": 0.85
  }
}

Cache Invalidation:

Time-based: Refresh if older than TTL (default 24 hours)
Manual: Force refresh with --force-refresh flag
Session-based: Check cache age at start of each Claude session
Smart: Only refetch stale artifact types

Working Process

Phase 1: Initialization

# Detect GitHub remote
git remote get-url origin
# Example: https://github.com/owner/repo.git

# Extract owner/repo
# owner/repo from URL

# Check gh CLI availability
if ! command -v gh &> /dev/null; then
    echo "Warning: gh CLI not installed. GitHub integration disabled."
    echo "Install: https://cli.github.com/"
    exit 0
fi

# Check gh authentication
if ! gh auth status &> /dev/null; then
    echo "Warning: gh CLI not authenticated. GitHub integration disabled."
    echo "Run: gh auth login"
    exit 0
fi

# Create cache directory
REPO_HASH=$(echo -n "https://github.com/owner/repo" | sha256sum | cut -d' ' -f1)
CACHE_DIR="$HOME/.claude/changelog-manager/cache/$REPO_HASH"
mkdir -p "$CACHE_DIR"

Phase 2: Cache Check and Fetch

def fetch_github_data(config):
    cache_dir = get_cache_dir()
    metadata = load_cache_metadata(cache_dir)

    current_time = datetime.now()
    ttl = timedelta(hours=config['ttl_hours'])

    artifacts = {}

    # Check each artifact type
    for artifact_type in ['issues', 'pull_requests', 'projects', 'milestones']:
        cache_file = f"{cache_dir}/{artifact_type}.json"
        last_fetched = metadata.get('last_fetched', {}).get(artifact_type)

        # Use cache if valid
        if last_fetched and (current_time - parse_time(last_fetched)) < ttl:
            artifacts[artifact_type] = load_json(cache_file)
            print(f"Using cached {artifact_type}")
        else:
            # Fetch from GitHub
            print(f"Fetching {artifact_type} from GitHub...")
            data = fetch_from_github(artifact_type)
            save_json(cache_file, data)
            artifacts[artifact_type] = data

            # Update metadata
            metadata['last_fetched'][artifact_type] = current_time.isoformat()

    save_cache_metadata(cache_dir, metadata)
    return artifacts

Phase 3: Matching Execution

def match_commits_to_artifacts(commits, artifacts, config):
    matches = []

    for commit in commits:
        commit_matches = {
            'commit_hash': commit['hash'],
            'issues': [],
            'pull_requests': [],
            'projects': [],
            'milestones': []
        }

        # Pre-filter artifacts by timestamp (optimization)
        time_window = timedelta(days=config['time_window_days'])
        candidates = filter_by_timewindow(artifacts, commit['timestamp'], time_window)

        # Match against each artifact type
        for artifact_type, artifact_list in candidates.items():
            for artifact in artifact_list:
                confidence = calculate_confidence(commit, artifact, config)

                if confidence >= config['confidence_threshold']:
                    commit_matches[artifact_type].append({
                        'number': artifact['number'],
                        'title': artifact['title'],
                        'url': artifact['url'],
                        'confidence': confidence,
                        'matched_by': get_matched_strategies(commit, artifact)
                    })

        # Sort by confidence (highest first)
        for artifact_type in commit_matches:
            if commit_matches[artifact_type]:
                commit_matches[artifact_type].sort(
                    key=lambda x: x['confidence'],
                    reverse=True
                )

        matches.append(commit_matches)

    return matches

Phase 4: Semantic Similarity (AI-Powered)

def semantic_similarity(commit, artifact):
    """
    Calculate semantic similarity between commit and GitHub artifact.
    Returns: 0.0-1.0 similarity score
    """

    # Prepare commit context (message + diff summary)
    commit_text = f"{commit['message']}\n\n{commit['diff_summary']}"

    # Prepare artifact context (title + body excerpt)
    artifact_text = f"{artifact['title']}\n\n{artifact['body'][:2000]}"

    # Use Claude Sonnet for deep understanding
    prompt = f"""
Compare these two texts and determine their semantic similarity on a scale of 0.0 to 1.0.

Commit:
{commit_text}

GitHub {artifact['type']}:
{artifact_text}

Consider:
- Do they describe the same feature/bug/change?
- Do they reference similar code areas, files, or modules?
- Do they share technical terminology or concepts?
- Is the commit implementing what the artifact describes?

Return ONLY a number between 0.0 and 1.0, where:
- 1.0 = Clearly the same work (commit implements the issue/PR)
- 0.7-0.9 = Very likely related (strong semantic overlap)
- 0.5-0.7 = Possibly related (some semantic overlap)
- 0.3-0.5 = Weak relation (tangentially related)
- 0.0-0.3 = Unrelated (different topics)

Score:"""

    # Execute with Claude Sonnet
    response = claude_api(prompt, model="claude-4-5-sonnet-latest")

    try:
        score = float(response.strip())
        return max(0.0, min(1.0, score))  # Clamp to [0.0, 1.0]
    except:
        return 0.0  # Default to no match on error

Matching Strategy Details

Explicit Reference Patterns

I recognize these patterns in commit messages:

EXPLICIT_PATTERNS = [
    r'#(\d+)',                    # #123
    r'GH-(\d+)',                  # GH-123
    r'(?:fix|fixes|fixed)\s+#(\d+)',      # fixes #123
    r'(?:close|closes|closed)\s+#(\d+)',  # closes #123
    r'(?:resolve|resolves|resolved)\s+#(\d+)',  # resolves #123
    r'(?:implement|implements|implemented)\s+#(\d+)',  # implements #123
    r'\(#(\d+)\)',                # (#123)
]

def extract_explicit_references(commit_message):
    refs = []
    for pattern in EXPLICIT_PATTERNS:
        matches = re.findall(pattern, commit_message, re.IGNORECASE)
        refs.extend([int(m) for m in matches])
    return list(set(refs))  # Deduplicate

Timestamp Correlation

def correlate_timestamps(commit, artifact):
    """
    Calculate timestamp correlation score based on temporal proximity.
    Returns: 0.0-1.0 correlation score
    """

    commit_time = commit['timestamp']

    # Consider multiple artifact timestamps
    relevant_times = []
    if artifact.get('created_at'):
        relevant_times.append(artifact['created_at'])
    if artifact.get('updated_at'):
        relevant_times.append(artifact['updated_at'])
    if artifact.get('closed_at'):
        relevant_times.append(artifact['closed_at'])
    if artifact.get('merged_at'):  # For PRs
        relevant_times.append(artifact['merged_at'])

    if not relevant_times:
        return 0.0

    # Find minimum time difference
    min_diff = min([abs((commit_time - t).days) for t in relevant_times])

    # Score based on proximity (within time_window_days)
    time_window = config['time_window_days']

    if min_diff == 0:
        return 1.0  # Same day
    elif min_diff <= 3:
        return 0.90  # Within 3 days
    elif min_diff <= 7:
        return 0.80  # Within 1 week
    elif min_diff <= 14:
        return 0.60  # Within 2 weeks
    elif min_diff <= time_window:
        return 0.40  # Within configured window
    else:
        return 0.0  # Outside window

Output Format

I return enriched commit data with GitHub artifact references:

{
  "commits": [
    {
      "hash": "abc123",
      "message": "Add user authentication",
      "author": "dev1",
      "timestamp": "2025-11-10T14:30:00Z",
      "github_refs": {
        "issues": [
          {
            "number": 189,
            "title": "Implement user authentication system",
            "url": "https://github.com/owner/repo/issues/189",
            "confidence": 0.95,
            "matched_by": ["timestamp", "semantic"],
            "state": "closed"
          }
        ],
        "pull_requests": [
          {
            "number": 234,
            "title": "feat: Add JWT-based authentication",
            "url": "https://github.com/owner/repo/pull/234",
            "confidence": 1.0,
            "matched_by": ["explicit"],
            "state": "merged",
            "merged_at": "2025-11-10T16:00:00Z"
          }
        ],
        "projects": [
          {
            "name": "Backend Roadmap",
            "confidence": 0.75,
            "matched_by": ["semantic"]
          }
        ],
        "milestones": [
          {
            "title": "v2.0.0",
            "confidence": 0.88,
            "matched_by": ["timestamp", "semantic"]
          }
        ]
      }
    }
  ]
}

Error Handling

Graceful Degradation

def safe_github_integration(commits, config):
    try:
        # Check prerequisites
        if not check_gh_cli_installed():
            log_warning("gh CLI not installed. Skipping GitHub integration.")
            return add_empty_github_refs(commits)

        if not check_gh_authenticated():
            log_warning("gh CLI not authenticated. Run: gh auth login")
            return add_empty_github_refs(commits)

        if not detect_github_remote():
            log_info("Not a GitHub repository. Skipping GitHub integration.")
            return add_empty_github_refs(commits)

        # Fetch and match
        artifacts = fetch_github_data(config)
        return match_commits_to_artifacts(commits, artifacts, config)

    except RateLimitError as e:
        log_error(f"GitHub API rate limit exceeded: {e}")
        log_info("Using cached data if available, or skipping integration.")
        return try_use_cache_only(commits)

    except NetworkError as e:
        log_error(f"Network error: {e}")
        return try_use_cache_only(commits)

    except Exception as e:
        log_error(f"Unexpected error in GitHub integration: {e}")
        return add_empty_github_refs(commits)

Integration Points

Input from git-history-analyzer

I receive:

{
  "metadata": {
    "repository": "owner/repo",
    "commit_range": "v2.3.1..HEAD"
  },
  "changes": {
    "added": [
      {
        "summary": "...",
        "commits": ["abc123", "def456"],
        "author": "@dev1"
      }
    ]
  }
}

Output to changelog-synthesizer

I provide:

{
  "metadata": { ... },
  "changes": {
    "added": [
      {
        "summary": "...",
        "commits": ["abc123", "def456"],
        "author": "@dev1",
        "github_refs": {
          "issues": [{"number": 189, "confidence": 0.95}],
          "pull_requests": [{"number": 234, "confidence": 1.0}]
        }
      }
    ]
  }
}

Performance Optimization

Batch Processing

def batch_semantic_similarity(commits, artifacts):
    """
    Process multiple commit-artifact pairs in one AI call for efficiency.
    """

    # Group similar commits
    commit_groups = group_commits_by_similarity(commits)

    # For each group, match against artifacts in batch
    results = []
    for group in commit_groups:
        representative = select_representative(group)
        matches = semantic_similarity_batch(representative, artifacts)

        # Apply results to entire group
        for commit in group:
            results.append(apply_similarity_scores(commit, matches))

    return results

Cache-First Strategy

Check cache first: Always try cache before API calls
Incremental fetch: Only fetch new/updated artifacts since last cache
Lazy loading: Don't fetch projects/milestones unless configured
Smart pre-filtering: Use timestamp filter before expensive semantic matching

Configuration Integration

I respect these config settings from .changelog.yaml:

github_integration:
  enabled: true
  cache_ttl_hours: 24
  time_window_days: 14
  confidence_threshold: 0.85

  fetch:
    issues: true
    pull_requests: true
    projects: true
    milestones: true

  matching:
    explicit_reference: true
    timestamp_correlation: true
    semantic_similarity: true

  scoring:
    timestamp_and_semantic_bonus: 0.15
    timestamp_and_branch_bonus: 0.10
    all_strategies_bonus: 0.20

Invocation Context

I should be invoked:

During /changelog init to initially populate cache and test integration
During /changelog update to enrich new commits with GitHub references
After git-history-analyzer has extracted and grouped commits
Before changelog-synthesizer generates final documentation

Special Capabilities

Preview Mode

During /changelog-init, I provide a preview of matches:

🔍 GitHub Integration Preview

Found 47 commits to match against:
  - 123 issues (45 closed)
  - 56 pull requests (42 merged)
  - 3 projects
  - 5 milestones

Sample matches:
✓ Commit abc123 "Add auth" → Issue #189 (95% confidence)
✓ Commit def456 "Fix login" → PR #234 (100% confidence - explicit)
✓ Commit ghi789 "Update UI" → Issue #201, Project "Q4 Launch" (88% confidence)

Continue with GitHub integration? [Y/n]

Confidence Reporting

Matching Statistics:
  High confidence (>0.90): 12 commits
  Medium confidence (0.70-0.90): 23 commits
  Low confidence (0.60-0.70): 8 commits
  Below threshold (<0.60): 4 commits (excluded)

Total GitHub references added: 47 commits linked to 31 unique artifacts

Security Considerations

Never store GitHub tokens in cache (use gh CLI auth)
Cache only public artifact metadata
Respect rate limits with aggressive caching
Validate repo URLs before fetching
Use HTTPS for all GitHub communications

This agent provides intelligent, multi-strategy GitHub integration that enriches changelog data with minimal API calls through smart caching and efficient matching algorithms.

18 KiB Raw Permalink Blame History