name: hook-deduplication-guide
description: Use PROACTIVELY when developing Claude Code hooks to implement content-based deduplication and prevent duplicate insight storage across sessions

Hook Deduplication Guide

Overview

This skill guides you through implementing robust deduplication for Claude Code hooks, using content-based hashing instead of session-based tracking. It prevents duplicate insights from being stored while still allowing multiple unique insights per session.

Based on 1 insight:

  • Hook Deduplication Session Management (hooks-and-events, 2025-11-03)

Key Capabilities:

  • Content-based deduplication using SHA256 hashes
  • Session-independent duplicate detection
  • Efficient hash storage with rotation
  • State management best practices

When to Use This Skill

Trigger Phrases:

  • "implement hook deduplication"
  • "prevent duplicate insights in hooks"
  • "content-based deduplication for hooks"
  • "hook state management patterns"

Use Cases:

  • Developing new Claude Code hooks that store data
  • Refactoring hooks to prevent duplicates
  • Implementing efficient state management for hooks
  • Debugging duplicate data issues in hooks

Do NOT use when:

  • Creating hooks that don't store data (read-only hooks)
  • Session-based deduplication is actually desired
  • Hook doesn't run frequently enough to need deduplication

Response Style

Educational and practical - explain the why behind content-based vs. session-based deduplication, then guide implementation with code examples.


Workflow

Phase 1: Choose Deduplication Strategy

Purpose: Determine whether content-based or session-based deduplication is appropriate.

Steps:

  1. Assess hook behavior:

    • How often does the hook run? (per message, per session, per event)
    • What data is being stored? (insights, logs, metrics)
    • Is the same content likely to appear across sessions?
  2. Evaluate deduplication needs:

    • Content-based: Use when the same insight/data might appear in different sessions
      • Example: Extract-explanatory-insights hook (same insight might appear in multiple conversations)
    • Session-based: Use when duplicates should only be prevented within a session (a minimal sketch appears at the end of this phase)
      • Example: Error logging (the same error occurring in different sessions should still be logged)
  3. Recommend strategy:

    • For insights/lessons-learned: Content-based (SHA256 hashing)
    • For session logs/events: Session-based (session ID tracking)
    • For unique events: No deduplication needed

Output: Clear recommendation on deduplication strategy.

Common Issues:

  • Unsure which to use: Default to content-based for data that's meant to be unique (insights, documentation)
  • Performance concerns: Content-based hashing is fast (<1ms for typical content)
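
For the session-based strategy above, a minimal sketch is shown below. It assumes the hook receives a session identifier in a SESSION_ID environment variable, which is an assumption of this example - adapt it to however your hook actually obtains the session ID:

    # Session-based deduplication (sketch): skip work already done this session
    SEEN_FILE="$HOME/.claude/state/hook-state/seen-sessions.txt"
    touch "$SEEN_FILE"

    # Returns 0 (success) if this session was already processed
    seen_this_session() {
      grep -Fxq "$SESSION_ID" "$SEEN_FILE"
    }

    mark_session_seen() {
      echo "$SESSION_ID" >> "$SEEN_FILE"
    }

The trade-off is visible in the storage key: the same content arriving in a new session passes the check and gets stored again, which is exactly why insights use the content-based approach in Phase 2.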

Phase 2: Implement Content-Based Deduplication

Purpose: Set up SHA256 hash-based deduplication with state management.

Steps:

  1. Create state directory:

    mkdir -p ~/.claude/state/hook-state/
    
  2. Initialize hash storage file:

    HASH_FILE="$HOME/.claude/state/hook-state/content-hashes.txt"
    touch "$HASH_FILE"
    
  3. Implement hash generation:

    # Generate SHA256 hash of content
    compute_content_hash() {
      local content="$1"
      echo -n "$content" | sha256sum | awk '{print $1}'
    }
    
  4. Check for duplicates:

    # Returns 0 (success) if content is a duplicate, 1 if it is new,
    # so `if is_duplicate "$content"` reads naturally in the caller
    is_duplicate() {
      local content="$1"
      local content_hash
      content_hash=$(compute_content_hash "$content")

      if grep -Fxq "$content_hash" "$HASH_FILE"; then
        return 0  # Duplicate found
      else
        return 1  # New content
      fi
    }
    
  5. Store hash after processing:

    store_content_hash() {
      local content="$1"
      local content_hash
      content_hash=$(compute_content_hash "$content")
      echo "$content_hash" >> "$HASH_FILE"
    }
    
  6. Integrate into hook:

    # In your hook script
    content="extracted insight or data"
    
    if is_duplicate "$content"; then
      # Skip - duplicate content
      echo "Duplicate detected, skipping..." >&2
      exit 0
    fi
    
    # Process new content
    process_content "$content"
    
    # Store hash to prevent future duplicates
    store_content_hash "$content"
    

Output: Working content-based deduplication in your hook.
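
Putting the pieces together, a complete minimal hook might look like the sketch below. It assumes the hook receives its content on stdin (matching the tests in Phase 4); the insights.log destination is a placeholder for your real storage logic:

    #!/usr/bin/env bash
    # your_hook.sh (sketch): content-based deduplication end to end
    set -euo pipefail

    HASH_FILE="$HOME/.claude/state/hook-state/content-hashes.txt"
    mkdir -p "$(dirname "$HASH_FILE")"
    touch "$HASH_FILE"

    # Read the content to store from stdin
    content=$(cat)

    content_hash=$(echo -n "$content" | sha256sum | awk '{print $1}')

    if grep -Fxq "$content_hash" "$HASH_FILE"; then
      echo "Duplicate detected, skipping..." >&2
      exit 0
    fi

    # Placeholder storage: swap in your real processing here
    printf '%s\n' "$content" >> "$HOME/.claude/state/hook-state/insights.log"

    # Record the hash only after storage succeeds
    echo "$content_hash" >> "$HASH_FILE"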

Common Issues:

  • Hash file grows too large: Implement rotation (see Phase 3)
  • Missed duplicates (false negatives): Normalize content before hashing so whitespace or formatting differences do not change the hash (see the sketch below)
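
The false-negative issue above is worth a concrete helper. A minimal sketch, assuming whitespace is the only variation you need to absorb (add lowercasing only if case differences should count as the same content):

    # Collapse whitespace runs to single spaces and trim both ends
    normalize_content() {
      echo "$1" | tr -s '[:space:]' ' ' | sed -e 's/^ *//' -e 's/ *$//'
    }

    # Hash the normalized form so formatting changes cannot defeat deduplication
    content_hash=$(compute_content_hash "$(normalize_content "$content")")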

Phase 3: Implement Hash Rotation

Purpose: Prevent hash file from growing indefinitely.

Steps:

  1. Set rotation limit:

    MAX_HASHES=10000  # Keep last 10,000 hashes
    
  2. Implement rotation logic:

    rotate_hash_file() {
      local hash_file="$1"
      local max_hashes="${2:-10000}"

      # Count current hashes
      local current_count
      current_count=$(wc -l < "$hash_file")

      # Rotate if needed: keep only the most recent max_hashes entries
      if [ "$current_count" -gt "$max_hashes" ]; then
        tail -n "$max_hashes" "$hash_file" > "${hash_file}.tmp"
        mv "${hash_file}.tmp" "$hash_file"
        echo "Rotated hash file: kept last $max_hashes hashes" >&2
      fi
    }
    
  3. Call rotation periodically:

    # After storing new hash
    store_content_hash "$content"
    rotate_hash_file "$HASH_FILE" 10000
    

Output: Self-maintaining hash storage with bounded size.

Common Issues:

  • Rotation too aggressive: Increase MAX_HASHES
  • Rotation too infrequent: Check the count on every append by calling rotate_hash_file right after store_content_hash, as in step 3

Phase 4: Testing and Validation

Purpose: Verify deduplication works correctly.

Steps:

  1. Test duplicate detection:

    # First run - should process
    echo "Test insight" | your_hook.sh
    # Check: Content was processed
    
    # Second run - should skip
    echo "Test insight" | your_hook.sh
    # Check: Duplicate detected message
    
  2. Test multiple unique items:

    echo "Insight 1" | your_hook.sh  # Processed
    echo "Insight 2" | your_hook.sh  # Processed
    echo "Insight 3" | your_hook.sh  # Processed
    echo "Insight 1" | your_hook.sh  # Skipped (duplicate)
    
  3. Verify hash file:

    cat ~/.claude/state/hook-state/content-hashes.txt
    # Should show 3 unique hashes (not 4)
    
  4. Test rotation:

    # Generate more than MAX_HASHES entries
    for i in {1..10500}; do
      echo "Insight $i" | your_hook.sh
    done
    
    # Verify file size bounded
    wc -l ~/.claude/state/hook-state/content-hashes.txt
    # Should be ~10000, not 10500
    

Output: Confirmed working deduplication with proper rotation.


Important Reminders

  • Use content-based deduplication for insights/documentation - prevents duplicates across sessions
  • Use session-based deduplication for logs/events - same event in different sessions is meaningful
  • Normalize content before hashing - whitespace differences shouldn't create false negatives
  • Implement rotation - prevent unbounded hash file growth
  • Hash storage location: ~/.claude/state/hook-state/ (not project-specific)
  • SHA256 is fast - no performance concerns for typical hook data
  • Test both paths - verify both new content and duplicates work correctly

Warnings:

  • ⚠️ Do not key deduplication on session ID alone - it blocks additional unique insights within the same session, while a recurring insight in a new session is stored again as a duplicate
  • ⚠️ Do not skip rotation - hash file will grow indefinitely
  • ⚠️ Do not hash before normalization - formatting changes will cause false negatives

Best Practices

  1. Choose the Right Strategy: Content-based for unique data, session-based for session-specific events
  2. Normalize Before Hashing: Strip whitespace, lowercase if appropriate, consistent formatting
  3. Efficient Storage: Use grep -Fxq for fast hash lookups (fixed-string, line-match, quiet)
  4. Bounded Growth: Implement rotation to prevent file bloat
  5. Clear Logging: Log when duplicates are detected for debugging (see the sketch after this list)
  6. State Location: Use ~/.claude/state/hook-state/ for cross-project state
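
For practice 5, a small helper keeps duplicate detections visible without polluting stdout. A sketch, with the log path being an assumption of this example:

    HOOK_LOG="$HOME/.claude/state/hook-state/hook.log"

    # Timestamped message to the log file and stderr (stdout stays clean)
    log_event() {
      echo "$(date '+%Y-%m-%d %H:%M:%S') $1" | tee -a "$HOOK_LOG" >&2
    }

    # Example: call this where the duplicate branch exits
    log_event "duplicate skipped: hash=$content_hash"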

Troubleshooting

Duplicates not being detected

Symptoms: Same content processed multiple times

Solution:

  1. Check hash file exists and is writable
  2. Verify store_content_hash is called after processing
  3. Check content normalization (whitespace differences)
  4. Verify grep command uses -Fxq flags

Prevention: Test deduplication immediately after implementation


Hash file growing too large

Symptoms: Hash file exceeds MAX_HASHES significantly

Solution:

  1. Verify rotate_hash_file is called
  2. Check MAX_HASHES value is reasonable
  3. Manually rotate if needed: tail -n 10000 hashes.txt > hashes.tmp && mv hashes.tmp hashes.txt

Prevention: Call rotation after every hash storage


False positives (new content marked as duplicate)

Symptoms: Different content being skipped

Solution:

  1. Check for hash collisions (extremely unlikely with SHA256)
  2. Verify content is actually different
  3. Check normalization isn't too aggressive
  4. Review recent hashes in file

Prevention: Use consistent normalization, test with diverse content


Next Steps

After implementing deduplication:

  1. Monitor hash file growth over time
  2. Tune MAX_HASHES based on usage patterns
  3. Consider adding metrics such as duplicates prevented and storage size (a sketch follows this list)
  4. Share pattern with team for other hooks
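
For step 3, a plain counter file is often enough to start. A sketch; the metrics path and line format are assumptions of this example:

    METRICS_FILE="$HOME/.claude/state/hook-state/metrics.txt"

    # Append one line per prevented duplicate
    record_duplicate_prevented() {
      echo "$(date '+%Y-%m-%d') duplicate-prevented" >> "$METRICS_FILE"
    }

    # Summarize: duplicates prevented so far and current hash store size
    report_metrics() {
      touch "$METRICS_FILE"
      echo "Duplicates prevented: $(grep -c 'duplicate-prevented' "$METRICS_FILE")"
      echo "Hash store size: $(wc -l < "$HASH_FILE") hashes"
    }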

Metadata

Source Insights:

  • Session: abc123-session-id
  • Date: 2025-11-03
  • Category: hooks-and-events
  • File: docs/lessons-learned/hooks-and-events/2025-11-03-hook-deduplication.md

Skill Version: 0.1.0
Generated: 2025-11-16
Last Updated: 2025-11-16