| name | description |
|---|---|
| hook-deduplication-guide | Use PROACTIVELY when developing Claude Code hooks to implement content-based deduplication and prevent duplicate insight storage across sessions |
# Hook Deduplication Guide

## Overview

This skill guides you through implementing robust deduplication for Claude Code hooks, using content-based hashing instead of session-based tracking. It prevents duplicate insights from being stored while still allowing multiple unique insights per session.
**Based on 1 insight:**

- Hook Deduplication Session Management (hooks-and-events, 2025-11-03)

**Key Capabilities:**
- Content-based deduplication using SHA256 hashes
- Session-independent duplicate detection
- Efficient hash storage with rotation
- State management best practices
## When to Use This Skill

**Trigger Phrases:**
- "implement hook deduplication"
- "prevent duplicate insights in hooks"
- "content-based deduplication for hooks"
- "hook state management patterns"
**Use Cases:**
- Developing new Claude Code hooks that store data
- Refactoring hooks to prevent duplicates
- Implementing efficient state management for hooks
- Debugging duplicate data issues in hooks
**Do NOT use when:**
- Creating hooks that don't store data (read-only hooks)
- Session-based deduplication is actually desired
- Hook doesn't run frequently enough to need deduplication
## Response Style
Educational and practical - explain the why behind content-based vs. session-based deduplication, then guide implementation with code examples.
## Workflow

### Phase 1: Choose Deduplication Strategy

**Purpose:** Determine whether content-based or session-based deduplication is appropriate.
**Steps:**

1. **Assess hook behavior:**
   - How often does the hook run? (per message, per session, per event)
   - What data is being stored? (insights, logs, metrics)
   - Is the same content likely to appear across sessions?

2. **Evaluate deduplication needs:**
   - Content-based: Use when the same insight/data might appear in different sessions
     - Example: Extract-explanatory-insights hook (same insight might appear in multiple conversations)
   - Session-based: Use when duplicates should only be prevented within a session (see the sketch at the end of this phase)
     - Example: Error logging (same error in different sessions should be logged)

3. **Recommend strategy:**
   - For insights/lessons-learned: Content-based (SHA256 hashing)
   - For session logs/events: Session-based (session ID tracking)
   - For unique events: No deduplication needed
**Output:** Clear recommendation on deduplication strategy.

**Common Issues:**

- Unsure which to use: Default to content-based for data that's meant to be unique (insights, documentation)
- Performance concerns: Content-based hashing is fast (<1ms for typical content)
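For contrast with the content-based approach implemented below, a minimal session-based variant might look like this. This is a sketch only; the `processed-sessions.txt` file name and helper names are assumptions, not part of the original insight:

```bash
#!/usr/bin/env bash
# Session-based deduplication sketch (hypothetical): marks a session as
# "already processed" so the hook's work happens at most once per session.
STATE_DIR="$HOME/.claude/state/hook-state"
mkdir -p "$STATE_DIR"

# session_already_processed <session_id>
# Returns 0 if this session was seen before, 1 otherwise.
session_already_processed() {
  local session_id="$1"
  grep -Fxq "$session_id" "$STATE_DIR/processed-sessions.txt" 2>/dev/null
}

# mark_session_processed <session_id>
mark_session_processed() {
  local session_id="$1"
  echo "$session_id" >> "$STATE_DIR/processed-sessions.txt"
}
```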
### Phase 2: Implement Content-Based Deduplication

**Purpose:** Set up SHA256 hash-based deduplication with state management.

**Steps:**
1. **Create state directory:**

   ```bash
   mkdir -p ~/.claude/state/hook-state/
   ```

2. **Initialize hash storage file:**

   ```bash
   HASH_FILE="$HOME/.claude/state/hook-state/content-hashes.txt"
   touch "$HASH_FILE"
   ```

3. **Implement hash generation:**

   ```bash
   # Generate SHA256 hash of content
   compute_content_hash() {
     local content="$1"
     echo -n "$content" | sha256sum | awk '{print $1}'
   }
   ```

4. **Check for duplicates:**

   ```bash
   # Returns 0 (true) if the content is a duplicate, 1 (false) if it is new
   is_duplicate() {
     local content="$1"
     local content_hash
     content_hash=$(compute_content_hash "$content")
     if grep -Fxq "$content_hash" "$HASH_FILE"; then
       return 0  # Duplicate found
     else
       return 1  # New content
     fi
   }
   ```

5. **Store hash after processing:**

   ```bash
   store_content_hash() {
     local content="$1"
     local content_hash
     content_hash=$(compute_content_hash "$content")
     echo "$content_hash" >> "$HASH_FILE"
   }
   ```

6. **Integrate into hook:**

   ```bash
   # In your hook script
   content="extracted insight or data"

   if is_duplicate "$content"; then
     # Skip - duplicate content
     echo "Duplicate detected, skipping..." >&2
     exit 0
   fi

   # Process new content
   process_content "$content"

   # Store hash to prevent future duplicates
   store_content_hash "$content"
   ```
**Output:** Working content-based deduplication in your hook.

**Common Issues:**

- Hash file grows too large: Implement rotation (see Phase 3)
- False negatives (missed duplicates): Normalize content before hashing (whitespace, formatting); see the sketch below
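If you need a starting point for normalization, the following is a minimal sketch; the specific rules (collapse whitespace runs, trim ends) are assumptions to adapt to your content:

```bash
# Normalize content before hashing so trivial formatting differences
# do not produce different hashes. The rules here are assumptions:
# collapse any run of whitespace to a single space, then trim the ends.
normalize_content() {
  local content="$1"
  echo "$content" | tr -s '[:space:]' ' ' | sed 's/^ *//; s/ *$//'
}

# Example: hash the normalized form instead of the raw content
# content_hash=$(compute_content_hash "$(normalize_content "$content")")
```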
### Phase 3: Implement Hash Rotation

**Purpose:** Prevent hash file from growing indefinitely.

**Steps:**
1. **Set rotation limit:**

   ```bash
   MAX_HASHES=10000  # Keep last 10,000 hashes
   ```

2. **Implement rotation logic:**

   ```bash
   rotate_hash_file() {
     local hash_file="$1"
     local max_hashes="${2:-10000}"

     # Count current hashes
     local current_count
     current_count=$(wc -l < "$hash_file")

     # Rotate if needed
     if [ "$current_count" -gt "$max_hashes" ]; then
       tail -n "$max_hashes" "$hash_file" > "${hash_file}.tmp"
       mv "${hash_file}.tmp" "$hash_file"
       echo "Rotated hash file: kept last $max_hashes hashes" >&2
     fi
   }
   ```

3. **Call rotation periodically:**

   ```bash
   # After storing new hash
   store_content_hash "$content"
   rotate_hash_file "$HASH_FILE" 10000
   ```
**Output:** Self-maintaining hash storage with bounded size.

**Common Issues:**

- Rotation too aggressive: Increase MAX_HASHES
- Rotation too infrequent: Consider checking the count before every append, as in the wrapper sketched below
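To keep the count bounded on every write, a small wrapper can store the hash and rotate in one call. This is a sketch; `store_and_rotate` is a hypothetical helper built from the Phase 2 and Phase 3 functions:

```bash
# Hypothetical convenience wrapper: store the hash, then immediately
# enforce the size bound so the file never drifts far past MAX_HASHES.
store_and_rotate() {
  local content="$1"
  store_content_hash "$content"
  rotate_hash_file "$HASH_FILE" "${MAX_HASHES:-10000}"
}

# Usage in the hook, replacing the separate store/rotate calls:
# store_and_rotate "$content"
```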
### Phase 4: Testing and Validation

**Purpose:** Verify deduplication works correctly.

**Steps:**
1. **Test duplicate detection:**

   ```bash
   # First run - should process
   echo "Test insight" | your_hook.sh
   # Check: Content was processed

   # Second run - should skip
   echo "Test insight" | your_hook.sh
   # Check: Duplicate detected message
   ```

2. **Test multiple unique items:**

   ```bash
   echo "Insight 1" | your_hook.sh  # Processed
   echo "Insight 2" | your_hook.sh  # Processed
   echo "Insight 3" | your_hook.sh  # Processed
   echo "Insight 1" | your_hook.sh  # Skipped (duplicate)
   ```

3. **Verify hash file:**

   ```bash
   cat ~/.claude/state/hook-state/content-hashes.txt
   # Should show 3 unique hashes (not 4)
   ```

4. **Test rotation:**

   ```bash
   # Generate more than MAX_HASHES entries
   for i in {1..10500}; do
     echo "Insight $i" | your_hook.sh
   done

   # Verify file size stays bounded
   wc -l ~/.claude/state/hook-state/content-hashes.txt
   # Should be ~10000, not 10500
   ```
**Output:** Confirmed working deduplication with proper rotation.
## Reference Materials

- Original Insight - Full context on hook deduplication patterns

## Important Reminders
- Use content-based deduplication for insights/documentation - prevents duplicates across sessions
- Use session-based deduplication for logs/events - same event in different sessions is meaningful
- Normalize content before hashing - whitespace differences shouldn't create false negatives
- Implement rotation - prevent unbounded hash file growth
- Hash storage location: `~/.claude/state/hook-state/` (not project-specific)
- SHA256 is fast - no performance concerns for typical hook data
- Test both paths - verify both new content and duplicates work correctly
**Warnings:**
- ⚠️ Do not use session ID alone - prevents same insight in different sessions from being stored
- ⚠️ Do not skip rotation - hash file will grow indefinitely
- ⚠️ Do not hash before normalization - formatting changes will cause false negatives
## Best Practices
- Choose the Right Strategy: Content-based for unique data, session-based for session-specific events
- Normalize Before Hashing: Strip whitespace, lowercase if appropriate, consistent formatting
- Efficient Storage: Use grep -Fxq for fast hash lookups (fixed-string, line-match, quiet)
- Bounded Growth: Implement rotation to prevent file bloat
- Clear Logging: Log when duplicates are detected for debugging
- State Location: Use ~/.claude/state/hook-state/ for cross-project state
## Troubleshooting

### Duplicates not being detected

**Symptoms:** Same content processed multiple times

**Solution:**
- Check hash file exists and is writable
- Verify store_content_hash is called after processing
- Check content normalization (whitespace differences)
- Verify grep command uses -Fxq flags
**Prevention:** Test deduplication immediately after implementation
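As a quick manual check (a sketch reusing the Phase 2 file location and hashing command), compute the hash of the suspect content and look for it in the hash file:

```bash
# Manually compute the hash of the suspect content and look for it in the
# hash file; if it is absent, the hash was never stored, or the content
# differs (e.g. whitespace) from what was hashed earlier.
HASH_FILE="$HOME/.claude/state/hook-state/content-hashes.txt"
content="the insight text you expected to be deduplicated"
hash=$(echo -n "$content" | sha256sum | awk '{print $1}')
grep -Fxq "$hash" "$HASH_FILE" && echo "hash present" || echo "hash missing"
```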
### Hash file growing too large

**Symptoms:** Hash file exceeds MAX_HASHES significantly

**Solution:**
- Verify rotate_hash_file is called
- Check MAX_HASHES value is reasonable
- Manually rotate if needed:

  ```bash
  tail -n 10000 hashes.txt > hashes.tmp && mv hashes.tmp hashes.txt
  ```

**Prevention:** Call rotation after every hash storage
### False positives (new content marked as duplicate)

**Symptoms:** Different content being skipped

**Solution:**
- Check for hash collisions (extremely unlikely with SHA256)
- Verify content is actually different
- Check normalization isn't too aggressive
- Review recent hashes in file
**Prevention:** Use consistent normalization, test with diverse content

## Next Steps
After implementing deduplication:
- Monitor hash file growth over time
- Tune MAX_HASHES based on usage patterns
- Consider adding metrics (duplicates prevented, storage size)
- Share pattern with team for other hooks
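For the metrics suggestion above, one lightweight option (a sketch; the counter file name is an assumption) is a plain counter file next to the hash file:

```bash
# Increment a simple "duplicates prevented" counter (hypothetical file name).
METRICS_FILE="$HOME/.claude/state/hook-state/dedup-metrics.txt"
record_duplicate_prevented() {
  local count=0
  [ -f "$METRICS_FILE" ] && count=$(cat "$METRICS_FILE")
  echo $((count + 1)) > "$METRICS_FILE"
}
```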
## Metadata

**Source Insights:**

- Session: abc123-session-id
- Date: 2025-11-03
- Category: hooks-and-events
- File: docs/lessons-learned/hooks-and-events/2025-11-03-hook-deduplication.md

Skill Version: 0.1.0
Generated: 2025-11-16
Last Updated: 2025-11-16