---
name: hook-deduplication-guide
description: Use PROACTIVELY when developing Claude Code hooks to implement content-based deduplication and prevent duplicate insight storage across sessions
---

# Hook Deduplication Guide

## Overview

This skill guides you through implementing robust deduplication for Claude Code hooks, using content-based hashing instead of session-based tracking. It prevents duplicate insights from being stored while still allowing multiple unique insights per session.

**Based on 1 insight**:

- Hook Deduplication Session Management (hooks-and-events, 2025-11-03)

**Key Capabilities**:

- Content-based deduplication using SHA256 hashes
- Session-independent duplicate detection
- Efficient hash storage with rotation
- State management best practices

## When to Use This Skill

**Trigger Phrases**:

- "implement hook deduplication"
- "prevent duplicate insights in hooks"
- "content-based deduplication for hooks"
- "hook state management patterns"

**Use Cases**:

- Developing new Claude Code hooks that store data
- Refactoring hooks to prevent duplicates
- Implementing efficient state management for hooks
- Debugging duplicate data issues in hooks

**Do NOT use when**:

- Creating hooks that don't store data (read-only hooks)
- Session-based deduplication is actually desired
- The hook doesn't run frequently enough to need deduplication

## Response Style

Educational and practical - explain the why behind content-based vs. session-based deduplication, then guide implementation with code examples.

---

## Workflow

### Phase 1: Choose Deduplication Strategy

**Purpose**: Determine whether content-based or session-based deduplication is appropriate.

**Steps**:

1. **Assess hook behavior**:
   - How often does the hook run? (per message, per session, per event)
   - What data is being stored? (insights, logs, metrics)
   - Is the same content likely to appear across sessions?

2. **Evaluate deduplication needs**:
   - **Content-based**: Use when the same insight/data might appear in different sessions
     - Example: an extract-explanatory-insights hook (the same insight might appear in multiple conversations)
   - **Session-based**: Use when duplicates should only be prevented within a session
     - Example: error logging (the same error in different sessions should still be logged)

3. **Recommend strategy**:
   - For insights/lessons-learned: content-based (SHA256 hashing)
   - For session logs/events: session-based (session ID tracking)
   - For unique events: no deduplication needed

**Output**: Clear recommendation on deduplication strategy.

**Common Issues**:

- **Unsure which to use**: Default to content-based for data that's meant to be unique (insights, documentation)
- **Performance concerns**: Content-based hashing is fast (<1ms for typical content)

---

### Phase 2: Implement Content-Based Deduplication

**Purpose**: Set up SHA256 hash-based deduplication with state management.

**Steps**:

1. **Create state directory**:

   ```bash
   mkdir -p ~/.claude/state/hook-state/
   ```

2. **Initialize hash storage file**:

   ```bash
   HASH_FILE="$HOME/.claude/state/hook-state/content-hashes.txt"
   touch "$HASH_FILE"
   ```

3. **Implement hash generation**:

   ```bash
   # Generate SHA256 hash of content
   compute_content_hash() {
     local content="$1"
     echo -n "$content" | sha256sum | awk '{print $1}'
   }
   ```
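
   To avoid missed duplicates caused purely by formatting differences, you can normalize content before hashing. This is a minimal sketch; the exact trimming and whitespace-collapsing rules (and the `normalize_content` name) are assumptions, so adjust them to your data:

   ```bash
   # Collapse all runs of whitespace to single spaces and trim the ends,
   # so cosmetic formatting differences produce the same hash
   normalize_content() {
     local content="$1"
     echo "$content" | tr -s '[:space:]' ' ' | sed 's/^ *//; s/ *$//'
   }

   # Hash the normalized form instead of the raw content
   content_hash=$(compute_content_hash "$(normalize_content "$content")")
   ```

   If you adopt normalization, apply it consistently in both the duplicate check and the hash-storage step.
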
4. **Check for duplicates**:

   ```bash
   # Returns 0 (success) if content is a duplicate, 1 if it is new
   is_duplicate() {
     local content="$1"
     local content_hash
     content_hash=$(compute_content_hash "$content")

     if grep -Fxq "$content_hash" "$HASH_FILE"; then
       return 0  # Duplicate found
     else
       return 1  # New content
     fi
   }
   ```

5. **Store hash after processing**:

   ```bash
   store_content_hash() {
     local content="$1"
     local content_hash
     content_hash=$(compute_content_hash "$content")
     echo "$content_hash" >> "$HASH_FILE"
   }
   ```

6. **Integrate into hook**:

   ```bash
   # In your hook script
   content="extracted insight or data"

   if is_duplicate "$content"; then
     # Skip - duplicate content
     echo "Duplicate detected, skipping..." >&2
     exit 0
   fi

   # Process new content
   process_content "$content"

   # Store hash to prevent future duplicates
   store_content_hash "$content"
   ```

**Output**: Working content-based deduplication in your hook.

**Common Issues**:

- **Hash file grows too large**: Implement rotation (see Phase 3)
- **Missed duplicates (false negatives)**: Normalize content (whitespace, formatting) before hashing

---

### Phase 3: Implement Hash Rotation

**Purpose**: Prevent the hash file from growing indefinitely.

**Steps**:

1. **Set rotation limit**:

   ```bash
   MAX_HASHES=10000  # Keep last 10,000 hashes
   ```

2. **Implement rotation logic**:

   ```bash
   rotate_hash_file() {
     local hash_file="$1"
     local max_hashes="${2:-10000}"

     # Count current hashes
     local current_count
     current_count=$(wc -l < "$hash_file")

     # Rotate if needed
     if [ "$current_count" -gt "$max_hashes" ]; then
       tail -n "$max_hashes" "$hash_file" > "${hash_file}.tmp"
       mv "${hash_file}.tmp" "$hash_file"
       echo "Rotated hash file: kept last $max_hashes hashes" >&2
     fi
   }
   ```

3. **Call rotation after storing**:

   ```bash
   # After storing new hash
   store_content_hash "$content"
   rotate_hash_file "$HASH_FILE" 10000
   ```

**Output**: Self-maintaining hash storage with bounded size.

**Common Issues**:

- **Rotation too aggressive**: Increase MAX_HASHES
- **Rotation too infrequent**: If rotation is only called occasionally, the file can overshoot the limit; call it after every append as shown above

---

### Phase 4: Testing and Validation

**Purpose**: Verify deduplication works correctly.

**Steps**:

1. **Test duplicate detection**:

   ```bash
   # First run - should process
   echo "Test insight" | your_hook.sh
   # Check: Content was processed

   # Second run - should skip
   echo "Test insight" | your_hook.sh
   # Check: Duplicate detected message
   ```

2. **Test multiple unique items**:

   ```bash
   echo "Insight 1" | your_hook.sh  # Processed
   echo "Insight 2" | your_hook.sh  # Processed
   echo "Insight 3" | your_hook.sh  # Processed
   echo "Insight 1" | your_hook.sh  # Skipped (duplicate)
   ```

3. **Verify hash file**:

   ```bash
   cat ~/.claude/state/hook-state/content-hashes.txt
   # Should show 3 unique hashes (not 4)
   ```

4. **Test rotation**:

   ```bash
   # Generate more than MAX_HASHES entries
   for i in {1..10500}; do
     echo "Insight $i" | your_hook.sh
   done

   # Verify file size bounded
   wc -l ~/.claude/state/hook-state/content-hashes.txt
   # Should be ~10000, not 10500
   ```

**Output**: Confirmed working deduplication with proper rotation.
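
For contrast with the content-based approach above, the session-based strategy from Phase 1 scopes deduplication to a single session: the same content is skipped within one session but can be stored again in a later one. A minimal sketch, assuming the hook already has the session identifier in a `SESSION_ID` variable and reusing the helpers defined in Phase 2 (the variable and the `session-hashes.txt` path are illustrative assumptions, not a documented interface):

```bash
# Session-scoped deduplication (sketch): record "session-id content-hash"
# pairs so duplicates are only suppressed within the same session.
SESSION_HASH_FILE="$HOME/.claude/state/hook-state/session-hashes.txt"
touch "$SESSION_HASH_FILE"

entry="$SESSION_ID $(compute_content_hash "$content")"

if grep -Fxq "$entry" "$SESSION_HASH_FILE"; then
  echo "Already stored in this session, skipping..." >&2
  exit 0
fi

process_content "$content"
echo "$entry" >> "$SESSION_HASH_FILE"
```

As the warnings below note, keying on the session ID means the same insight arriving in two different sessions is stored twice, which is why content-based hashing is preferred for insights and documentation.
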
---

## Reference Materials

- [Original Insight](data/insights-reference.md) - Full context on hook deduplication patterns

---

## Important Reminders

- **Use content-based deduplication for insights/documentation** - prevents duplicates across sessions
- **Use session-based deduplication for logs/events** - the same event in different sessions is meaningful
- **Normalize content before hashing** - whitespace differences shouldn't create false negatives
- **Implement rotation** - prevent unbounded hash file growth
- **Hash storage location**: `~/.claude/state/hook-state/` (not project-specific)
- **SHA256 is fast** - no performance concerns for typical hook data
- **Test both paths** - verify both new content and duplicates work correctly

**Warnings**:

- ⚠️ **Do not use session ID alone** - it blocks the same insight from being stored when it recurs in a different session
- ⚠️ **Do not skip rotation** - the hash file will grow indefinitely
- ⚠️ **Do not hash unnormalized content** - formatting changes will cause false negatives (missed duplicates)

---

## Best Practices

1. **Choose the Right Strategy**: Content-based for unique data, session-based for session-specific events
2. **Normalize Before Hashing**: Strip whitespace, lowercase if appropriate, use consistent formatting
3. **Efficient Storage**: Use `grep -Fxq` for fast hash lookups (fixed-string, whole-line match, quiet)
4. **Bounded Growth**: Implement rotation to prevent file bloat
5. **Clear Logging**: Log when duplicates are detected for debugging
6. **State Location**: Use `~/.claude/state/hook-state/` for cross-project state

---

## Troubleshooting

### Duplicates not being detected

**Symptoms**: Same content processed multiple times

**Solution**:
1. Check the hash file exists and is writable
2. Verify store_content_hash is called after processing
3. Check content normalization (whitespace differences)
4. Verify the grep command uses the -Fxq flags

**Prevention**: Test deduplication immediately after implementation

---

### Hash file growing too large

**Symptoms**: Hash file exceeds MAX_HASHES significantly

**Solution**:
1. Verify rotate_hash_file is called
2. Check the MAX_HASHES value is reasonable
3. Manually rotate if needed: `tail -n 10000 hashes.txt > hashes.tmp && mv hashes.tmp hashes.txt`

**Prevention**: Call rotation after every hash storage

---

### False positives (new content marked as duplicate)

**Symptoms**: Different content being skipped

**Solution**:
1. Check for hash collisions (extremely unlikely with SHA256)
2. Verify the content is actually different
3. Check normalization isn't too aggressive
4. Review recent hashes in the file

**Prevention**: Use consistent normalization, test with diverse content

---

## Next Steps

After implementing deduplication:

1. Monitor hash file growth over time
2. Tune MAX_HASHES based on usage patterns
3. Consider adding metrics (duplicates prevented, storage size)
4. Share the pattern with the team for other hooks

---

## Metadata

**Source Insights**:
- Session: abc123-session-id
- Date: 2025-11-03
- Category: hooks-and-events
- File: docs/lessons-learned/hooks-and-events/2025-11-03-hook-deduplication.md

**Skill Version**: 0.1.0
**Generated**: 2025-11-16
**Last Updated**: 2025-11-16