---
name: hook-deduplication-guide
description: Use PROACTIVELY when developing Claude Code hooks to implement content-based deduplication and prevent duplicate insight storage across sessions
---
# Hook Deduplication Guide
## Overview
This skill guides you through implementing robust deduplication for Claude Code hooks, using content-based hashing instead of session-based tracking. It prevents duplicate insights from being stored while still allowing multiple unique insights per session.
**Based on 1 insight**:
- Hook Deduplication Session Management (hooks-and-events, 2025-11-03)
**Key Capabilities**:
- Content-based deduplication using SHA256 hashes
- Session-independent duplicate detection
- Efficient hash storage with rotation
- State management best practices
## When to Use This Skill
**Trigger Phrases**:
- "implement hook deduplication"
- "prevent duplicate insights in hooks"
- "content-based deduplication for hooks"
- "hook state management patterns"
**Use Cases**:
- Developing new Claude Code hooks that store data
- Refactoring hooks to prevent duplicates
- Implementing efficient state management for hooks
- Debugging duplicate data issues in hooks
**Do NOT use when**:
- Creating hooks that don't store data (read-only hooks)
- Session-based deduplication is actually desired
- Hook doesn't run frequently enough to need deduplication
## Response Style
Educational and practical - explain the why behind content-based vs. session-based deduplication, then guide implementation with code examples.
---
## Workflow
### Phase 1: Choose Deduplication Strategy
**Purpose**: Determine whether content-based or session-based deduplication is appropriate.
**Steps**:
1. **Assess hook behavior**:
- How often does the hook run? (per message, per session, per event)
- What data is being stored? (insights, logs, metrics)
- Is the same content likely to appear across sessions?
2. **Evaluate deduplication needs**:
- **Content-based**: Use when the same insight/data might appear in different sessions
- Example: Extract-explanatory-insights hook (same insight might appear in multiple conversations)
- **Session-based**: Use when duplicates should only be prevented within a single session (a minimal sketch appears at the end of this phase)
- Example: Error logging (same error in different sessions should be logged)
3. **Recommend strategy**:
- For insights/lessons-learned: Content-based (SHA256 hashing)
- For session logs/events: Session-based (session ID tracking)
- For unique events: No deduplication needed
**Output**: Clear recommendation on deduplication strategy.
**Common Issues**:
- **Unsure which to use**: Default to content-based for data that's meant to be unique (insights, documentation)
- **Performance concerns**: Content-based hashing is fast (<1ms for typical content)
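For contrast with the content-based pattern implemented in Phase 2, a session-scoped check can include the session ID in the hash key, so the same content reappearing in a new session produces a new key and is processed again. The sketch below assumes the hook already has the session ID available (for example from its JSON input); the file path and function names are illustrative, not a fixed API.
```bash
# Session-based deduplication sketch: duplicates are only suppressed within a session.
SESSION_HASH_FILE="$HOME/.claude/state/hook-state/session-hashes.txt"
mkdir -p "$(dirname "$SESSION_HASH_FILE")" && touch "$SESSION_HASH_FILE"

# Returns 0 if this (session, content) pair was already handled, 1 otherwise.
# Records the pair on first sight, so a second call in the same session skips it.
seen_in_session() {
    local session_id="$1" content="$2"
    local key
    key=$(echo -n "${session_id}:${content}" | sha256sum | awk '{print $1}')
    if grep -Fxq "$key" "$SESSION_HASH_FILE"; then
        return 0
    fi
    echo "$key" >> "$SESSION_HASH_FILE"
    return 1
}
```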
---
### Phase 2: Implement Content-Based Deduplication
**Purpose**: Set up SHA256 hash-based deduplication with state management.
**Steps**:
1. **Create state directory**:
```bash
mkdir -p ~/.claude/state/hook-state/
```
2. **Initialize hash storage file**:
```bash
HASH_FILE="$HOME/.claude/state/hook-state/content-hashes.txt"
touch "$HASH_FILE"
```
3. **Implement hash generation**:
```bash
# Generate SHA256 hash of content
compute_content_hash() {
local content="$1"
echo -n "$content" | sha256sum | awk '{print $1}'
}
```
4. **Check for duplicates**:
```bash
# Returns 0 (success) if content is a duplicate, 1 if content is new,
# so it can be used directly as: if is_duplicate "$content"; then ... skip ...; fi
is_duplicate() {
    local content="$1"
    local content_hash
    content_hash=$(compute_content_hash "$content")
    if grep -Fxq "$content_hash" "$HASH_FILE"; then
        return 0  # Duplicate found
    else
        return 1  # New content
    fi
}
```
5. **Store hash after processing**:
```bash
store_content_hash() {
local content="$1"
local content_hash=$(compute_content_hash "$content")
echo "$content_hash" >> "$HASH_FILE"
}
```
6. **Integrate into hook**:
```bash
# In your hook script
content="extracted insight or data"
if is_duplicate "$content"; then
# Skip - duplicate content
echo "Duplicate detected, skipping..." >&2
exit 0
fi
# Process new content
process_content "$content"
# Store hash to prevent future duplicates
store_content_hash "$content"
```
**Output**: Working content-based deduplication in your hook.
**Common Issues**:
- **Hash file grows too large**: Implement rotation (see Phase 3)
- **Missed duplicates (false negatives)**: Normalize content (whitespace, formatting) before hashing; see the sketch below
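Normalization is called out again in the reminders and best practices below, so it is worth wiring in from the start. A minimal helper is sketched here; the exact rules (collapsing runs of whitespace, trimming ends) are assumptions to tune for your content:
```bash
# Normalize content before hashing so cosmetic differences don't change the hash.
# The rules below are a starting point, not a fixed specification.
normalize_content() {
    local content="$1"
    # Convert all whitespace to single spaces, then trim leading/trailing space
    echo -n "$content" | tr -s '[:space:]' ' ' | sed 's/^ //; s/ $//'
}

# Hash the normalized form everywhere a hash is computed, e.g.:
# content_hash=$(compute_content_hash "$(normalize_content "$content")")
```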
---
### Phase 3: Implement Hash Rotation
**Purpose**: Prevent hash file from growing indefinitely.
**Steps**:
1. **Set rotation limit**:
```bash
MAX_HASHES=10000 # Keep last 10,000 hashes
```
2. **Implement rotation logic**:
```bash
rotate_hash_file() {
local hash_file="$1"
local max_hashes="${2:-10000}"
# Count current hashes
local current_count=$(wc -l < "$hash_file")
# Rotate if needed
if [ "$current_count" -gt "$max_hashes" ]; then
tail -n "$max_hashes" "$hash_file" > "${hash_file}.tmp"
mv "${hash_file}.tmp" "$hash_file"
echo "Rotated hash file: kept last $max_hashes hashes" >&2
fi
}
```
3. **Call rotation periodically**:
```bash
# After storing new hash
store_content_hash "$content"
rotate_hash_file "$HASH_FILE" 10000
```
**Output**: Self-maintaining hash storage with bounded size.
**Common Issues**:
- **Rotation too aggressive**: Increase MAX_HASHES
- **Rotation too infrequent**: Consider checking count before every append
---
### Phase 4: Testing and Validation
**Purpose**: Verify deduplication works correctly.
**Steps**:
1. **Test duplicate detection**:
```bash
# First run - should process
echo "Test insight" | your_hook.sh
# Check: Content was processed
# Second run - should skip
echo "Test insight" | your_hook.sh
# Check: Duplicate detected message
```
2. **Test multiple unique items**:
```bash
echo "Insight 1" | your_hook.sh # Processed
echo "Insight 2" | your_hook.sh # Processed
echo "Insight 3" | your_hook.sh # Processed
echo "Insight 1" | your_hook.sh # Skipped (duplicate)
```
3. **Verify hash file**:
```bash
cat ~/.claude/state/hook-state/content-hashes.txt
# Should show 3 unique hashes (not 4)
```
4. **Test rotation**:
```bash
# Generate more than MAX_HASHES entries
for i in {1..10500}; do
echo "Insight $i" | your_hook.sh
done
# Verify file size bounded
wc -l ~/.claude/state/hook-state/content-hashes.txt
# Should be ~10000, not 10500
```
**Output**: Confirmed working deduplication with proper rotation.
---
## Reference Materials
- [Original Insight](data/insights-reference.md) - Full context on hook deduplication patterns
---
## Important Reminders
- **Use content-based deduplication for insights/documentation** - prevents duplicates across sessions
- **Use session-based deduplication for logs/events** - same event in different sessions is meaningful
- **Normalize content before hashing** - whitespace differences shouldn't create false negatives
- **Implement rotation** - prevent unbounded hash file growth
- **Hash storage location**: `~/.claude/state/hook-state/` (not project-specific)
- **SHA256 is fast** - no performance concerns for typical hook data
- **Test both paths** - verify both new content and duplicates work correctly
**Warnings**:
- ⚠️ **Do not deduplicate on the session ID alone** - it blocks additional unique insights within a session, yet still lets the same insight be stored again in later sessions
- ⚠️ **Do not skip rotation** - hash file will grow indefinitely
- ⚠️ **Do not hash before normalization** - formatting changes will cause false negatives
---
## Best Practices
1. **Choose the Right Strategy**: Content-based for unique data, session-based for session-specific events
2. **Normalize Before Hashing**: Strip whitespace, lowercase if appropriate, consistent formatting
3. **Efficient Storage**: Use `grep -Fxq` for fast hash lookups (fixed-string, whole-line match, quiet)
4. **Bounded Growth**: Implement rotation to prevent file bloat
5. **Clear Logging**: Log when duplicates are detected for debugging
6. **State Location**: Use ~/.claude/state/hook-state/ for cross-project state
---
## Troubleshooting
### Duplicates not being detected
**Symptoms**: Same content processed multiple times
**Solution**:
1. Check hash file exists and is writable
2. Verify store_content_hash is called after processing
3. Check content normalization (whitespace differences)
4. Verify grep command uses -Fxq flags
**Prevention**: Test deduplication immediately after implementation
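When walking through those checks, it can help to test the hash lookup by hand; a quick one-off check using the Phase 2 paths might look like this (the sample content is just a placeholder):
```bash
# Manually verify whether a given piece of content is already recorded
HASH_FILE="$HOME/.claude/state/hook-state/content-hashes.txt"
content="Test insight"
content_hash=$(echo -n "$content" | sha256sum | awk '{print $1}')
if grep -Fxq "$content_hash" "$HASH_FILE"; then
    echo "already recorded"
else
    echo "not recorded"
fi
```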
---
### Hash file growing too large
**Symptoms**: Hash file exceeds MAX_HASHES significantly
**Solution**:
1. Verify rotate_hash_file is called
2. Check MAX_HASHES value is reasonable
3. Manually rotate if needed: `tail -n 10000 hashes.txt > hashes.tmp && mv hashes.tmp hashes.txt`
**Prevention**: Call rotation after every hash storage
---
### False positives (new content marked as duplicate)
**Symptoms**: Different content being skipped
**Solution**:
1. Check for hash collisions (extremely unlikely with SHA256)
2. Verify content is actually different
3. Check normalization isn't too aggressive
4. Review recent hashes in file
**Prevention**: Use consistent normalization, test with diverse content
---
## Next Steps
After implementing deduplication:
1. Monitor hash file growth over time
2. Tune MAX_HASHES based on usage patterns
3. Consider adding metrics such as duplicates prevented and storage size (a minimal counter sketch follows this list)
4. Share pattern with team for other hooks
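As a hypothetical starting point for the metrics idea above, a small counter file alongside the hash file can track how many duplicates the hook has skipped; the file name is an assumption:
```bash
# Hypothetical duplicates-prevented counter (file name is an assumption)
METRICS_FILE="$HOME/.claude/state/hook-state/duplicates-prevented.count"

increment_duplicates_prevented() {
    local count
    count=$(cat "$METRICS_FILE" 2>/dev/null)
    echo $(( ${count:-0} + 1 )) > "$METRICS_FILE"
}

# Call it from the duplicate branch of your hook:
# if is_duplicate "$content"; then increment_duplicates_prevented; exit 0; fi
```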
---
## Metadata
**Source Insights**:
- Session: abc123-session-id
- Date: 2025-11-03
- Category: hooks-and-events
- File: docs/lessons-learned/hooks-and-events/2025-11-03-hook-deduplication.md
**Skill Version**: 0.1.0
**Generated**: 2025-11-16
**Last Updated**: 2025-11-16