---
name: hook-deduplication-guide
description: Use PROACTIVELY when developing Claude Code hooks to implement content-based deduplication and prevent duplicate insight storage across sessions
---
# Hook Deduplication Guide
## Overview
This skill guides you through implementing robust deduplication for Claude Code hooks, using content-based hashing instead of session-based tracking. It prevents duplicate insights from being stored while still allowing multiple unique insights per session.
**Based on 1 insight**:
- Hook Deduplication Session Management (hooks-and-events, 2025-11-03)
**Key Capabilities**:
- Content-based deduplication using SHA256 hashes
- Session-independent duplicate detection
- Efficient hash storage with rotation
- State management best practices
## When to Use This Skill
**Trigger Phrases**:
- "implement hook deduplication"
- "prevent duplicate insights in hooks"
- "content-based deduplication for hooks"
- "hook state management patterns"
**Use Cases**:
- Developing new Claude Code hooks that store data
- Refactoring hooks to prevent duplicates
- Implementing efficient state management for hooks
- Debugging duplicate data issues in hooks
**Do NOT use when**:
- Creating hooks that don't store data (read-only hooks)
- Session-based deduplication is actually desired
- Hook doesn't run frequently enough to need deduplication
## Response Style
Educational and practical - explain the why behind content-based vs. session-based deduplication, then guide implementation with code examples.
---
## Workflow
### Phase 1: Choose Deduplication Strategy
**Purpose**: Determine whether content-based or session-based deduplication is appropriate.
**Steps**:
1. **Assess hook behavior**:
- How often does the hook run? (per message, per session, per event)
- What data is being stored? (insights, logs, metrics)
- Is the same content likely to appear across sessions?
2. **Evaluate deduplication needs**:
- **Content-based**: Use when the same insight/data might appear in different sessions
- Example: Extract-explanatory-insights hook (same insight might appear in multiple conversations)
- **Session-based**: Use when duplicates should only be prevented within a single session (a minimal sketch appears at the end of this phase)
- Example: Error logging (same error in different sessions should be logged)
3. **Recommend strategy**:
- For insights/lessons-learned: Content-based (SHA256 hashing)
- For session logs/events: Session-based (session ID tracking)
- For unique events: No deduplication needed
**Output**: Clear recommendation on deduplication strategy.
**Common Issues**:
- **Unsure which to use**: Default to content-based for data that's meant to be unique (insights, documentation)
- **Performance concerns**: Content-based hashing is fast (<1ms for typical content)
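For contrast with the content-based pattern implemented in Phase 2, a session-scoped check can include the session ID in the hash key, so the same content reappearing in a new session produces a new key and is processed again. The sketch below assumes the hook already has the session ID available (for example from its JSON input); the file path and function names are illustrative, not a fixed API.
```bash
# Session-based deduplication sketch: duplicates are only suppressed within a session.
SESSION_HASH_FILE="$HOME/.claude/state/hook-state/session-hashes.txt"
mkdir -p "$(dirname "$SESSION_HASH_FILE")" && touch "$SESSION_HASH_FILE"

# Returns 0 if this (session, content) pair was already handled, 1 otherwise.
# Records the pair on first sight, so a second call in the same session skips it.
seen_in_session() {
    local session_id="$1" content="$2"
    local key
    key=$(echo -n "${session_id}:${content}" | sha256sum | awk '{print $1}')
    if grep -Fxq "$key" "$SESSION_HASH_FILE"; then
        return 0
    fi
    echo "$key" >> "$SESSION_HASH_FILE"
    return 1
}
```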
---
### Phase 2: Implement Content-Based Deduplication
**Purpose**: Set up SHA256 hash-based deduplication with state management.
**Steps**:
1. **Create state directory**:
```bash
mkdir -p ~/.claude/state/hook-state/
```
2. **Initialize hash storage file**:
```bash
HASH_FILE="$HOME/.claude/state/hook-state/content-hashes.txt"
touch "$HASH_FILE"
```
3. **Implement hash generation**:
```bash
# Generate SHA256 hash of content
compute_content_hash() {
local content="$1"
echo -n "$content" | sha256sum | awk '{print $1}'
}
```
4. **Check for duplicates**:
```bash
# Returns 0 (success) if content is a duplicate, 1 if content is new,
# so it can be used directly as: if is_duplicate "$content"; then ... skip ...; fi
is_duplicate() {
    local content="$1"
    local content_hash
    content_hash=$(compute_content_hash "$content")
    if grep -Fxq "$content_hash" "$HASH_FILE"; then
        return 0  # Duplicate found
    else
        return 1  # New content
    fi
}
```
5. **Store hash after processing**:
```bash
store_content_hash() {
local content="$1"
local content_hash=$(compute_content_hash "$content")
echo "$content_hash" >> "$HASH_FILE"
}
```
6. **Integrate into hook**:
```bash
# In your hook script
content="extracted insight or data"
if is_duplicate "$content"; then
# Skip - duplicate content
echo "Duplicate detected, skipping..." >&2
exit 0
fi
# Process new content
process_content "$content"
# Store hash to prevent future duplicates
store_content_hash "$content"
```
**Output**: Working content-based deduplication in your hook.
**Common Issues**:
- **Hash file grows too large**: Implement rotation (see Phase 3)
- **Missed duplicates (false negatives)**: Normalize content (whitespace, formatting) before hashing; see the sketch below
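Normalization is called out again in the reminders and best practices below, so it is worth wiring in from the start. A minimal helper is sketched here; the exact rules (collapsing runs of whitespace, trimming ends) are assumptions to tune for your content:
```bash
# Normalize content before hashing so cosmetic differences don't change the hash.
# The rules below are a starting point, not a fixed specification.
normalize_content() {
    local content="$1"
    # Convert all whitespace to single spaces, then trim leading/trailing space
    echo -n "$content" | tr -s '[:space:]' ' ' | sed 's/^ //; s/ $//'
}

# Hash the normalized form everywhere a hash is computed, e.g.:
# content_hash=$(compute_content_hash "$(normalize_content "$content")")
```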
---
### Phase 3: Implement Hash Rotation
**Purpose**: Prevent hash file from growing indefinitely.
**Steps**:
1. **Set rotation limit**:
```bash
MAX_HASHES=10000 # Keep last 10,000 hashes
```
2. **Implement rotation logic**:
```bash
rotate_hash_file() {
local hash_file="$1"
local max_hashes="${2:-10000}"
# Count current hashes
local current_count=$(wc -l < "$hash_file")
# Rotate if needed
if [ "$current_count" -gt "$max_hashes" ]; then
tail -n "$max_hashes" "$hash_file" > "${hash_file}.tmp"
mv "${hash_file}.tmp" "$hash_file"
echo "Rotated hash file: kept last $max_hashes hashes" >&2
fi
}
```
3. **Call rotation periodically**:
```bash
# After storing new hash
store_content_hash "$content"
rotate_hash_file "$HASH_FILE" 10000
```
**Output**: Self-maintaining hash storage with bounded size.
**Common Issues**:
- **Rotation too aggressive**: Increase MAX_HASHES
- **Rotation too infrequent**: Consider checking count before every append
---
### Phase 4: Testing and Validation
**Purpose**: Verify deduplication works correctly.
**Steps**:
1. **Test duplicate detection**:
```bash
# First run - should process
echo "Test insight" | your_hook.sh
# Check: Content was processed
# Second run - should skip
echo "Test insight" | your_hook.sh
# Check: Duplicate detected message
```
2. **Test multiple unique items**:
```bash
echo "Insight 1" | your_hook.sh # Processed
echo "Insight 2" | your_hook.sh # Processed
echo "Insight 3" | your_hook.sh # Processed
echo "Insight 1" | your_hook.sh # Skipped (duplicate)
```
3. **Verify hash file**:
```bash
cat ~/.claude/state/hook-state/content-hashes.txt
# Should show 3 unique hashes (not 4)
```
4. **Test rotation**:
```bash
# Generate more than MAX_HASHES entries
for i in {1..10500}; do
echo "Insight $i" | your_hook.sh
done
# Verify file size bounded
wc -l ~/.claude/state/hook-state/content-hashes.txt
# Should be ~10000, not 10500
```
**Output**: Confirmed working deduplication with proper rotation.
---
## Reference Materials
- [Original Insight](data/insights-reference.md) - Full context on hook deduplication patterns
---
## Important Reminders
- **Use content-based deduplication for insights/documentation** - prevents duplicates across sessions
- **Use session-based deduplication for logs/events** - same event in different sessions is meaningful
- **Normalize content before hashing** - whitespace differences shouldn't create false negatives
- **Implement rotation** - prevent unbounded hash file growth
- **Hash storage location**: `~/.claude/state/hook-state/` (not project-specific)
- **SHA256 is fast** - no performance concerns for typical hook data
- **Test both paths** - verify both new content and duplicates work correctly
**Warnings**:
- ⚠️ **Do not deduplicate on the session ID alone** - it blocks additional unique insights within a session, yet still lets the same insight be stored again in later sessions
- ⚠️ **Do not skip rotation** - hash file will grow indefinitely
- ⚠️ **Do not hash before normalization** - formatting changes will cause false negatives
---
## Best Practices
1. **Choose the Right Strategy**: Content-based for unique data, session-based for session-specific events
2. **Normalize Before Hashing**: Strip whitespace, lowercase if appropriate, consistent formatting
3. **Efficient Storage**: Use `grep -Fxq` for fast hash lookups (fixed-string, whole-line match, quiet)
4. **Bounded Growth**: Implement rotation to prevent file bloat
5. **Clear Logging**: Log when duplicates are detected for debugging
6. **State Location**: Use ~/.claude/state/hook-state/ for cross-project state
---
## Troubleshooting
### Duplicates not being detected
**Symptoms**: Same content processed multiple times
**Solution**:
1. Check hash file exists and is writable
2. Verify store_content_hash is called after processing
3. Check content normalization (whitespace differences)
4. Verify grep command uses -Fxq flags
**Prevention**: Test deduplication immediately after implementation
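When walking through those checks, it can help to test the hash lookup by hand; a quick one-off check using the Phase 2 paths might look like this (the sample content is just a placeholder):
```bash
# Manually verify whether a given piece of content is already recorded
HASH_FILE="$HOME/.claude/state/hook-state/content-hashes.txt"
content="Test insight"
content_hash=$(echo -n "$content" | sha256sum | awk '{print $1}')
if grep -Fxq "$content_hash" "$HASH_FILE"; then
    echo "already recorded"
else
    echo "not recorded"
fi
```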
---
### Hash file growing too large
**Symptoms**: Hash file exceeds MAX_HASHES significantly
**Solution**:
1. Verify rotate_hash_file is called
2. Check MAX_HASHES value is reasonable
3. Manually rotate if needed: `tail -n 10000 hashes.txt > hashes.tmp && mv hashes.tmp hashes.txt`
**Prevention**: Call rotation after every hash storage
---
### False positives (new content marked as duplicate)
**Symptoms**: Different content being skipped
**Solution**:
1. Check for hash collisions (extremely unlikely with SHA256)
2. Verify content is actually different
3. Check normalization isn't too aggressive
4. Review recent hashes in file
**Prevention**: Use consistent normalization, test with diverse content
---
## Next Steps
After implementing deduplication:
1. Monitor hash file growth over time
2. Tune MAX_HASHES based on usage patterns
3. Consider adding metrics such as duplicates prevented and storage size (a minimal counter sketch follows this list)
4. Share pattern with team for other hooks
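As a hypothetical starting point for the metrics idea above, a small counter file alongside the hash file can track how many duplicates the hook has skipped; the file name is an assumption:
```bash
# Hypothetical duplicates-prevented counter (file name is an assumption)
METRICS_FILE="$HOME/.claude/state/hook-state/duplicates-prevented.count"

increment_duplicates_prevented() {
    local count
    count=$(cat "$METRICS_FILE" 2>/dev/null)
    echo $(( ${count:-0} + 1 )) > "$METRICS_FILE"
}

# Call it from the duplicate branch of your hook:
# if is_duplicate "$content"; then increment_duplicates_prevented; exit 0; fi
```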
---
## Metadata
**Source Insights**:
- Session: abc123-session-id
- Date: 2025-11-03
- Category: hooks-and-events
- File: docs/lessons-learned/hooks-and-events/2025-11-03-hook-deduplication.md
**Skill Version**: 0.1.0
**Generated**: 2025-11-16
**Last Updated**: 2025-11-16