name: hook-deduplication-guide
description: Use PROACTIVELY when developing Claude Code hooks to implement content-based deduplication and prevent duplicate insight storage across sessions

Hook Deduplication Guide

Overview

This skill guides you through implementing robust deduplication for Claude Code hooks, using content-based hashing instead of session-based tracking. It prevents duplicate insights from being stored while still allowing multiple unique insights per session.

Based on 1 insight:

  • Hook Deduplication Session Management (hooks-and-events, 2025-11-03)

Key Capabilities:

  • Content-based deduplication using SHA256 hashes
  • Session-independent duplicate detection
  • Efficient hash storage with rotation
  • State management best practices

When to Use This Skill

Trigger Phrases:

  • "implement hook deduplication"
  • "prevent duplicate insights in hooks"
  • "content-based deduplication for hooks"
  • "hook state management patterns"

Use Cases:

  • Developing new Claude Code hooks that store data
  • Refactoring hooks to prevent duplicates
  • Implementing efficient state management for hooks
  • Debugging duplicate data issues in hooks

Do NOT use when:

  • Creating hooks that don't store data (read-only hooks)
  • Session-based deduplication is actually desired
  • Hook doesn't run frequently enough to need deduplication

Response Style

Educational and practical - explain the why behind content-based vs. session-based deduplication, then guide implementation with code examples.


Workflow

Phase 1: Choose Deduplication Strategy

Purpose: Determine whether content-based or session-based deduplication is appropriate.

Steps:

  1. Assess hook behavior:

    • How often does the hook run? (per message, per session, per event)
    • What data is being stored? (insights, logs, metrics)
    • Is the same content likely to appear across sessions?
  2. Evaluate deduplication needs:

    • Content-based: Use when the same insight/data might appear in different sessions
      • Example: Extract-explanatory-insights hook (same insight might appear in multiple conversations)
    • Session-based: Use when duplicates should only be prevented within a session (a minimal sketch appears at the end of this phase)
      • Example: Error logging (the same error occurring in different sessions should still be logged)
  3. Recommend strategy:

    • For insights/lessons-learned: Content-based (SHA256 hashing)
    • For session logs/events: Session-based (session ID tracking)
    • For unique events: No deduplication needed

Output: Clear recommendation on deduplication strategy.

Common Issues:

  • Unsure which to use: Default to content-based for data that's meant to be unique (insights, documentation)
  • Performance concerns: Content-based hashing is fast (<1ms for typical content)
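
For the session-based strategy above, a minimal sketch is shown below. It assumes the hook receives a session identifier in a SESSION_ID environment variable, which is an assumption of this example - adapt it to however your hook actually obtains the session ID:

    # Session-based deduplication (sketch): skip work already done this session
    SEEN_FILE="$HOME/.claude/state/hook-state/seen-sessions.txt"
    touch "$SEEN_FILE"

    # Returns 0 (success) if this session was already processed
    seen_this_session() {
      grep -Fxq "$SESSION_ID" "$SEEN_FILE"
    }

    mark_session_seen() {
      echo "$SESSION_ID" >> "$SEEN_FILE"
    }

The trade-off is visible in the storage key: the same content arriving in a new session passes the check and gets stored again, which is exactly why insights use the content-based approach in Phase 2.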

Phase 2: Implement Content-Based Deduplication

Purpose: Set up SHA256 hash-based deduplication with state management.

Steps:

  1. Create state directory:

    mkdir -p ~/.claude/state/hook-state/
    
  2. Initialize hash storage file:

    HASH_FILE="$HOME/.claude/state/hook-state/content-hashes.txt"
    touch "$HASH_FILE"
    
  3. Implement hash generation:

    # Generate SHA256 hash of content
    compute_content_hash() {
      local content="$1"
      echo -n "$content" | sha256sum | awk '{print $1}'
    }
    
  4. Check for duplicates:

    # Returns 0 (success) if content is a duplicate, 1 if it is new,
    # so `if is_duplicate "$content"` reads naturally in the caller
    is_duplicate() {
      local content="$1"
      local content_hash
      content_hash=$(compute_content_hash "$content")

      if grep -Fxq "$content_hash" "$HASH_FILE"; then
        return 0  # Duplicate found
      else
        return 1  # New content
      fi
    }
    
  5. Store hash after processing:

    store_content_hash() {
      local content="$1"
      local content_hash
      content_hash=$(compute_content_hash "$content")
      echo "$content_hash" >> "$HASH_FILE"
    }
    
  6. Integrate into hook:

    # In your hook script
    content="extracted insight or data"
    
    if is_duplicate "$content"; then
      # Skip - duplicate content
      echo "Duplicate detected, skipping..." >&2
      exit 0
    fi
    
    # Process new content
    process_content "$content"
    
    # Store hash to prevent future duplicates
    store_content_hash "$content"
    

Output: Working content-based deduplication in your hook.
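
Putting the pieces together, a complete minimal hook might look like the sketch below. It assumes the hook receives its content on stdin (matching the tests in Phase 4); the insights.log destination is a placeholder for your real storage logic:

    #!/usr/bin/env bash
    # your_hook.sh (sketch): content-based deduplication end to end
    set -euo pipefail

    HASH_FILE="$HOME/.claude/state/hook-state/content-hashes.txt"
    mkdir -p "$(dirname "$HASH_FILE")"
    touch "$HASH_FILE"

    # Read the content to store from stdin
    content=$(cat)

    content_hash=$(echo -n "$content" | sha256sum | awk '{print $1}')

    if grep -Fxq "$content_hash" "$HASH_FILE"; then
      echo "Duplicate detected, skipping..." >&2
      exit 0
    fi

    # Placeholder storage: swap in your real processing here
    printf '%s\n' "$content" >> "$HOME/.claude/state/hook-state/insights.log"

    # Record the hash only after storage succeeds
    echo "$content_hash" >> "$HASH_FILE"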

Common Issues:

  • Hash file grows too large: Implement rotation (see Phase 3)
  • Missed duplicates (false negatives): Normalize content before hashing so whitespace or formatting differences do not change the hash (see the sketch below)
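
The false-negative issue above is worth a concrete helper. A minimal sketch, assuming whitespace is the only variation you need to absorb (add lowercasing only if case differences should count as the same content):

    # Collapse whitespace runs to single spaces and trim both ends
    normalize_content() {
      echo "$1" | tr -s '[:space:]' ' ' | sed -e 's/^ *//' -e 's/ *$//'
    }

    # Hash the normalized form so formatting changes cannot defeat deduplication
    content_hash=$(compute_content_hash "$(normalize_content "$content")")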

Phase 3: Implement Hash Rotation

Purpose: Prevent hash file from growing indefinitely.

Steps:

  1. Set rotation limit:

    MAX_HASHES=10000  # Keep last 10,000 hashes
    
  2. Implement rotation logic:

    rotate_hash_file() {
      local hash_file="$1"
      local max_hashes="${2:-10000}"

      # Count current hashes
      local current_count
      current_count=$(wc -l < "$hash_file")

      # Rotate if needed: keep only the most recent max_hashes entries
      if [ "$current_count" -gt "$max_hashes" ]; then
        tail -n "$max_hashes" "$hash_file" > "${hash_file}.tmp"
        mv "${hash_file}.tmp" "$hash_file"
        echo "Rotated hash file: kept last $max_hashes hashes" >&2
      fi
    }
    
  3. Call rotation periodically:

    # After storing new hash
    store_content_hash "$content"
    rotate_hash_file "$HASH_FILE" 10000
    

Output: Self-maintaining hash storage with bounded size.

Common Issues:

  • Rotation too aggressive: Increase MAX_HASHES
  • Rotation too infrequent: Check the count on every append by calling rotate_hash_file right after store_content_hash, as in step 3

Phase 4: Testing and Validation

Purpose: Verify deduplication works correctly.

Steps:

  1. Test duplicate detection:

    # First run - should process
    echo "Test insight" | your_hook.sh
    # Check: Content was processed
    
    # Second run - should skip
    echo "Test insight" | your_hook.sh
    # Check: Duplicate detected message
    
  2. Test multiple unique items:

    echo "Insight 1" | your_hook.sh  # Processed
    echo "Insight 2" | your_hook.sh  # Processed
    echo "Insight 3" | your_hook.sh  # Processed
    echo "Insight 1" | your_hook.sh  # Skipped (duplicate)
    
  3. Verify hash file:

    cat ~/.claude/state/hook-state/content-hashes.txt
    # Should show 3 unique hashes (not 4)
    
  4. Test rotation:

    # Generate more than MAX_HASHES entries
    for i in {1..10500}; do
      echo "Insight $i" | your_hook.sh
    done
    
    # Verify file size bounded
    wc -l ~/.claude/state/hook-state/content-hashes.txt
    # Should be ~10000, not 10500
    

Output: Confirmed working deduplication with proper rotation.


Important Reminders

  • Use content-based deduplication for insights/documentation - prevents duplicates across sessions
  • Use session-based deduplication for logs/events - same event in different sessions is meaningful
  • Normalize content before hashing - whitespace differences shouldn't create false negatives
  • Implement rotation - prevent unbounded hash file growth
  • Hash storage location: ~/.claude/state/hook-state/ (not project-specific)
  • SHA256 is fast - no performance concerns for typical hook data
  • Test both paths - verify both new content and duplicates work correctly

Warnings:

  • ⚠️ Do not key deduplication on session ID alone - it blocks additional unique insights within the same session, while a recurring insight in a new session is stored again as a duplicate
  • ⚠️ Do not skip rotation - hash file will grow indefinitely
  • ⚠️ Do not hash before normalization - formatting changes will cause false negatives

Best Practices

  1. Choose the Right Strategy: Content-based for unique data, session-based for session-specific events
  2. Normalize Before Hashing: Strip whitespace, lowercase if appropriate, consistent formatting
  3. Efficient Storage: Use grep -Fxq for fast hash lookups (fixed-string, line-match, quiet)
  4. Bounded Growth: Implement rotation to prevent file bloat
  5. Clear Logging: Log when duplicates are detected for debugging (see the sketch after this list)
  6. State Location: Use ~/.claude/state/hook-state/ for cross-project state
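
For practice 5, a small helper keeps duplicate detections visible without polluting stdout. A sketch, with the log path being an assumption of this example:

    HOOK_LOG="$HOME/.claude/state/hook-state/hook.log"

    # Timestamped message to the log file and stderr (stdout stays clean)
    log_event() {
      echo "$(date '+%Y-%m-%d %H:%M:%S') $1" | tee -a "$HOOK_LOG" >&2
    }

    # Example: call this where the duplicate branch exits
    log_event "duplicate skipped: hash=$content_hash"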

Troubleshooting

Duplicates not being detected

Symptoms: Same content processed multiple times

Solution:

  1. Check hash file exists and is writable
  2. Verify store_content_hash is called after processing
  3. Check content normalization (whitespace differences)
  4. Verify grep command uses -Fxq flags

Prevention: Test deduplication immediately after implementation


Hash file growing too large

Symptoms: Hash file exceeds MAX_HASHES significantly

Solution:

  1. Verify rotate_hash_file is called
  2. Check MAX_HASHES value is reasonable
  3. Manually rotate if needed: tail -n 10000 hashes.txt > hashes.tmp && mv hashes.tmp hashes.txt

Prevention: Call rotation after every hash storage


False positives (new content marked as duplicate)

Symptoms: Different content being skipped

Solution:

  1. Check for hash collisions (extremely unlikely with SHA256)
  2. Verify content is actually different
  3. Check normalization isn't too aggressive
  4. Review recent hashes in file

Prevention: Use consistent normalization, test with diverse content


Next Steps

After implementing deduplication:

  1. Monitor hash file growth over time
  2. Tune MAX_HASHES based on usage patterns
  3. Consider adding metrics such as duplicates prevented and storage size (a sketch follows this list)
  4. Share pattern with team for other hooks
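
For step 3, a plain counter file is often enough to start. A sketch; the metrics path and line format are assumptions of this example:

    METRICS_FILE="$HOME/.claude/state/hook-state/metrics.txt"

    # Append one line per prevented duplicate
    record_duplicate_prevented() {
      echo "$(date '+%Y-%m-%d') duplicate-prevented" >> "$METRICS_FILE"
    }

    # Summarize: duplicates prevented so far and current hash store size
    report_metrics() {
      touch "$METRICS_FILE"
      echo "Duplicates prevented: $(grep -c 'duplicate-prevented' "$METRICS_FILE")"
      echo "Hash store size: $(wc -l < "$HASH_FILE") hashes"
    }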

Metadata

Source Insights:

  • Session: abc123-session-id
  • Date: 2025-11-03
  • Category: hooks-and-events
  • File: docs/lessons-learned/hooks-and-events/2025-11-03-hook-deduplication.md

Skill Version: 0.1.0
Generated: 2025-11-16
Last Updated: 2025-11-16