Initial commit

Zhongwei Li
2025-11-29 18:16:56 +08:00
commit 8a3d331e04
61 changed files with 11808 additions and 0 deletions


@@ -0,0 +1,286 @@
# Example: Clustering Analysis Output
This example shows what the clustering phase produces when analyzing a project's insights.
## Scenario
A project has been using the extract-explanatory-insights hook for 2 weeks, generating 12 insights across different categories.
---
## Phase 1: Discovery Summary
**Total Insights Found**: 12
**Date Range**: 2025-11-01 to 2025-11-14
**Unique Sessions**: 8
**Categories**:
- testing: 5 insights
- hooks-and-events: 3 insights
- architecture: 2 insights
- performance: 2 insights
**Preview**:
1. "Modern Testing Strategy with Testing Trophy" (testing, 2025-11-01)
2. "Hook Deduplication Session Management" (hooks-and-events, 2025-11-03)
3. "CPU Usage Prevention in Vitest" (testing, 2025-11-03)
4. "BSD awk Compatibility in Hook Scripts" (hooks-and-events, 2025-11-05)
5. "Semantic Query Priorities in React Testing Library" (testing, 2025-11-06)
---
## Phase 2: Clustering Analysis
### Cluster 1: Testing Strategy
**Size**: 3 insights
**Similarity Score**: 0.75 (high)
**Recommended Complexity**: Standard
**Recommended Pattern**: Validation
**Insights**:
1. "Modern Testing Strategy with Testing Trophy"
- Keywords: testing, integration, unit, e2e, trophy, kent-c-dodds
- Category: testing
- Date: 2025-11-01
- Length: 156 lines
- Has code examples: Yes
2. "Semantic Query Priorities in React Testing Library"
- Keywords: testing, react, semantic, query, getByRole, accessibility
- Category: testing
- Date: 2025-11-06
- Length: 89 lines
- Has code examples: Yes
3. "What NOT to Test - Brittle Patterns"
- Keywords: testing, avoid, brittle, implementation-details, user-behavior
- Category: testing
- Date: 2025-11-08
- Length: 67 lines
- Has code examples: No
**Shared Keywords**: testing (3), react (2), user (2), behavior (2), semantic (2)
**Cluster Characteristics**:
- All in same category (testing)
- Temporal span: 7 days
- Common theme: User-focused testing approach
- Total code examples: 5 blocks
- Actionable items: 12
**Suggested Skill Name**: "user-focused-testing-guide"
**Suggested Description**: "Use PROACTIVELY when writing tests to ensure user-centric testing strategy following Testing Trophy methodology and React Testing Library best practices"
**Skill Structure Recommendation**:
```
SKILL.md sections:
- Overview (Testing Trophy philosophy)
- Phase 1: Query Selection (semantic queries)
- Phase 2: Test Writing (user workflows)
- Phase 3: Avoiding Brittle Tests
- Important Reminders (what NOT to test)
- Examples (from code blocks)
```
---
### Cluster 2: Hook Development
**Size**: 2 insights
**Similarity Score**: 0.68 (medium-high)
**Recommended Complexity**: Minimal to Standard
**Recommended Pattern**: Phase-based
**Insights**:
1. "Hook Deduplication Session Management"
- Keywords: hook, deduplication, session, hash, state
- Category: hooks-and-events
- Date: 2025-11-03
- Length: 124 lines
- Has code examples: Yes (bash)
2. "BSD awk Compatibility in Hook Scripts"
- Keywords: hook, awk, bsd, macos, compatibility, portable
- Category: hooks-and-events
- Date: 2025-11-05
- Length: 78 lines
- Has code examples: Yes (bash)
**Shared Keywords**: hook (2), bash (2), script (2), compatibility (2)
**Cluster Characteristics**:
- Same category (hooks-and-events)
- Temporal span: 2 days (high proximity)
- Common theme: Hook script development best practices
- Total code examples: 4 blocks (all bash)
- Actionable items: 8
**Suggested Skill Name**: "hook-script-development-guide"
**Suggested Description**: "Guides Claude Code hook script development with deduplication patterns and cross-platform compatibility best practices for macOS/Linux environments"
**Skill Structure Recommendation**:
```
SKILL.md sections:
- Overview (Hook development challenges)
- Phase 1: Deduplication Strategy (session vs content)
- Phase 2: Cross-Platform Compatibility (BSD awk)
- Phase 3: Testing and Validation
- Examples (bash scripts)
- Templates (hook-template.sh)
```
---
### Cluster 3: Performance Optimization
**Size**: 2 insights
**Similarity Score**: 0.62 (medium)
**Recommended Complexity**: Minimal
**Recommended Pattern**: Phase-based
**Insights**:
1. "CPU Usage Prevention in Vitest"
- Keywords: vitest, cpu, performance, threads, watch-mode
- Category: testing
- Date: 2025-11-03
- Length: 45 lines
- Has code examples: Yes (npm scripts)
2. "Bundle Size Optimization"
- Keywords: bundle, size, webpack, optimization, tree-shaking
- Category: performance
- Date: 2025-11-09
- Length: 92 lines
- Has code examples: Yes (webpack config)
**Shared Keywords**: performance (2), optimization (2), build (2)
**Cluster Characteristics**:
- Different categories (testing, performance)
- Temporal span: 6 days
- Common theme: Build/test performance
- Total code examples: 3 blocks
- Actionable items: 6
**Suggested Skill Name**: "build-performance-optimizer"
**Suggested Description**: "Analyzes build and test performance issues and generates optimization recommendations for Vitest test runs and webpack bundle sizes"
**Skill Structure Recommendation**:
```
SKILL.md sections:
- Overview (Performance impact on DX)
- Phase 1: Test Performance Analysis
- Phase 2: Bundle Size Analysis
- Phase 3: Optimization Implementation
- Important Reminders (check before test, monitor bundle)
- Examples (npm scripts, webpack config)
```
---
### Standalone Insights
#### Standalone 1: "Hook State Management Patterns"
**Quality Score**: 0.85 (high - qualifies for standalone skill)
**Category**: hooks-and-events
**Date**: 2025-11-07
**Length**: 134 lines
**Code Examples**: Yes (3 blocks)
**Why Standalone**:
- Doesn't cluster with other hook insights (different focus)
- High quality with comprehensive coverage
- Self-contained topic (state management)
- Multiple actionable patterns
**Suggested Skill Name**: "hook-state-manager"
**Suggested Description**: "Automates state management setup for Claude Code hooks with persistent storage, cleanup, and safe concurrency patterns"
---
#### Standalone 2: "Architecture Decision Records"
**Quality Score**: 0.82 (high - qualifies for standalone skill)
**Category**: architecture
**Date**: 2025-11-12
**Length**: 156 lines
**Code Examples**: Yes (template)
**Why Standalone**:
- Unique topic (no other architecture insights)
- High quality with complete template
- Valuable for documentation
- Industry best practice
**Suggested Skill Name**: "adr-documentation-helper"
**Suggested Description**: "Guides creation of Architecture Decision Records (ADRs) following industry standards with templates and integration with project documentation"
---
### Low-Quality Insights (Not Recommended for Skills)
#### "Git Branch Naming Convention"
**Quality Score**: 0.42 (low)
**Category**: version-control
**Reason for Exclusion**: Too simple, covered by existing conventions, no unique value
#### "TypeScript Strict Mode Benefits"
**Quality Score**: 0.38 (low)
**Category**: typescript
**Reason for Exclusion**: Common knowledge, well-documented elsewhere, not actionable enough
---
## User Decision Points
At this stage, the skill would present the following options to the user:
**Option 1: Generate All Recommended Skills** (5 skills)
- user-focused-testing-guide (Cluster 1)
- hook-script-development-guide (Cluster 2)
- build-performance-optimizer (Cluster 3)
- hook-state-manager (Standalone 1)
- adr-documentation-helper (Standalone 2)
**Option 2: Select Specific Skills**
- User picks which clusters/standalones to convert
**Option 3: Modify Clusters**
- Split large clusters
- Merge small clusters
- Recategorize insights
- Adjust complexity levels
**Option 4: Tune Thresholds and Retry**
- Increase cluster_minimum (0.6 → 0.7) for tighter clusters
- Decrease standalone_quality (0.8 → 0.7) for more standalone skills
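As a rough illustration of how these two thresholds might partition the scores in this example, here is a minimal sketch. The names `at_least` and `classify_score` and the environment variable names are hypothetical; only the threshold values (0.6 and 0.8) and the sample scores come from this document, and the real generator may combine them differently.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: how the two thresholds could partition scores.
CLUSTER_MINIMUM="${CLUSTER_MINIMUM:-0.6}"        # minimum similarity to keep a cluster
STANDALONE_QUALITY="${STANDALONE_QUALITY:-0.8}"  # minimum quality for a standalone skill

# at_least VALUE THRESHOLD: succeeds when VALUE >= THRESHOLD (awk handles the decimals)
at_least() {
  awk -v v="$1" -v t="$2" 'BEGIN { exit !(v + 0 >= t + 0) }'
}

# classify_score KIND SCORE, where KIND is "cluster" or "standalone"
classify_score() {
  local kind="$1" score="$2" threshold
  case "$kind" in
    cluster)    threshold="$CLUSTER_MINIMUM" ;;
    standalone) threshold="$STANDALONE_QUALITY" ;;
    *) echo "unknown kind: $kind" >&2; return 2 ;;
  esac
  if at_least "$score" "$threshold"; then
    echo "recommended"
  else
    echo "excluded"
  fi
}

classify_score cluster 0.62      # Cluster 3 similarity
classify_score standalone 0.82   # Standalone 2 quality
classify_score standalone 0.42   # "Git Branch Naming Convention"
```

Under these assumptions, raising `CLUSTER_MINIMUM` to 0.7 would exclude Cluster 3 (0.62), while lowering `STANDALONE_QUALITY` to 0.7 would not promote either low-quality insight (0.42 and 0.38).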
---
## Proceeding to Phase 3
If the user chooses to generate "user-focused-testing-guide", the skill proceeds to Phase 3: Interactive Skill Design with the following proposal:
**Skill Design Proposal**:
- Name: `user-focused-testing-guide`
- Description: "Use PROACTIVELY when writing tests to ensure user-centric testing strategy following Testing Trophy methodology and React Testing Library best practices"
- Complexity: Standard
- Pattern: Validation
- Structure:
- SKILL.md with validation workflow
- data/insights-reference.md with 3 source insights
- examples/query-examples.md with semantic query patterns
- templates/test-checklist.md with testing checklist
The user can then customize the proposal before generation begins.
---
**This example demonstrates**:
1. How clustering groups related insights
2. What information is presented for each cluster
3. How standalone insights are identified
4. Why some insights are excluded
5. What decisions users can make
6. How the process flows into Phase 3


@@ -0,0 +1,24 @@
# Changelog
## [0.1.0] - 2025-11-16
### Added
- Initial release
- Generated from 1 insight (Hook Deduplication Session Management)
- Phase 1: Choose Deduplication Strategy
- Phase 2: Implement Content-Based Deduplication
- Phase 3: Implement Hash Rotation
- Phase 4: Testing and Validation
- Code examples for bash hook implementation
- Troubleshooting section
### Features
- Content-based deduplication using SHA256 hashes
- Session-independent duplicate detection
- Efficient hash storage with rotation
- State management best practices
### Generated By
- insight-skill-generator v0.1.0
- Source category: hooks-and-events
- Original insight date: 2025-11-03


@@ -0,0 +1,51 @@
# Hook Deduplication Guide
Implement robust content-based deduplication for Claude Code hooks.
## Overview
This skill guides you through implementing SHA256 hash-based deduplication to prevent duplicate insights or data from being stored across sessions.
## When to Use
**Trigger Phrases**:
- "implement hook deduplication"
- "prevent duplicate insights in hooks"
- "content-based deduplication for hooks"
## Quick Start
```
# Test the skill
You: "I need to add deduplication to my hook to prevent storing the same insight twice"
Claude: [Activates hook-deduplication-guide]
- Explains content-based vs session-based strategies
- Guides implementation of SHA256 hashing
- Shows hash rotation to prevent file bloat
- Provides testing validation
```
## What You'll Get
- Content-based deduplication using SHA256
- Efficient hash storage with rotation
- Testing and validation guidance
- Best practices for hook state management
## Installation
```bash
# This is an example generated by insight-skill-generator
# Copy to your skills directory if you want to use it
cp -r examples/example-generated-skill ~/.claude/skills/hook-deduplication-guide
```
## Learn More
See [SKILL.md](SKILL.md) for complete workflow documentation.
---
**Generated by**: insight-skill-generator v0.1.0
**Source**: 1 insight from hooks-and-events category


@@ -0,0 +1,342 @@
---
name: hook-deduplication-guide
description: Use PROACTIVELY when developing Claude Code hooks to implement content-based deduplication and prevent duplicate insight storage across sessions
---
# Hook Deduplication Guide
## Overview
This skill guides you through implementing robust deduplication for Claude Code hooks, using content-based hashing instead of session-based tracking. It prevents duplicate insights from being stored while still allowing multiple unique insights per session.
**Based on 1 insight**:
- Hook Deduplication Session Management (hooks-and-events, 2025-11-03)
**Key Capabilities**:
- Content-based deduplication using SHA256 hashes
- Session-independent duplicate detection
- Efficient hash storage with rotation
- State management best practices
## When to Use This Skill
**Trigger Phrases**:
- "implement hook deduplication"
- "prevent duplicate insights in hooks"
- "content-based deduplication for hooks"
- "hook state management patterns"
**Use Cases**:
- Developing new Claude Code hooks that store data
- Refactoring hooks to prevent duplicates
- Implementing efficient state management for hooks
- Debugging duplicate data issues in hooks
**Do NOT use when**:
- Creating hooks that don't store data (read-only hooks)
- Session-based deduplication is actually desired
- Hook doesn't run frequently enough to need deduplication
## Response Style
Educational and practical - explain the why behind content-based vs. session-based deduplication, then guide implementation with code examples.
---
## Workflow
### Phase 1: Choose Deduplication Strategy
**Purpose**: Determine whether content-based or session-based deduplication is appropriate.
**Steps**:
1. **Assess hook behavior**:
- How often does the hook run? (per message, per session, per event)
- What data is being stored? (insights, logs, metrics)
- Is the same content likely to appear across sessions?
2. **Evaluate deduplication needs**:
- **Content-based**: Use when the same insight/data might appear in different sessions
- Example: the extract-explanatory-insights hook (the same insight might appear in multiple conversations)
- **Session-based**: Use when duplicates should only be prevented within a session
- Example: Error logging (same error in different sessions should be logged)
3. **Recommend strategy**:
- For insights/lessons-learned: Content-based (SHA256 hashing)
- For session logs/events: Session-based (session ID tracking)
- For unique events: No deduplication needed
**Output**: Clear recommendation on deduplication strategy.
**Common Issues**:
- **Unsure which to use**: Default to content-based for data that's meant to be unique (insights, documentation)
- **Performance concerns**: Hashing typical hook content is cheap; the cost is dominated by starting the `sha256sum` process rather than by the hash computation itself
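One way to see the difference between the two strategies is to look at what goes into the deduplication key. The sketch below is illustrative only: `hash_stdin` and `dedup_key` are hypothetical helpers, not part of the original hook. A content key ignores the session, so the same insight collides across sessions; a session key mixes in the session ID, so the same content is treated as new in each session.

```bash
#!/usr/bin/env bash
# Hypothetical helpers illustrating the two strategies from Phase 1.

# SHA256 of stdin; falls back to shasum where sha256sum is not installed.
hash_stdin() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum | awk '{print $1}'
  else
    shasum -a 256 | awk '{print $1}'
  fi
}

# dedup_key STRATEGY SESSION_ID CONTENT
#   content: key depends on the content only (duplicates caught across sessions)
#   session: key depends on session ID and content (duplicates caught per session)
dedup_key() {
  local strategy="$1" session_id="$2" content="$3"
  case "$strategy" in
    content) printf '%s' "$content" | hash_stdin ;;
    session) printf '%s\n%s' "$session_id" "$content" | hash_stdin ;;
    *) echo "unknown strategy: $strategy" >&2; return 2 ;;
  esac
}
```

With the `content` strategy, `dedup_key content s1 "x"` and `dedup_key content s2 "x"` produce the same key, whereas the `session` strategy produces different keys for the two sessions.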
---
### Phase 2: Implement Content-Based Deduplication
**Purpose**: Set up SHA256 hash-based deduplication with state management.
**Steps**:
1. **Create state directory**:
```bash
mkdir -p ~/.claude/state/hook-state/
```
2. **Initialize hash storage file**:
```bash
HASH_FILE="$HOME/.claude/state/hook-state/content-hashes.txt"
touch "$HASH_FILE"
```
3. **Implement hash generation**:
```bash
# Generate SHA256 hash of content
compute_content_hash() {
  local content="$1"
  # printf '%s' is more portable than echo -n and does not interpret backslashes
  printf '%s' "$content" | sha256sum | awk '{print $1}'
}
```
4. **Check for duplicates**:
```bash
# Returns 0 (success) if content is a duplicate, 1 if it is new,
# so that `if is_duplicate ...; then` takes the skip branch for duplicates
is_duplicate() {
  local content="$1"
  local content_hash
  content_hash=$(compute_content_hash "$content")
  # grep -Fxq exits 0 only when the hash matches a whole line in the file
  grep -Fxq "$content_hash" "$HASH_FILE"
}
```
5. **Store hash after processing**:
```bash
store_content_hash() {
  local content="$1"
  local content_hash=$(compute_content_hash "$content")
  echo "$content_hash" >> "$HASH_FILE"
}
```
6. **Integrate into hook**:
```bash
# In your hook script
content="extracted insight or data"
if is_duplicate "$content"; then
  # Skip - duplicate content
  echo "Duplicate detected, skipping..." >&2
  exit 0
fi
# Process new content
process_content "$content"
# Store hash to prevent future duplicates
store_content_hash "$content"
```
**Output**: Working content-based deduplication in your hook.
**Common Issues**:
- **Hash file grows too large**: Implement rotation (see Phase 3)
- **Missed duplicates (false negatives)**: Normalize content before hashing so that whitespace or formatting differences do not produce different hashes
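The normalization step is mentioned but not shown, so here is a minimal sketch of one possible approach. `normalize_content` and `compute_normalized_hash` are hypothetical names; the rule used here (collapse every run of whitespace to one space and trim both ends) is an assumption, and stricter or looser rules may suit other content. The hash helper also falls back to `shasum -a 256` for systems that do not ship `sha256sum`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: canonicalize content so formatting-only changes hash identically.
normalize_content() {
  # Turn every whitespace run (spaces, tabs, newlines) into one space, then trim both ends.
  printf '%s' "$1" | tr -s '[:space:]' ' ' | sed -e 's/^ //' -e 's/ $//'
}

# SHA256 of the normalized content; uses shasum where sha256sum is not installed.
compute_normalized_hash() {
  local normalized
  normalized="$(normalize_content "$1")"
  if command -v sha256sum >/dev/null 2>&1; then
    printf '%s' "$normalized" | sha256sum | awk '{print $1}'
  else
    printf '%s' "$normalized" | shasum -a 256 | awk '{print $1}'
  fi
}
```

Swapping `compute_normalized_hash` in for `compute_content_hash` would make `"Use  getByRole "` and `"Use getByRole"` hash to the same value. Lowercasing is deliberately left out, since case can be meaningful in code identifiers.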
---
### Phase 3: Implement Hash Rotation
**Purpose**: Prevent hash file from growing indefinitely.
**Steps**:
1. **Set rotation limit**:
```bash
MAX_HASHES=10000 # Keep last 10,000 hashes
```
2. **Implement rotation logic**:
```bash
rotate_hash_file() {
  local hash_file="$1"
  local max_hashes="${2:-10000}"
  # Count current hashes
  local current_count=$(wc -l < "$hash_file")
  # Rotate if needed
  if [ "$current_count" -gt "$max_hashes" ]; then
    tail -n "$max_hashes" "$hash_file" > "${hash_file}.tmp"
    mv "${hash_file}.tmp" "$hash_file"
    echo "Rotated hash file: kept last $max_hashes hashes" >&2
  fi
}
```
3. **Call rotation periodically**:
```bash
# After storing new hash
store_content_hash "$content"
rotate_hash_file "$HASH_FILE" "$MAX_HASHES"
```
**Output**: Self-maintaining hash storage with bounded size.
**Common Issues**:
- **Rotation too aggressive**: Increase MAX_HASHES
- **Rotation too infrequent**: Consider checking count before every append
---
### Phase 4: Testing and Validation
**Purpose**: Verify deduplication works correctly.
**Steps**:
1. **Test duplicate detection**:
```bash
# First run - should process
echo "Test insight" | your_hook.sh
# Check: Content was processed
# Second run - should skip
echo "Test insight" | your_hook.sh
# Check: Duplicate detected message
```
2. **Test multiple unique items**:
```bash
echo "Insight 1" | your_hook.sh # Processed
echo "Insight 2" | your_hook.sh # Processed
echo "Insight 3" | your_hook.sh # Processed
echo "Insight 1" | your_hook.sh # Skipped (duplicate)
```
3. **Verify hash file**:
```bash
cat ~/.claude/state/hook-state/content-hashes.txt
# Should show one hash per unique input: 3 from step 2 (not 4), plus 1 more if you also ran step 1
```
4. **Test rotation**:
```bash
# Generate more than MAX_HASHES entries
for i in {1..10500}; do
  echo "Insight $i" | your_hook.sh
done
# Verify file size bounded
wc -l ~/.claude/state/hook-state/content-hashes.txt
# Should be at most 10000, not 10500
```
**Output**: Confirmed working deduplication with proper rotation.
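Because `your_hook.sh` above is a placeholder, the same checks can also be exercised without a real hook. The following is a self-contained sketch under stated assumptions: `hash_of` and `fake_hook` are hypothetical stand-ins, and the hash file is a temporary file rather than the real state file, so running it does not touch `~/.claude/state/`.

```bash
#!/usr/bin/env bash
# Self-contained check of the deduplication flow against a throwaway hash file.
HASH_FILE="$(mktemp)"             # temporary state file, removed on exit
trap 'rm -f "$HASH_FILE"' EXIT

# SHA256 of the argument; falls back to shasum where sha256sum is not installed.
hash_of() {
  if command -v sha256sum >/dev/null 2>&1; then
    printf '%s' "$1" | sha256sum | awk '{print $1}'
  else
    printf '%s' "$1" | shasum -a 256 | awk '{print $1}'
  fi
}

# Stand-in for a hook: prints "processed" or "skipped" for one piece of content.
fake_hook() {
  local h
  h="$(hash_of "$1")"
  if grep -Fxq "$h" "$HASH_FILE"; then
    echo "skipped"
  else
    echo "$h" >> "$HASH_FILE"
    echo "processed"
  fi
}

# Three unique insights followed by a repeat of the first one.
for c in "Insight 1" "Insight 2" "Insight 3" "Insight 1"; do
  printf '%s: %s\n' "$c" "$(fake_hook "$c")"
done
echo "hashes stored: $(wc -l < "$HASH_FILE" | tr -d ' ')"
```

The first three inputs should be reported as processed and the repeated "Insight 1" as skipped, leaving three hashes in the temporary file.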
---
## Reference Materials
- [Original Insight](data/insights-reference.md) - Full context on hook deduplication patterns
---
## Important Reminders
- **Use content-based deduplication for insights/documentation** - prevents duplicates across sessions
- **Use session-based deduplication for logs/events** - same event in different sessions is meaningful
- **Normalize content before hashing** - whitespace differences shouldn't create false negatives
- **Implement rotation** - prevent unbounded hash file growth
- **Hash storage location**: `~/.claude/state/hook-state/` (not project-specific)
- **SHA256 is fast** - no performance concerns for typical hook data
- **Test both paths** - verify both new content and duplicates work correctly
**Warnings**:
- ⚠️ **Do not use session ID alone** - it drops additional unique insights within a session and cannot detect the same insight repeated in a different session
- ⚠️ **Do not skip rotation** - hash file will grow indefinitely
- ⚠️ **Do not hash before normalization** - formatting changes will cause false negatives
---
## Best Practices
1. **Choose the Right Strategy**: Content-based for unique data, session-based for session-specific events
2. **Normalize Before Hashing**: Strip whitespace, lowercase if appropriate, consistent formatting
3. **Efficient Storage**: Use grep -Fxq for fast hash lookups (fixed-string, whole-line match, quiet)
4. **Bounded Growth**: Implement rotation to prevent file bloat
5. **Clear Logging**: Log when duplicates are detected for debugging
6. **State Location**: Use ~/.claude/state/hook-state/ for cross-project state
---
## Troubleshooting
### Duplicates not being detected
**Symptoms**: Same content processed multiple times
**Solution**:
1. Check hash file exists and is writable
2. Verify store_content_hash is called after processing
3. Check content normalization (whitespace differences)
4. Verify grep command uses -Fxq flags
**Prevention**: Test deduplication immediately after implementation
---
### Hash file growing too large
**Symptoms**: Hash file exceeds MAX_HASHES significantly
**Solution**:
1. Verify rotate_hash_file is called
2. Check MAX_HASHES value is reasonable
3. Manually rotate if needed: `f=~/.claude/state/hook-state/content-hashes.txt; tail -n 10000 "$f" > "$f.tmp" && mv "$f.tmp" "$f"`
**Prevention**: Call rotation after every hash storage
---
### False positives (new content marked as duplicate)
**Symptoms**: Different content being skipped
**Solution**:
1. Check for hash collisions (extremely unlikely with SHA256)
2. Verify content is actually different
3. Check normalization isn't too aggressive
4. Review recent hashes in file
**Prevention**: Use consistent normalization, test with diverse content
---
## Next Steps
After implementing deduplication:
1. Monitor hash file growth over time
2. Tune MAX_HASHES based on usage patterns
3. Consider adding metrics (duplicates prevented, storage size)
4. Share pattern with team for other hooks
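For the metrics suggestion, a minimal sketch might look like the following. Everything here is an assumption for illustration: the `dedup-metrics.log` file name, the one-event-per-line log format, and the helper names `record_event`, `hash_count`, and `summarize_metrics` are hypothetical and not part of the generated skill.

```bash
#!/usr/bin/env bash
# Hypothetical metrics helpers for a deduplicating hook.
HASH_FILE="${HASH_FILE:-$HOME/.claude/state/hook-state/content-hashes.txt}"
METRICS_FILE="${METRICS_FILE:-$HOME/.claude/state/hook-state/dedup-metrics.log}"  # assumed path

# record_event processed|duplicate : append one timestamped event line.
record_event() {
  mkdir -p "$(dirname "$METRICS_FILE")"
  printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" >> "$METRICS_FILE"
}

# Number of hashes currently stored (0 if the hash file does not exist yet).
hash_count() {
  if [ -f "$HASH_FILE" ]; then
    wc -l < "$HASH_FILE" | tr -d ' '
  else
    echo 0
  fi
}

# Print counts of processed and duplicate events plus the current storage size.
summarize_metrics() {
  awk -v stored="$(hash_count)" '
    { n[$2]++ }
    END { printf "processed=%d duplicates=%d stored_hashes=%d\n", n["processed"], n["duplicate"], stored }
  ' "$METRICS_FILE"
}
```

A hook could call `record_event duplicate` in its skip branch and `record_event processed` after storing a hash, then run `summarize_metrics` occasionally to see how many duplicates were prevented.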
---
## Metadata
**Source Insights**:
- Session: abc123-session-id
- Date: 2025-11-03
- Category: hooks-and-events
- File: docs/lessons-learned/hooks-and-events/2025-11-03-hook-deduplication.md
**Skill Version**: 0.1.0
**Generated**: 2025-11-16
**Last Updated**: 2025-11-16


@@ -0,0 +1,116 @@
# Insights Reference: hook-deduplication-guide
This document contains the original insight from Claude Code's Explanatory output style that was used to create the **Hook Deduplication Guide** skill.
## Overview
**Total Insights**: 1
**Date Range**: 2025-11-03
**Categories**: hooks-and-events
**Sessions**: 1 unique session
---
## 1. Hook Deduplication Session Management
**Metadata**:
- **Date**: 2025-11-03
- **Category**: hooks-and-events
- **Session**: abc123-session-id
- **Source File**: docs/lessons-learned/hooks-and-events/2025-11-03-hook-deduplication.md
**Original Content**:
The extract-explanatory-insights hook initially used session-based deduplication, which stored at most one insight per session. This had two drawbacks: additional unique insights from the same session were dropped, and because the check was keyed on the session rather than the content, it could not recognize the same insight when it reappeared in a different session.
By switching to content-based deduplication using SHA256 hashing, we can:
1. **Allow multiple unique insights per session** - Different insights in the same conversation are all preserved
2. **Prevent true duplicates across sessions** - The same insight appearing in multiple conversations is stored only once
3. **Maintain efficient storage** - Hash file rotation keeps storage bounded
The implementation involves:
**Hash Generation**:
```bash
compute_content_hash() {
  local content="$1"
  # printf '%s' is more portable than echo -n and does not interpret backslashes
  printf '%s' "$content" | sha256sum | awk '{print $1}'
}
```
**Duplicate Detection**:
```bash
# Returns 0 (success) if content is a duplicate, 1 if it is new
is_duplicate() {
  local content="$1"
  local content_hash
  content_hash=$(compute_content_hash "$content")
  # grep -Fxq exits 0 only when the hash matches a whole line in the file
  grep -Fxq "$content_hash" "$HASH_FILE"
}
```
**Hash Storage with Rotation**:
```bash
store_content_hash() {
  local content="$1"
  local content_hash=$(compute_content_hash "$content")
  local max_hashes="${MAX_HASHES:-10000}"
  echo "$content_hash" >> "$HASH_FILE"
  # Rotate if file exceeds MAX_HASHES (defaults to 10000)
  local count=$(wc -l < "$HASH_FILE")
  if [ "$count" -gt "$max_hashes" ]; then
    tail -n "$max_hashes" "$HASH_FILE" > "${HASH_FILE}.tmp"
    mv "${HASH_FILE}.tmp" "$HASH_FILE"
  fi
}
```
This approach provides the best of both worlds: session independence and true deduplication based on content, not session boundaries.
---
## How This Insight Informs the Skill
### Hook Deduplication Session Management → Phase-Based Workflow
The insight's structure (problem → solution → implementation) maps directly to the skill's phases:
- **Problem Description** → Phase 1: Choose Deduplication Strategy
- Explains why session-based is insufficient
- Defines when content-based is needed
- **Solution Explanation** → Phase 2: Implement Content-Based Deduplication
- Hash generation logic
- Duplicate detection mechanism
- State file management
- **Implementation Details** → Phase 3: Implement Hash Rotation
- Rotation logic to prevent unbounded growth
- MAX_HASHES configuration
- **Code Examples** → All phases
- Bash functions extracted and integrated into workflow steps
---
## Additional Context
**Why This Insight Was Selected**:
This insight was selected for skill generation because it:
1. Provides a complete, actionable pattern
2. Includes working code examples
3. Solves a common problem in hook development
4. Is generally applicable (not project-specific)
5. Has clear benefits over the naive approach
**Quality Score**: 0.85 (high - qualified for standalone skill)
---
**Generated**: 2025-11-16
**Last Updated**: 2025-11-16


@@ -0,0 +1,15 @@
{
"name": "hook-deduplication-guide",
"version": "0.1.0",
"description": "Use PROACTIVELY when developing Claude Code hooks to implement content-based deduplication and prevent duplicate insight storage across sessions",
"type": "skill",
"author": "Connor",
"category": "productivity",
"tags": [
"hooks",
"deduplication",
"state-management",
"bash",
"generated-from-insights"
]
}