Initial commit

Zhongwei Li
2025-11-30 08:32:16 +08:00
commit f3ffbc28d8
9 changed files with 949 additions and 0 deletions

SKILL.md

@@ -0,0 +1,149 @@
---
name: csv-data-summarizer
description: Analyzes CSV files, generates summary stats, and plots quick visualizations using Python and pandas.
metadata:
  version: 2.1.0
  dependencies: python>=3.8, pandas>=2.0.0, matplotlib>=3.7.0, seaborn>=0.12.0
---
# CSV Data Summarizer
This Skill analyzes CSV files and provides comprehensive summaries with statistical insights and visualizations.
## When to Use This Skill
Claude should use this Skill whenever the user:
- Uploads or references a CSV file
- Asks to summarize, analyze, or visualize tabular data
- Requests insights from CSV data
- Wants to understand data structure and quality
## How It Works
### ⚠️ CRITICAL BEHAVIOR REQUIREMENT ⚠️
**DO NOT ASK THE USER WHAT THEY WANT TO DO WITH THE DATA.**
**DO NOT OFFER OPTIONS OR CHOICES.**
**DO NOT SAY "What would you like me to help you with?"**
**DO NOT LIST POSSIBLE ANALYSES.**
**IMMEDIATELY AND AUTOMATICALLY:**
1. Run the comprehensive analysis
2. Generate ALL relevant visualizations
3. Present complete results
4. NO questions, NO options, NO waiting for user input
**THE USER WANTS A FULL ANALYSIS RIGHT AWAY - JUST DO IT.**
### Automatic Analysis Steps
**The skill intelligently adapts to different data types and industries by inspecting the data first, then determining which analyses are most relevant; a short sketch of this inspection step follows the list below.**
1. **Load and inspect** the CSV file into pandas DataFrame
2. **Identify data structure** - column types, date columns, numeric columns, categories
3. **Determine relevant analyses** based on what's actually in the data:
- **Sales/E-commerce data** (order dates, revenue, products): Time-series trends, revenue analysis, product performance
- **Customer data** (demographics, segments, regions): Distribution analysis, segmentation, geographic patterns
- **Financial data** (transactions, amounts, dates): Trend analysis, statistical summaries, correlations
- **Operational data** (timestamps, metrics, status): Time-series, performance metrics, distributions
- **Survey data** (categorical responses, ratings): Frequency analysis, cross-tabulations, distributions
- **Generic tabular data**: Adapts based on column types found
4. **Only create visualizations that make sense** for the specific dataset:
- Time-series plots ONLY if date/timestamp columns exist
- Correlation heatmaps ONLY if multiple numeric columns exist
- Category distributions ONLY if categorical columns exist
- Histograms for numeric distributions when relevant
5. **Generate comprehensive output** automatically including:
- Data overview (rows, columns, types)
- Key statistics and metrics relevant to the data type
- Missing data analysis
- Multiple relevant visualizations (only those that apply)
- Actionable insights based on patterns found in THIS specific dataset
6. **Present everything** in one complete analysis - no follow-up questions
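A minimal sketch of the inspection step described above (the input filename is illustrative; the detection rules mirror those in `analyze.py`):

```python
import pandas as pd

df = pd.read_csv("orders.csv")  # hypothetical input file

# Column types determine which analyses and charts apply
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(include=["object"]).columns.tolist()
date_cols = [c for c in df.columns if "date" in c.lower() or "time" in c.lower()]

run_correlations = len(numeric_cols) > 1  # heatmap needs 2+ numeric columns
run_time_series = bool(date_cols)         # trend plots need a date/time column
run_categories = bool(categorical_cols)   # frequency charts need categories
```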
**Example adaptations:**
- Healthcare data with patient IDs → Focus on demographics, treatment patterns, temporal trends
- Inventory data with stock levels → Focus on quantity distributions, reorder patterns, SKU analysis
- Web analytics with timestamps → Focus on traffic patterns, conversion metrics, time-of-day analysis
- Survey responses → Focus on response distributions, demographic breakdowns, sentiment patterns
### Behavior Guidelines
**CORRECT APPROACH - SAY THIS:**
- "I'll analyze this data comprehensively right now."
- "Here's the complete analysis with visualizations:"
- "I've identified this as [type] data and generated relevant insights:"
- Then IMMEDIATELY show the full analysis
**DO:**
- Immediately run the analysis script
- Generate ALL relevant charts automatically
- Provide complete insights without being asked
- Be thorough and complete in first response
- Act decisively without asking permission
**NEVER SAY THESE PHRASES:**
- "What would you like to do with this data?"
- "What would you like me to help you with?"
- "Here are some common options:"
- "Let me know what you'd like help with"
- "I can create a comprehensive analysis if you'd like!"
- Any sentence ending with "?" asking for user direction
- Any list of options or choices
- Any conditional "I can do X if you want"
**FORBIDDEN BEHAVIORS:**
- Asking what the user wants
- Listing options for the user to choose from
- Waiting for user direction before analyzing
- Providing partial analysis that requires follow-up
- Describing what you COULD do instead of DOING it
### Usage
The Skill provides a Python function `summarize_csv(file_path)` that:
- Accepts a path to a CSV file
- Returns a comprehensive text summary with statistics
- Generates multiple visualizations automatically based on data structure
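A minimal usage sketch, assuming `analyze.py` is importable from the working directory:

```python
from analyze import summarize_csv

# Prints the text report; any charts (e.g. correlation_heatmap.png)
# are saved as PNG files in the current working directory.
report = summarize_csv("resources/sample.csv")
print(report)
```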
### Example Prompts
> "Here's `sales_data.csv`. Can you summarize this file?"
> "Analyze this customer data CSV and show me trends."
> "What insights can you find in `orders.csv`?"
### Example Output
**Dataset Overview**
- 5,000 rows × 8 columns
- 3 numeric columns, 1 date column
**Summary Statistics**
- Average order value: $58.20
- Standard deviation: $12.40
- Missing values: 0.25% (100 cells)
**Insights**
- Sales show upward trend over time
- Peak activity in Q4
*(Attached: trend plot)*
## Files
- `analyze.py` - Core analysis logic
- `requirements.txt` - Python dependencies
- `resources/sample.csv` - Example dataset for testing
- `resources/README.md` - Additional documentation
## Notes
- Automatically detects date columns (columns containing 'date' or 'time' in the name)
- Handles missing data gracefully
- Generates time-series plots only when a date column is present
- All numeric columns are included in the statistical summary

analyze.py

@@ -0,0 +1,182 @@
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


def summarize_csv(file_path):
    """
    Comprehensively analyzes a CSV file and generates multiple visualizations.

    Args:
        file_path (str): Path to the CSV file

    Returns:
        str: Formatted comprehensive analysis of the dataset
    """
    df = pd.read_csv(file_path)
    summary = []
    charts_created = []

    # Basic info
    summary.append("=" * 60)
    summary.append("📊 DATA OVERVIEW")
    summary.append("=" * 60)
    summary.append(f"Rows: {df.shape[0]:,} | Columns: {df.shape[1]}")
    summary.append(f"\nColumns: {', '.join(df.columns.tolist())}")

    # Data types
    summary.append("\n📋 DATA TYPES:")
    for col, dtype in df.dtypes.items():
        summary.append(f"{col}: {dtype}")

    # Missing data analysis
    missing = df.isnull().sum().sum()
    missing_pct = (missing / (df.shape[0] * df.shape[1])) * 100
    summary.append("\n🔍 DATA QUALITY:")
    if missing:
        summary.append(f"Missing values: {missing:,} ({missing_pct:.2f}% of total data)")
        summary.append("Missing by column:")
        for col in df.columns:
            col_missing = df[col].isnull().sum()
            if col_missing > 0:
                col_pct = (col_missing / len(df)) * 100
                summary.append(f"{col}: {col_missing:,} ({col_pct:.1f}%)")
    else:
        summary.append("✓ No missing values - dataset is complete!")

    # Numeric analysis
    numeric_cols = df.select_dtypes(include='number').columns.tolist()
    if numeric_cols:
        summary.append("\n📈 NUMERICAL ANALYSIS:")
        summary.append(str(df[numeric_cols].describe()))

        # Correlations if multiple numeric columns
        if len(numeric_cols) > 1:
            summary.append("\n🔗 CORRELATIONS:")
            corr_matrix = df[numeric_cols].corr()
            summary.append(str(corr_matrix))

            # Create correlation heatmap
            plt.figure(figsize=(10, 8))
            sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0,
                        square=True, linewidths=1)
            plt.title('Correlation Heatmap')
            plt.tight_layout()
            plt.savefig('correlation_heatmap.png', dpi=150)
            plt.close()
            charts_created.append('correlation_heatmap.png')

    # Categorical analysis (ID-like columns are excluded)
    categorical_cols = df.select_dtypes(include=['object']).columns.tolist()
    categorical_cols = [c for c in categorical_cols if 'id' not in c.lower()]
    if categorical_cols:
        summary.append("\n📊 CATEGORICAL ANALYSIS:")
        for col in categorical_cols[:5]:  # Limit to first 5 columns
            value_counts = df[col].value_counts()
            summary.append(f"\n{col}:")
            for val, count in value_counts.head(10).items():
                pct = (count / len(df)) * 100
                summary.append(f"{val}: {count:,} ({pct:.1f}%)")

    # Time-series analysis (runs only when a date/time column is found)
    date_cols = [c for c in df.columns if 'date' in c.lower() or 'time' in c.lower()]
    if date_cols:
        summary.append("\n📅 TIME SERIES ANALYSIS:")
        date_col = date_cols[0]
        df[date_col] = pd.to_datetime(df[date_col], errors='coerce')
        date_range = df[date_col].max() - df[date_col].min()
        summary.append(f"Date range: {df[date_col].min()} to {df[date_col].max()}")
        summary.append(f"Span: {date_range.days} days")

        # Create time-series plots for up to three numeric columns
        if numeric_cols:
            n_plots = min(3, len(numeric_cols))
            fig, axes = plt.subplots(n_plots, 1, figsize=(12, 4 * n_plots))
            if n_plots == 1:
                axes = [axes]  # Keep indexing uniform for a single subplot
            for idx, num_col in enumerate(numeric_cols[:3]):
                ax = axes[idx]
                daily_mean = df.groupby(date_col)[num_col].mean()
                daily_mean.plot(ax=ax, label='Average', linewidth=2)
                ax.set_title(f'{num_col} Over Time')
                ax.set_xlabel('Date')
                ax.set_ylabel(num_col)
                ax.legend()
                ax.grid(True, alpha=0.3)
            plt.tight_layout()
            plt.savefig('time_series_analysis.png', dpi=150)
            plt.close()
            charts_created.append('time_series_analysis.png')

    # Distribution plots for up to four numeric columns
    if numeric_cols:
        fig, axes = plt.subplots(2, 2, figsize=(12, 10))
        axes = axes.flatten()
        for idx, col in enumerate(numeric_cols[:4]):
            axes[idx].hist(df[col].dropna(), bins=30, edgecolor='black', alpha=0.7)
            axes[idx].set_title(f'Distribution of {col}')
            axes[idx].set_xlabel(col)
            axes[idx].set_ylabel('Frequency')
            axes[idx].grid(True, alpha=0.3)
        # Hide unused subplots
        for idx in range(len(numeric_cols[:4]), 4):
            axes[idx].set_visible(False)
        plt.tight_layout()
        plt.savefig('distributions.png', dpi=150)
        plt.close()
        charts_created.append('distributions.png')

    # Categorical distributions
    if categorical_cols:
        fig, axes = plt.subplots(2, 2, figsize=(14, 10))
        axes = axes.flatten()
        for idx, col in enumerate(categorical_cols[:4]):
            value_counts = df[col].value_counts().head(10)
            axes[idx].barh(range(len(value_counts)), value_counts.values)
            axes[idx].set_yticks(range(len(value_counts)))
            axes[idx].set_yticklabels(value_counts.index)
            axes[idx].set_title(f'Top Values in {col}')
            axes[idx].set_xlabel('Count')
            axes[idx].grid(True, alpha=0.3, axis='x')
        # Hide unused subplots
        for idx in range(len(categorical_cols[:4]), 4):
            axes[idx].set_visible(False)
        plt.tight_layout()
        plt.savefig('categorical_distributions.png', dpi=150)
        plt.close()
        charts_created.append('categorical_distributions.png')

    # Summary of visualizations
    if charts_created:
        summary.append("\n📊 VISUALIZATIONS CREATED:")
        for chart in charts_created:
            summary.append(f"{chart}")

    summary.append("\n" + "=" * 60)
    summary.append("✅ COMPREHENSIVE ANALYSIS COMPLETE")
    summary.append("=" * 60)

    return "\n".join(summary)


if __name__ == "__main__":
    # Analyze the file given on the command line, or fall back to the sample
    import sys
    if len(sys.argv) > 1:
        file_path = sys.argv[1]
    else:
        file_path = "resources/sample.csv"
    print(summarize_csv(file_path))

requirements.txt

@@ -0,0 +1,4 @@
pandas>=2.0.0
matplotlib>=3.7.0
seaborn>=0.12.0