Initial commit
149
skills/csv-data-summarizer/SKILL.md
Normal file
@@ -0,0 +1,149 @@
---
name: csv-data-summarizer
description: Analyzes CSV files, generates summary stats, and plots quick visualizations using Python and pandas.
metadata:
  version: 2.1.0
  dependencies: python>=3.8, pandas>=2.0.0, matplotlib>=3.7.0, seaborn>=0.12.0
---

# CSV Data Summarizer

This Skill analyzes CSV files and provides comprehensive summaries with statistical insights and visualizations.

## When to Use This Skill

Claude should use this Skill whenever the user:
- Uploads or references a CSV file
- Asks to summarize, analyze, or visualize tabular data
- Requests insights from CSV data
- Wants to understand data structure and quality

## How It Works

## ⚠️ CRITICAL BEHAVIOR REQUIREMENT ⚠️

**DO NOT ASK THE USER WHAT THEY WANT TO DO WITH THE DATA.**
**DO NOT OFFER OPTIONS OR CHOICES.**
**DO NOT SAY "What would you like me to help you with?"**
**DO NOT LIST POSSIBLE ANALYSES.**

**IMMEDIATELY AND AUTOMATICALLY:**
1. Run the comprehensive analysis
2. Generate ALL relevant visualizations
3. Present complete results
4. NO questions, NO options, NO waiting for user input

**THE USER WANTS A FULL ANALYSIS RIGHT AWAY - JUST DO IT.**

### Automatic Analysis Steps

**The skill intelligently adapts to different data types and industries by inspecting the data first, then determining which analyses are most relevant.**

1. **Load and inspect** the CSV file into a pandas DataFrame
2. **Identify data structure** - column types, date columns, numeric columns, categories (see the sketch after this list)
3. **Determine relevant analyses** based on what's actually in the data:
- **Sales/E-commerce data** (order dates, revenue, products): Time-series trends, revenue analysis, product performance
|
||||
- **Customer data** (demographics, segments, regions): Distribution analysis, segmentation, geographic patterns
|
||||
- **Financial data** (transactions, amounts, dates): Trend analysis, statistical summaries, correlations
|
||||
- **Operational data** (timestamps, metrics, status): Time-series, performance metrics, distributions
|
||||
- **Survey data** (categorical responses, ratings): Frequency analysis, cross-tabulations, distributions
|
||||
- **Generic tabular data**: Adapts based on column types found
|
||||
|
||||
4. **Only create visualizations that make sense** for the specific dataset:
|
||||
- Time-series plots ONLY if date/timestamp columns exist
|
||||
- Correlation heatmaps ONLY if multiple numeric columns exist
|
||||
- Category distributions ONLY if categorical columns exist
|
||||
- Histograms for numeric distributions when relevant
|
||||
|
||||
5. **Generate comprehensive output** automatically including:
|
||||
- Data overview (rows, columns, types)
|
||||
- Key statistics and metrics relevant to the data type
|
||||
- Missing data analysis
|
||||
- Multiple relevant visualizations (only those that apply)
|
||||
- Actionable insights based on patterns found in THIS specific dataset
|
||||
|
||||
6. **Present everything** in one complete analysis - no follow-up questions
|
||||
|
||||
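
As a sketch of the inspection in step 2, the pass can be as simple as bucketing columns by dtype and name. This is a minimal illustration, assuming the same 'date'/'time' name heuristic that `analyze.py` uses; the helper name `classify_columns` is hypothetical:

```python
import pandas as pd

def classify_columns(df: pd.DataFrame) -> dict:
    """Bucket columns by role so later steps know which analyses apply (hypothetical helper)."""
    numeric = df.select_dtypes(include='number').columns.tolist()
    categorical = df.select_dtypes(include=['object']).columns.tolist()
    # Name-based heuristic, as in analyze.py: 'date' or 'time' in the column name
    dates = [c for c in df.columns if 'date' in c.lower() or 'time' in c.lower()]
    return {'numeric': numeric, 'categorical': categorical, 'dates': dates}

df = pd.DataFrame({'order_date': ['2024-01-01'], 'revenue': [59.9], 'region': ['EU']})
print(classify_columns(df))
# {'numeric': ['revenue'], 'categorical': ['order_date', 'region'], 'dates': ['order_date']}
```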
**Example adaptations:**
|
||||
- Healthcare data with patient IDs → Focus on demographics, treatment patterns, temporal trends
|
||||
- Inventory data with stock levels → Focus on quantity distributions, reorder patterns, SKU analysis
|
||||
- Web analytics with timestamps → Focus on traffic patterns, conversion metrics, time-of-day analysis
|
||||
- Survey responses → Focus on response distributions, demographic breakdowns, sentiment patterns
|
||||
|
||||
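
Concretely, the gating in steps 3-4 amounts to a small dispatch on the detected structure. A hedged sketch, assuming the `classify_columns` output above (the helper `pick_analyses` is hypothetical; the conditions mirror those in `analyze.py`):

```python
def pick_analyses(cols: dict) -> list:
    """Map detected column roles to the analyses worth running (hypothetical helper)."""
    chosen = ['overview', 'missing_data']
    if cols['numeric']:
        chosen += ['summary_stats', 'histograms']
    if len(cols['numeric']) > 1:
        chosen.append('correlation_heatmap')   # only with 2+ numeric columns
    if cols['dates'] and cols['numeric']:
        chosen.append('time_series_plots')     # only with a date/timestamp column
    if cols['categorical']:
        chosen.append('category_distributions')
    return chosen

print(pick_analyses({'numeric': ['revenue'], 'categorical': ['region'], 'dates': []}))
# ['overview', 'missing_data', 'summary_stats', 'histograms', 'category_distributions']
```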

### Behavior Guidelines

✅ **CORRECT APPROACH - SAY THIS:**
- "I'll analyze this data comprehensively right now."
- "Here's the complete analysis with visualizations:"
- "I've identified this as [type] data and generated relevant insights:"
- Then IMMEDIATELY show the full analysis

✅ **DO:**
- Immediately run the analysis script
- Generate ALL relevant charts automatically
- Provide complete insights without being asked
- Be thorough and complete in the first response
- Act decisively without asking permission

❌ **NEVER SAY THESE PHRASES:**
- "What would you like to do with this data?"
- "What would you like me to help you with?"
- "Here are some common options:"
- "Let me know what you'd like help with"
- "I can create a comprehensive analysis if you'd like!"
- Any sentence ending with "?" asking for user direction
- Any list of options or choices
- Any conditional "I can do X if you want"

❌ **FORBIDDEN BEHAVIORS:**
- Asking what the user wants
- Listing options for the user to choose from
- Waiting for user direction before analyzing
- Providing partial analysis that requires follow-up
- Describing what you COULD do instead of DOING it

### Usage

The Skill provides a Python function `summarize_csv(file_path)` that:
- Accepts a path to a CSV file
- Returns a comprehensive text summary with statistics
- Generates multiple visualizations automatically based on the data structure (see the example call below)
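
A minimal example of calling it directly (`sales_data.csv` is a placeholder file name):

```python
from analyze import summarize_csv

# Prints the text report; charts such as distributions.png are written
# to the working directory as a side effect.
report = summarize_csv("sales_data.csv")
print(report)
```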

### Example Prompts

> "Here's `sales_data.csv`. Can you summarize this file?"

> "Analyze this customer data CSV and show me trends."

> "What insights can you find in `orders.csv`?"

### Example Output

**Dataset Overview**
- 5,000 rows × 8 columns
- 3 numeric columns, 1 date column

**Summary Statistics**
- Average order value: $58.20
- Standard deviation: $12.40
- Missing values: 2% (800 of 40,000 cells)

**Insights**
- Sales show an upward trend over time
- Peak activity in Q4

*(Attached: trend plot)*

## Files

- `analyze.py` - Core analysis logic
- `requirements.txt` - Python dependencies
- `resources/sample.csv` - Example dataset for testing
- `resources/README.md` - Additional documentation

## Notes

- Automatically detects date columns (columns with 'date' or 'time' in the name)
- Handles missing data gracefully (see the sketch below)
- Generates time-series visualizations only when date columns are present; the other charts depend only on the column types found
- All numeric columns are included in the statistical summary
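
The "gracefully" above boils down to two pandas idioms used in `analyze.py`, condensed here into a standalone sketch:

```python
import pandas as pd

df = pd.DataFrame({'order_date': ['2024-01-05', 'not a date'], 'amount': [10.0, None]})

# Unparseable dates become NaT instead of raising, so the analysis continues
df['order_date'] = pd.to_datetime(df['order_date'], errors='coerce')

# NaNs are dropped before statistics and plotting rather than propagating
print(df['amount'].dropna().mean())  # 10.0
```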

182
skills/csv-data-summarizer/analyze.py
Normal file
@@ -0,0 +1,182 @@
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


def summarize_csv(file_path):
    """
    Comprehensively analyzes a CSV file and generates multiple visualizations.

    Args:
        file_path (str): Path to the CSV file

    Returns:
        str: Formatted comprehensive analysis of the dataset
    """
    df = pd.read_csv(file_path)
    summary = []
    charts_created = []

    # Basic info
    summary.append("=" * 60)
    summary.append("📊 DATA OVERVIEW")
    summary.append("=" * 60)
    summary.append(f"Rows: {df.shape[0]:,} | Columns: {df.shape[1]}")
    summary.append(f"\nColumns: {', '.join(df.columns.tolist())}")

    # Data types
    summary.append("\n📋 DATA TYPES:")
    for col, dtype in df.dtypes.items():
        summary.append(f"  • {col}: {dtype}")

    # Missing data analysis
    missing = df.isnull().sum().sum()
    missing_pct = (missing / (df.shape[0] * df.shape[1])) * 100
    summary.append("\n🔍 DATA QUALITY:")
    if missing:
        summary.append(f"Missing values: {missing:,} ({missing_pct:.2f}% of total data)")
        summary.append("Missing by column:")
        for col in df.columns:
            col_missing = df[col].isnull().sum()
            if col_missing > 0:
                col_pct = (col_missing / len(df)) * 100
                summary.append(f"  • {col}: {col_missing:,} ({col_pct:.1f}%)")
    else:
        summary.append("✓ No missing values - dataset is complete!")

    # Numeric analysis
    numeric_cols = df.select_dtypes(include='number').columns.tolist()
    if numeric_cols:
        summary.append("\n📈 NUMERICAL ANALYSIS:")
        summary.append(str(df[numeric_cols].describe()))

        # Correlations if multiple numeric columns
        if len(numeric_cols) > 1:
            summary.append("\n🔗 CORRELATIONS:")
            corr_matrix = df[numeric_cols].corr()
            summary.append(str(corr_matrix))

            # Create correlation heatmap
            plt.figure(figsize=(10, 8))
            sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0,
                        square=True, linewidths=1)
            plt.title('Correlation Heatmap')
            plt.tight_layout()
            plt.savefig('correlation_heatmap.png', dpi=150)
            plt.close()
            charts_created.append('correlation_heatmap.png')

    # Categorical analysis (columns with 'id' in the name are skipped,
    # since ID-like fields are rarely meaningful categories)
    categorical_cols = df.select_dtypes(include=['object']).columns.tolist()
    categorical_cols = [c for c in categorical_cols if 'id' not in c.lower()]

    if categorical_cols:
        summary.append("\n📊 CATEGORICAL ANALYSIS:")
        for col in categorical_cols[:5]:  # Limit to the first 5 columns
            value_counts = df[col].value_counts()
            summary.append(f"\n{col}:")
            for val, count in value_counts.head(10).items():
                pct = (count / len(df)) * 100
                summary.append(f"  • {val}: {count:,} ({pct:.1f}%)")

    # Time series analysis (name-based heuristic: 'date' or 'time' in the column name)
    date_cols = [c for c in df.columns if 'date' in c.lower() or 'time' in c.lower()]
    if date_cols:
        summary.append("\n📅 TIME SERIES ANALYSIS:")
        date_col = date_cols[0]
        # Unparseable values become NaT instead of raising
        df[date_col] = pd.to_datetime(df[date_col], errors='coerce')

        date_range = df[date_col].max() - df[date_col].min()
        summary.append(f"Date range: {df[date_col].min()} to {df[date_col].max()}")
        summary.append(f"Span: {date_range.days} days")

        # Create time-series plots for up to three numeric columns
        if numeric_cols:
            n_plots = min(3, len(numeric_cols))
            fig, axes = plt.subplots(n_plots, 1, figsize=(12, 4 * n_plots))
            if n_plots == 1:
                axes = [axes]  # a 1x1 grid returns a bare Axes, not an array

            for idx, num_col in enumerate(numeric_cols[:3]):
                ax = axes[idx]
                # Aggregate per date; only the mean is plotted
                daily_data = df.groupby(date_col)[num_col].agg(['mean', 'sum', 'count'])
                daily_data['mean'].plot(ax=ax, label='Average', linewidth=2)
                ax.set_title(f'{num_col} Over Time')
                ax.set_xlabel('Date')
                ax.set_ylabel(num_col)
                ax.legend()
                ax.grid(True, alpha=0.3)

            plt.tight_layout()
            plt.savefig('time_series_analysis.png', dpi=150)
            plt.close()
            charts_created.append('time_series_analysis.png')

    # Distribution plots for up to four numeric columns
    if numeric_cols:
        fig, axes = plt.subplots(2, 2, figsize=(12, 10))
        axes = axes.flatten()

        for idx, col in enumerate(numeric_cols[:4]):
            axes[idx].hist(df[col].dropna(), bins=30, edgecolor='black', alpha=0.7)
            axes[idx].set_title(f'Distribution of {col}')
            axes[idx].set_xlabel(col)
            axes[idx].set_ylabel('Frequency')
            axes[idx].grid(True, alpha=0.3)

        # Hide unused subplots
        for idx in range(len(numeric_cols[:4]), 4):
            axes[idx].set_visible(False)

        plt.tight_layout()
        plt.savefig('distributions.png', dpi=150)
        plt.close()
        charts_created.append('distributions.png')

    # Categorical distributions
    if categorical_cols:
        fig, axes = plt.subplots(2, 2, figsize=(14, 10))
        axes = axes.flatten()

        for idx, col in enumerate(categorical_cols[:4]):
            value_counts = df[col].value_counts().head(10)
            axes[idx].barh(range(len(value_counts)), value_counts.values)
            axes[idx].set_yticks(range(len(value_counts)))
            axes[idx].set_yticklabels(value_counts.index)
            axes[idx].set_title(f'Top Values in {col}')
            axes[idx].set_xlabel('Count')
            axes[idx].grid(True, alpha=0.3, axis='x')

        # Hide unused subplots
        for idx in range(len(categorical_cols[:4]), 4):
            axes[idx].set_visible(False)

        plt.tight_layout()
        plt.savefig('categorical_distributions.png', dpi=150)
        plt.close()
        charts_created.append('categorical_distributions.png')

    # Summary of visualizations
    if charts_created:
        summary.append("\n📊 VISUALIZATIONS CREATED:")
        for chart in charts_created:
            summary.append(f"  ✓ {chart}")

    summary.append("\n" + "=" * 60)
    summary.append("✅ COMPREHENSIVE ANALYSIS COMPLETE")
    summary.append("=" * 60)

    return "\n".join(summary)


if __name__ == "__main__":
    # Analyze the file given on the command line, or fall back to the sample data
    import sys
    if len(sys.argv) > 1:
        file_path = sys.argv[1]
    else:
        file_path = "resources/sample.csv"

    print(summarize_csv(file_path))

4
skills/csv-data-summarizer/requirements.txt
Normal file
@@ -0,0 +1,4 @@
pandas>=2.0.0
matplotlib>=3.7.0
seaborn>=0.12.0