Initial commit
15  .claude-plugin/plugin.json  Normal file
@@ -0,0 +1,15 @@
{
  "name": "dataset-splitter",
  "description": "Split datasets for training, validation, and testing",
  "version": "1.0.0",
  "author": {
    "name": "Claude Code Plugins",
    "email": "[email protected]"
  },
  "skills": [
    "./skills"
  ],
  "commands": [
    "./commands"
  ]
}

3  README.md  Normal file
@@ -0,0 +1,3 @@
# dataset-splitter

Split datasets for training, validation, and testing

15  commands/split-data.md  Normal file
@@ -0,0 +1,15 @@
---
description: Split a dataset into training, validation, and test sets
---

# AI/ML Task Executor

You are an AI/ML specialist. When this command is invoked:

1. Analyze the current context and requirements
2. Generate appropriate code for the ML task
3. Include data validation and error handling
4. Provide performance metrics and insights
5. Save artifacts and generate documentation

Support modern ML frameworks and best practices.

73  plugin.lock.json  Normal file
@@ -0,0 +1,73 @@
{
  "$schema": "internal://schemas/plugin.lock.v1.json",
  "pluginId": "gh:jeremylongshore/claude-code-plugins-plus:plugins/ai-ml/dataset-splitter",
  "normalized": {
    "repo": null,
    "ref": "refs/tags/v20251128.0",
    "commit": "8015104daeb2630cd7dce2b92cc2a64fcb925afe",
    "treeHash": "81031f8eb7749e120d175ad70979e1eb7540dc837cab3aafb5d1420ca2984bdd",
    "generatedAt": "2025-11-28T10:18:22.470033Z",
    "toolVersion": "publish_plugins.py@0.2.0"
  },
  "origin": {
    "remote": "git@github.com:zhongweili/42plugin-data.git",
    "branch": "master",
    "commit": "aa1497ed0949fd50e99e70d6324a29c5b34f9390",
    "repoRoot": "/Users/zhongweili/projects/openmind/42plugin-data"
  },
  "manifest": {
    "name": "dataset-splitter",
    "description": "Split datasets for training, validation, and testing",
    "version": "1.0.0"
  },
  "content": {
    "files": [
      {
        "path": "README.md",
        "sha256": "f448a7297b2f7d3739720057d7a86b5f19c071b64b8c90340ece5bda333e7e4a"
      },
      {
        "path": ".claude-plugin/plugin.json",
        "sha256": "83c88f136b61b0b5f399e43eb8deab248ff6875f7a8805e3012dbede2d70a445"
      },
      {
        "path": "commands/split-data.md",
        "sha256": "043efb83e2f02fc6d0869c8a3a7388d6e49f6c809292b93dd6a97a1b142e5647"
      },
      {
        "path": "skills/dataset-splitter/SKILL.md",
        "sha256": "d564400cae0a3222ca04519790a6a55c5fdd86f92b01b839e922d735d1146875"
      },
      {
        "path": "skills/dataset-splitter/references/README.md",
        "sha256": "86464982a32e0f3af6f69664b4b435750080c084dcd9603b960dbf5c57250e9e"
      },
      {
        "path": "skills/dataset-splitter/scripts/README.md",
        "sha256": "bd631bc42cd5cb6489f6fdced163520924928e3578bdcd0afb356ee1f37473a1"
      },
      {
        "path": "skills/dataset-splitter/assets/example_dataset.csv",
        "sha256": "ad5ff3fb44a7afa4fa6622cc6a7e9820a6a7652bae8d1a927f34b8a86a8234ff"
      },
      {
        "path": "skills/dataset-splitter/assets/README.md",
        "sha256": "5b1722822e842c092cef9197ef3e5885df0c9ec5012dcc4dbc440e878bb13581"
      },
      {
        "path": "skills/dataset-splitter/assets/split_data_config.yaml",
        "sha256": "46efaaa82e9f99758f607257cdb9db3e1f40950431099bc636ca2b3805f88f59"
      },
      {
        "path": "skills/dataset-splitter/assets/dataset_schema.json",
        "sha256": "66f770dc684486b0187315b04222a98bcc0bff228dfc03c5cadb8f129bc42289"
      }
    ],
    "dirSha256": "81031f8eb7749e120d175ad70979e1eb7540dc837cab3aafb5d1420ca2984bdd"
  },
  "security": {
    "scannedAt": null,
    "scannerVersion": null,
    "flags": []
  }
}

52  skills/dataset-splitter/SKILL.md  Normal file
@@ -0,0 +1,52 @@
---
name: splitting-datasets
description: |
  This skill enables Claude to split datasets into training, validation, and testing sets. It is useful when preparing data for machine learning model development. Use this skill when the user requests to split a dataset, create train-test splits, or needs data partitioning for model training. The skill is triggered by terms like "split dataset," "train-test split," "validation set," or "data partitioning."
allowed-tools: Read, Write, Edit, Grep, Glob, Bash
version: 1.0.0
---

## Overview

This skill automates the process of dividing a dataset into subsets for training, validating, and testing machine learning models. It ensures proper data preparation and facilitates robust model evaluation.

## How It Works

1. **Analyze Request**: The skill analyzes the user's request to determine the dataset to be split and the desired proportions for each subset.
2. **Generate Code**: Based on the request, the skill generates Python code utilizing standard ML libraries to perform the data splitting (sketched below).
3. **Execute Splitting**: The code is executed to split the dataset into training, validation, and testing sets according to the specified ratios.
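
A minimal sketch of the kind of code step 2 generates, assuming pandas and scikit-learn are available (file names follow Example 1 below and are illustrative):

```python
# Sketch only: a 70/15/15 split done in two stages with
# scikit-learn's train_test_split.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("my_data.csv")

# Stage 1: split off the 15% test set.
train_val, test = train_test_split(df, test_size=0.15, random_state=42)

# Stage 2: carve the validation set out of the remaining 85%
# (0.15 / 0.85 of the remainder equals 15% of the original).
train, val = train_test_split(train_val, test_size=0.15 / 0.85, random_state=42)

train.to_csv("train.csv", index=False)
val.to_csv("validation.csv", index=False)
test.to_csv("test.csv", index=False)
```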

## When to Use This Skill

This skill activates when you need to:
- Prepare a dataset for machine learning model training.
- Create training, validation, and testing sets.
- Partition data to evaluate model performance.

## Examples

### Example 1: Splitting a CSV file

User request: "Split the data in 'my_data.csv' into 70% training, 15% validation, and 15% testing sets."

The skill will:
1. Generate Python code to read the 'my_data.csv' file.
2. Execute the code to split the data according to the specified proportions, creating 'train.csv', 'validation.csv', and 'test.csv' files.

### Example 2: Creating a Train-Test Split

User request: "Create a train-test split of 'large_dataset.csv' with an 80/20 ratio."

The skill will:
1. Generate Python code to load 'large_dataset.csv'.
2. Execute the code to split the dataset into 80% training and 20% testing sets, saving them as 'train.csv' and 'test.csv'.
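
In this two-way case the generated code reduces to a single call; a minimal sketch under the same assumptions (pandas and scikit-learn installed):

```python
# Sketch only: an 80/20 train-test split.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("large_dataset.csv")
train, test = train_test_split(df, test_size=0.2, random_state=42)

train.to_csv("train.csv", index=False)
test.to_csv("test.csv", index=False)
```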

## Best Practices

- **Data Integrity**: Verify that the splitting process maintains the integrity of the data, ensuring no data loss or corruption.
- **Stratification**: Consider stratification when splitting imbalanced datasets to maintain class distributions in each subset (see the sketch below).
- **Randomization**: Ensure the splitting process is randomized to avoid bias in the resulting datasets.
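
For the stratification point, a minimal sketch of a stratified split (assuming a label column named `target`, which is illustrative):

```python
# Sketch only: stratified 80/20 split that preserves the class
# distribution of `target` in both subsets.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("my_data.csv")
train, test = train_test_split(
    df,
    test_size=0.2,
    stratify=df["target"],  # keep class proportions identical in each split
    random_state=42,        # fixed seed for reproducibility
)
```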

## Integration

This skill can be integrated with other data processing and model training tools within the Claude Code ecosystem to create a complete machine learning workflow.

7  skills/dataset-splitter/assets/README.md  Normal file
@@ -0,0 +1,7 @@
# Assets

Bundled resources for the dataset-splitter skill:

- [ ] example_dataset.csv: A small example dataset for demonstration purposes.
- [ ] split_data_config.yaml: Example configuration file for specifying dataset splitting parameters.
- [ ] dataset_schema.json: Example JSON schema for dataset validation.

97  skills/dataset-splitter/assets/dataset_schema.json  Normal file
@@ -0,0 +1,97 @@
{
  "_comment": "This JSON schema defines the expected structure for a dataset to be split.",
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Dataset Schema",
  "description": "Schema for validating dataset structure before splitting.",
  "type": "object",
  "properties": {
    "dataset_name": {
      "type": "string",
      "description": "Name of the dataset",
      "example": "iris_dataset"
    },
    "file_path": {
      "type": "string",
      "description": "Path to the dataset file (e.g., CSV, JSON)",
      "example": "data/iris.csv"
    },
    "file_type": {
      "type": "string",
      "description": "Type of the dataset file",
      "enum": ["csv", "json", "parquet"],
      "example": "csv"
    },
    "separator": {
      "type": "string",
      "description": "Separator character for CSV files (e.g., ',', ';')",
      "default": ",",
      "example": ","
    },
    "target_column": {
      "type": "string",
      "description": "Name of the target/label column",
      "example": "species"
    },
    "features": {
      "type": "array",
      "description": "List of feature column names",
      "items": {
        "type": "string"
      },
      "example": ["sepal_length", "sepal_width", "petal_length", "petal_width"]
    },
    "split_ratios": {
      "type": "object",
      "description": "Ratios for splitting the dataset into training, validation, and testing sets.",
      "properties": {
        "train": {
          "type": "number",
          "description": "Ratio for the training set (e.g., 0.7 for 70%)",
          "minimum": 0,
          "maximum": 1,
          "example": 0.7
        },
        "validation": {
          "type": "number",
          "description": "Ratio for the validation set (e.g., 0.15 for 15%)",
          "minimum": 0,
          "maximum": 1,
          "example": 0.15
        },
        "test": {
          "type": "number",
          "description": "Ratio for the testing set (e.g., 0.15 for 15%)",
          "minimum": 0,
          "maximum": 1,
          "example": 0.15
        }
      },
      "required": ["train", "validation", "test"]
    },
    "random_state": {
      "type": "integer",
      "description": "Random seed for reproducibility",
      "default": 42,
      "example": 42
    },
    "stratify": {
      "type": "boolean",
      "description": "Whether to stratify the split based on the target column",
      "default": true,
      "example": true
    },
    "output_directory": {
      "type": "string",
      "description": "Directory to save the split datasets",
      "default": "split_data",
      "example": "split_data"
    },
    "file_name_prefix": {
      "type": "string",
      "description": "Prefix for the output file names",
      "default": "split",
      "example": "split"
    }
  },
  "required": ["dataset_name", "file_path", "file_type", "target_column", "features", "split_ratios"]
}
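
A dataset description that conforms to this schema can be checked before splitting. A minimal sketch using the third-party `jsonschema` package (an assumption; the plugin declares no such dependency):

```python
# Sketch only: validate a dataset description against dataset_schema.json.
# Assumes the `jsonschema` package is installed.
import json
from jsonschema import ValidationError, validate

with open("skills/dataset-splitter/assets/dataset_schema.json") as f:
    schema = json.load(f)

# Example instance built from the schema's own "example" annotations.
instance = {
    "dataset_name": "iris_dataset",
    "file_path": "data/iris.csv",
    "file_type": "csv",
    "target_column": "species",
    "features": ["sepal_length", "sepal_width", "petal_length", "petal_width"],
    "split_ratios": {"train": 0.7, "validation": 0.15, "test": 0.15},
}

try:
    validate(instance=instance, schema=schema)
    print("dataset description is valid")
except ValidationError as err:
    print(f"invalid dataset description: {err.message}")
```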

23  skills/dataset-splitter/assets/example_dataset.csv  Normal file
@@ -0,0 +1,23 @@
# example_dataset.csv
# This is a small example dataset to demonstrate the dataset-splitter plugin.
# It contains 10 data rows and 3 columns.
# The first row after these comments is the header row, defining the column names.
# You can replace this with your own dataset.
# To use this dataset with the plugin, save it as a .csv file and place it in the same directory as your plugin execution.

# Column descriptions:
# - feature1: [DESCRIPTION OF FEATURE 1]
# - feature2: [DESCRIPTION OF FEATURE 2]
# - target: [DESCRIPTION OF TARGET VARIABLE]

feature1,feature2,target
1.0,2.0,0
3.0,4.0,1
5.0,6.0,0
7.0,8.0,1
9.0,10.0,0
11.0,12.0,1
13.0,14.0,0
15.0,16.0,1
17.0,18.0,0
19.0,20.0,1

42  skills/dataset-splitter/assets/split_data_config.yaml  Normal file
@@ -0,0 +1,42 @@
# Configuration file for dataset splitting

# Input dataset parameters
input_dataset:
  path: "REPLACE_ME/path/to/your/dataset.csv"  # Path to the input dataset file
  format: "csv"  # Dataset format (e.g., csv, parquet, json)
  header: True  # Whether the dataset has a header row (True/False)
  separator: ","  # Separator used in the dataset (e.g., comma, tab, semicolon)

# Splitting ratios for training, validation, and testing sets
split_ratios:
  train: 0.7  # Fraction of data for training (0.0 - 1.0)
  validation: 0.15  # Fraction of data for validation (0.0 - 1.0)
  test: 0.15  # Fraction of data for testing (0.0 - 1.0)

# Random seed for reproducibility
random_seed: 42  # Integer value for the random seed

# Output directory for split datasets
output_directory: "REPLACE_ME/path/to/output/directory"  # Directory where the split datasets will be saved

# Naming convention for output files
output_names:
  train: "train.csv"  # Name of the training dataset file
  validation: "validation.csv"  # Name of the validation dataset file
  test: "test.csv"  # Name of the test dataset file

# Optional features (e.g., stratification)
features:
  stratify: False  # Whether to stratify the split based on a specific column (True/False)
  stratify_column: "YOUR_VALUE_HERE"  # Name of the column to use for stratification (if stratify is True)
  shuffle: True  # Whether to shuffle the data before splitting (True/False)

# Error handling configuration
error_handling:
  on_error: "warn"  # Action to take on error ("warn", "ignore", "raise") - default is warn
  log_errors: True  # Whether to log error messages to a file

# Logging configuration
logging:
  level: "INFO"  # Logging level (e.g., DEBUG, INFO, WARNING, ERROR, CRITICAL)
  file: "split_data.log"  # Log file path
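
A minimal sketch of a script that consumes this config, assuming PyYAML, pandas, and scikit-learn (the plugin does not yet bundle such a script; see scripts/README.md below):

```python
# Sketch only: two-stage train/validation/test split driven by
# split_data_config.yaml.
import pandas as pd
import yaml
from sklearn.model_selection import train_test_split

with open("split_data_config.yaml") as f:
    cfg = yaml.safe_load(f)

df = pd.read_csv(cfg["input_dataset"]["path"], sep=cfg["input_dataset"]["separator"])
ratios, seed = cfg["split_ratios"], cfg["random_seed"]

# Split off the test set first, then take the validation share
# from the remainder so the final fractions match the config.
train_val, test = train_test_split(df, test_size=ratios["test"], random_state=seed)
val_share = ratios["validation"] / (1 - ratios["test"])
train, val = train_test_split(train_val, test_size=val_share, random_state=seed)

for name, part in (("train", train), ("validation", val), ("test", test)):
    part.to_csv(f'{cfg["output_directory"]}/{cfg["output_names"][name]}', index=False)
```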

7  skills/dataset-splitter/references/README.md  Normal file
@@ -0,0 +1,7 @@
# References

Bundled resources for the dataset-splitter skill:

- [ ] dataset_splitting_best_practices.md: Document outlining best practices for dataset splitting, including considerations for stratified sampling, handling imbalanced datasets, and avoiding data leakage.
- [ ] sklearn_train_test_split_docs.md: Excerpt from the scikit-learn documentation on the train_test_split function, detailing parameters and usage.
- [ ] common_dataset_formats.md: Documentation on common dataset formats (CSV, Parquet, etc.) and how to handle them.

6  skills/dataset-splitter/scripts/README.md  Normal file
@@ -0,0 +1,6 @@
# Scripts

Bundled resources for the dataset-splitter skill:

- [ ] split_data.py: Script to perform the actual dataset splitting, taking parameters for split ratios and file paths.
- [ ] validate_data.py: Script to validate the resulting datasets (e.g., check for data leakage, ensure correct data types); a sketch follows below.
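
Neither script exists in this commit (both boxes are unchecked). As a sketch of what validate_data.py's leakage check might look like, assuming pandas and the default output file names from the config above:

```python
# Sketch only: flag rows that appear in more than one split.
import pandas as pd

splits = {name: pd.read_csv(f"{name}.csv") for name in ("train", "validation", "test")}

names = list(splits)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        # An inner merge on all shared columns finds rows present in both splits.
        overlap = pd.merge(splits[a], splits[b], how="inner")
        if not overlap.empty:
            print(f"possible leakage: {len(overlap)} rows shared between {a} and {b}")
```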