Initial commit

2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions
--- a/skills/markitdown/references/structured_data.md
+++ b/skills/markitdown/references/structured_data.md
@@ -0,0 +1,575 @@
+# Structured Data Handling Reference
+
+This document provides detailed information about converting structured data formats (CSV, JSON, XML) to Markdown.
+
+## CSV Files
+
+Convert CSV (Comma-Separated Values) files to Markdown tables.
+
+### Basic CSV Conversion
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+result = md.convert("data.csv")
+print(result.text_content)
+```
+
+### CSV to Markdown Table
+
+CSV files are automatically converted to Markdown table format:
+
+**Input CSV (`data.csv`):**
+```csv
+Name,Age,City
+Alice,30,New York
+Bob,25,Los Angeles
+Charlie,35,Chicago
+```
+
+**Output Markdown:**
+```markdown
+| Name    | Age | City        |
+|---------|-----|-------------|
+| Alice   | 30  | New York    |
+| Bob     | 25  | Los Angeles |
+| Charlie | 35  | Chicago     |
+```
+
+### CSV Conversion Features
+
+**What's preserved:**
+- All column headers
+- All data rows
+- Cell values (text and numbers)
+- Column structure
+
+**Formatting:**
+- Headers are bolded (Markdown table format)
+- Columns are aligned
+- Empty cells are preserved
+- Special characters are escaped
+
+### Large CSV Files
+
+For large CSV files:
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+
+# Convert large CSV
+result = md.convert("large_dataset.csv")
+
+# Save to file instead of printing
+with open("output.md", "w") as f:
+    f.write(result.text_content)
+```
+
+**Performance considerations:**
+- Very large files may take time to process
+- Consider previewing first few rows for testing
+- Memory usage scales with file size
+- Very wide tables may not display well in all Markdown viewers
+
+### CSV with Special Characters
+
+CSV files containing special characters are handled automatically:
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+
+# Handles UTF-8, special characters, quotes, etc.
+result = md.convert("international_data.csv")
+```
+
+### CSV Delimiters
+
+Standard CSV delimiters are supported:
+- Comma (`,`) - standard
+- Semicolon (`;`) - common in European formats
+- Tab (`\t`) - TSV files
+
+### Command-Line CSV Conversion
+
+```bash
+# Basic conversion
+markitdown data.csv -o data.md
+
+# Multiple CSV files
+for file in *.csv; do
+    markitdown "$file" -o "${file%.csv}.md"
+done
+```
+
+## JSON Files
+
+Convert JSON data to readable Markdown format.
+
+### Basic JSON Conversion
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+result = md.convert("data.json")
+print(result.text_content)
+```
+
+### JSON Formatting
+
+JSON is converted to a readable, structured Markdown format:
+
+**Input JSON (`config.json`):**
+```json
+{
+  "name": "MyApp",
+  "version": "1.0.0",
+  "dependencies": {
+    "library1": "^2.0.0",
+    "library2": "^3.1.0"
+  },
+  "features": ["auth", "api", "database"]
+}
+```
+
+**Output Markdown:**
+```markdown
+## Configuration
+
+**name:** MyApp
+**version:** 1.0.0
+
+### dependencies
+- **library1:** ^2.0.0
+- **library2:** ^3.1.0
+
+### features
+- auth
+- api
+- database
+```
+
+### JSON Array Handling
+
+JSON arrays are converted to lists or tables:
+
+**Array of objects:**
+```json
+[
+  {"id": 1, "name": "Alice", "active": true},
+  {"id": 2, "name": "Bob", "active": false}
+]
+```
+
+**Converted to table:**
+```markdown
+| id | name  | active |
+|----|-------|--------|
+| 1  | Alice | true   |
+| 2  | Bob   | false  |
+```
+
+### Nested JSON Structures
+
+Nested JSON is converted with appropriate indentation and hierarchy:
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+
+# Handles deeply nested structures
+result = md.convert("complex_config.json")
+print(result.text_content)
+```
+
+### JSON Lines (JSONL)
+
+For JSON Lines format (one JSON object per line):
+
+```python
+from markitdown import MarkItDown
+import json
+
+md = MarkItDown()
+
+# Read JSONL file
+with open("data.jsonl", "r") as f:
+    for line in f:
+        obj = json.loads(line)
+
+        # Convert to JSON temporarily
+        with open("temp.json", "w") as temp:
+            json.dump(obj, temp)
+
+        result = md.convert("temp.json")
+        print(result.text_content)
+        print("\n---\n")
+```
+
+### Large JSON Files
+
+For large JSON files:
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+
+# Convert large JSON
+result = md.convert("large_data.json")
+
+# Save to file
+with open("output.md", "w") as f:
+    f.write(result.text_content)
+```
+
+## XML Files
+
+Convert XML documents to structured Markdown.
+
+### Basic XML Conversion
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+result = md.convert("data.xml")
+print(result.text_content)
+```
+
+### XML Structure Preservation
+
+XML is converted to Markdown maintaining hierarchical structure:
+
+**Input XML (`book.xml`):**
+```xml
+<?xml version="1.0"?>
+<book>
+  <title>Example Book</title>
+  <author>John Doe</author>
+  <chapters>
+    <chapter id="1">
+      <title>Introduction</title>
+      <content>Chapter 1 content...</content>
+    </chapter>
+    <chapter id="2">
+      <title>Background</title>
+      <content>Chapter 2 content...</content>
+    </chapter>
+  </chapters>
+</book>
+```
+
+**Output Markdown:**
+```markdown
+# book
+
+## title
+Example Book
+
+## author
+John Doe
+
+## chapters
+
+### chapter (id: 1)
+#### title
+Introduction
+
+#### content
+Chapter 1 content...
+
+### chapter (id: 2)
+#### title
+Background
+
+#### content
+Chapter 2 content...
+```
+
+### XML Attributes
+
+XML attributes are preserved in the conversion:
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+result = md.convert("data.xml")
+# Attributes shown as (attr: value) in headings
+```
+
+### XML Namespaces
+
+XML namespaces are handled:
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+
+# Handles xmlns and namespaced elements
+result = md.convert("namespaced.xml")
+```
+
+### XML Use Cases
+
+**Configuration files:**
+- Convert XML configs to readable format
+- Document system configurations
+- Compare configuration files
+
+**Data interchange:**
+- Convert XML APIs responses
+- Process XML data feeds
+- Transform between formats
+
+**Document processing:**
+- Convert DocBook to Markdown
+- Process SVG descriptions
+- Extract structured data
+
+## Structured Data Workflows
+
+### CSV Data Analysis Pipeline
+
+```python
+from markitdown import MarkItDown
+import pandas as pd
+
+md = MarkItDown()
+
+# Read CSV for analysis
+df = pd.read_csv("data.csv")
+
+# Do analysis
+summary = df.describe()
+
+# Convert both to Markdown
+original = md.convert("data.csv")
+
+# Save summary as CSV then convert
+summary.to_csv("summary.csv")
+summary_md = md.convert("summary.csv")
+
+print("## Original Data\n")
+print(original.text_content)
+print("\n## Statistical Summary\n")
+print(summary_md.text_content)
+```
+
+### JSON API Documentation
+
+```python
+from markitdown import MarkItDown
+import requests
+import json
+
+md = MarkItDown()
+
+# Fetch JSON from API
+response = requests.get("https://api.example.com/data")
+data = response.json()
+
+# Save as JSON
+with open("api_response.json", "w") as f:
+    json.dump(data, f, indent=2)
+
+# Convert to Markdown
+result = md.convert("api_response.json")
+
+# Create documentation
+doc = f"""# API Response Documentation
+
+## Endpoint
+GET https://api.example.com/data
+
+## Response
+{result.text_content}
+"""
+
+with open("api_docs.md", "w") as f:
+    f.write(doc)
+```
+
+### XML to Markdown Documentation
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+
+# Convert XML documentation
+xml_files = ["config.xml", "schema.xml", "data.xml"]
+
+for xml_file in xml_files:
+    result = md.convert(xml_file)
+
+    output_name = xml_file.replace('.xml', '.md')
+    with open(f"docs/{output_name}", "w") as f:
+        f.write(result.text_content)
+```
+
+### Multi-Format Data Processing
+
+```python
+from markitdown import MarkItDown
+import os
+
+md = MarkItDown()
+
+def convert_structured_data(directory):
+    """Convert all structured data files in directory."""
+    extensions = {'.csv', '.json', '.xml'}
+
+    for filename in os.listdir(directory):
+        ext = os.path.splitext(filename)[1]
+
+        if ext in extensions:
+            input_path = os.path.join(directory, filename)
+            result = md.convert(input_path)
+
+            # Save Markdown
+            output_name = filename.replace(ext, '.md')
+            output_path = os.path.join("markdown", output_name)
+
+            with open(output_path, 'w') as f:
+                f.write(result.text_content)
+
+            print(f"Converted: {filename} → {output_name}")
+
+# Process all structured data
+convert_structured_data("data")
+```
+
+### CSV to JSON to Markdown
+
+```python
+import pandas as pd
+from markitdown import MarkItDown
+import json
+
+md = MarkItDown()
+
+# Read CSV
+df = pd.read_csv("data.csv")
+
+# Convert to JSON
+json_data = df.to_dict(orient='records')
+with open("temp.json", "w") as f:
+    json.dump(json_data, f, indent=2)
+
+# Convert JSON to Markdown
+result = md.convert("temp.json")
+print(result.text_content)
+```
+
+### Database Export to Markdown
+
+```python
+from markitdown import MarkItDown
+import sqlite3
+import csv
+
+md = MarkItDown()
+
+# Export database query to CSV
+conn = sqlite3.connect("database.db")
+cursor = conn.execute("SELECT * FROM users")
+
+with open("users.csv", "w", newline='') as f:
+    writer = csv.writer(f)
+    writer.writerow([description[0] for description in cursor.description])
+    writer.writerows(cursor.fetchall())
+
+# Convert to Markdown
+result = md.convert("users.csv")
+print(result.text_content)
+```
+
+## Error Handling
+
+### CSV Errors
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+
+try:
+    result = md.convert("data.csv")
+    print(result.text_content)
+except FileNotFoundError:
+    print("CSV file not found")
+except Exception as e:
+    print(f"CSV conversion error: {e}")
+    # Common issues: encoding problems, malformed CSV, delimiter issues
+```
+
+### JSON Errors
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+
+try:
+    result = md.convert("data.json")
+    print(result.text_content)
+except Exception as e:
+    print(f"JSON conversion error: {e}")
+    # Common issues: invalid JSON syntax, encoding issues
+```
+
+### XML Errors
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+
+try:
+    result = md.convert("data.xml")
+    print(result.text_content)
+except Exception as e:
+    print(f"XML conversion error: {e}")
+    # Common issues: malformed XML, encoding problems, namespace issues
+```
+
+## Best Practices
+
+### CSV Processing
+- Check delimiter before conversion
+- Verify encoding (UTF-8 recommended)
+- Handle large files with streaming if needed
+- Preview output for very wide tables
+
+### JSON Processing
+- Validate JSON before conversion
+- Consider pretty-printing complex structures
+- Handle circular references appropriately
+- Be aware of large array performance
+
+### XML Processing
+- Validate XML structure first
+- Handle namespaces consistently
+- Consider XPath for selective extraction
+- Be mindful of very deep nesting
+
+### Data Quality
+- Clean data before conversion when possible
+- Handle missing values appropriately
+- Verify special character handling
+- Test with representative samples
+
+### Performance
+- Process large files in batches
+- Use streaming for very large datasets
+- Monitor memory usage
+- Cache converted results when appropriate