11 KiB
Structured Data Handling Reference
This document provides detailed information about converting structured data formats (CSV, JSON, XML) to Markdown.
CSV Files
Convert CSV (Comma-Separated Values) files to Markdown tables.
Basic CSV Conversion
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("data.csv")
print(result.text_content)
CSV to Markdown Table
CSV files are automatically converted to Markdown table format:
Input CSV (data.csv):
Name,Age,City
Alice,30,New York
Bob,25,Los Angeles
Charlie,35,Chicago
Output Markdown:
| Name | Age | City |
|---------|-----|-------------|
| Alice | 30 | New York |
| Bob | 25 | Los Angeles |
| Charlie | 35 | Chicago |
CSV Conversion Features
What's preserved:
- All column headers
- All data rows
- Cell values (text and numbers)
- Column structure
Formatting:
- Headers are bolded (Markdown table format)
- Columns are aligned
- Empty cells are preserved
- Special characters are escaped
Large CSV Files
For large CSV files:
from markitdown import MarkItDown
md = MarkItDown()
# Convert large CSV
result = md.convert("large_dataset.csv")
# Save to file instead of printing
with open("output.md", "w") as f:
f.write(result.text_content)
Performance considerations:
- Very large files may take time to process
- Consider previewing first few rows for testing
- Memory usage scales with file size
- Very wide tables may not display well in all Markdown viewers
CSV with Special Characters
CSV files containing special characters are handled automatically:
from markitdown import MarkItDown
md = MarkItDown()
# Handles UTF-8, special characters, quotes, etc.
result = md.convert("international_data.csv")
CSV Delimiters
Standard CSV delimiters are supported:
- Comma (
,) - standard - Semicolon (
;) - common in European formats - Tab (
\t) - TSV files
Command-Line CSV Conversion
# Basic conversion
markitdown data.csv -o data.md
# Multiple CSV files
for file in *.csv; do
markitdown "$file" -o "${file%.csv}.md"
done
JSON Files
Convert JSON data to readable Markdown format.
Basic JSON Conversion
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("data.json")
print(result.text_content)
JSON Formatting
JSON is converted to a readable, structured Markdown format:
Input JSON (config.json):
{
"name": "MyApp",
"version": "1.0.0",
"dependencies": {
"library1": "^2.0.0",
"library2": "^3.1.0"
},
"features": ["auth", "api", "database"]
}
Output Markdown:
## Configuration
**name:** MyApp
**version:** 1.0.0
### dependencies
- **library1:** ^2.0.0
- **library2:** ^3.1.0
### features
- auth
- api
- database
JSON Array Handling
JSON arrays are converted to lists or tables:
Array of objects:
[
{"id": 1, "name": "Alice", "active": true},
{"id": 2, "name": "Bob", "active": false}
]
Converted to table:
| id | name | active |
|----|-------|--------|
| 1 | Alice | true |
| 2 | Bob | false |
Nested JSON Structures
Nested JSON is converted with appropriate indentation and hierarchy:
from markitdown import MarkItDown
md = MarkItDown()
# Handles deeply nested structures
result = md.convert("complex_config.json")
print(result.text_content)
JSON Lines (JSONL)
For JSON Lines format (one JSON object per line):
from markitdown import MarkItDown
import json
md = MarkItDown()
# Read JSONL file
with open("data.jsonl", "r") as f:
for line in f:
obj = json.loads(line)
# Convert to JSON temporarily
with open("temp.json", "w") as temp:
json.dump(obj, temp)
result = md.convert("temp.json")
print(result.text_content)
print("\n---\n")
Large JSON Files
For large JSON files:
from markitdown import MarkItDown
md = MarkItDown()
# Convert large JSON
result = md.convert("large_data.json")
# Save to file
with open("output.md", "w") as f:
f.write(result.text_content)
XML Files
Convert XML documents to structured Markdown.
Basic XML Conversion
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("data.xml")
print(result.text_content)
XML Structure Preservation
XML is converted to Markdown maintaining hierarchical structure:
Input XML (book.xml):
<?xml version="1.0"?>
<book>
<title>Example Book</title>
<author>John Doe</author>
<chapters>
<chapter id="1">
<title>Introduction</title>
<content>Chapter 1 content...</content>
</chapter>
<chapter id="2">
<title>Background</title>
<content>Chapter 2 content...</content>
</chapter>
</chapters>
</book>
Output Markdown:
# book
## title
Example Book
## author
John Doe
## chapters
### chapter (id: 1)
#### title
Introduction
#### content
Chapter 1 content...
### chapter (id: 2)
#### title
Background
#### content
Chapter 2 content...
XML Attributes
XML attributes are preserved in the conversion:
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("data.xml")
# Attributes shown as (attr: value) in headings
XML Namespaces
XML namespaces are handled:
from markitdown import MarkItDown
md = MarkItDown()
# Handles xmlns and namespaced elements
result = md.convert("namespaced.xml")
XML Use Cases
Configuration files:
- Convert XML configs to readable format
- Document system configurations
- Compare configuration files
Data interchange:
- Convert XML APIs responses
- Process XML data feeds
- Transform between formats
Document processing:
- Convert DocBook to Markdown
- Process SVG descriptions
- Extract structured data
Structured Data Workflows
CSV Data Analysis Pipeline
from markitdown import MarkItDown
import pandas as pd
md = MarkItDown()
# Read CSV for analysis
df = pd.read_csv("data.csv")
# Do analysis
summary = df.describe()
# Convert both to Markdown
original = md.convert("data.csv")
# Save summary as CSV then convert
summary.to_csv("summary.csv")
summary_md = md.convert("summary.csv")
print("## Original Data\n")
print(original.text_content)
print("\n## Statistical Summary\n")
print(summary_md.text_content)
JSON API Documentation
from markitdown import MarkItDown
import requests
import json
md = MarkItDown()
# Fetch JSON from API
response = requests.get("https://api.example.com/data")
data = response.json()
# Save as JSON
with open("api_response.json", "w") as f:
json.dump(data, f, indent=2)
# Convert to Markdown
result = md.convert("api_response.json")
# Create documentation
doc = f"""# API Response Documentation
## Endpoint
GET https://api.example.com/data
## Response
{result.text_content}
"""
with open("api_docs.md", "w") as f:
f.write(doc)
XML to Markdown Documentation
from markitdown import MarkItDown
md = MarkItDown()
# Convert XML documentation
xml_files = ["config.xml", "schema.xml", "data.xml"]
for xml_file in xml_files:
result = md.convert(xml_file)
output_name = xml_file.replace('.xml', '.md')
with open(f"docs/{output_name}", "w") as f:
f.write(result.text_content)
Multi-Format Data Processing
from markitdown import MarkItDown
import os
md = MarkItDown()
def convert_structured_data(directory):
"""Convert all structured data files in directory."""
extensions = {'.csv', '.json', '.xml'}
for filename in os.listdir(directory):
ext = os.path.splitext(filename)[1]
if ext in extensions:
input_path = os.path.join(directory, filename)
result = md.convert(input_path)
# Save Markdown
output_name = filename.replace(ext, '.md')
output_path = os.path.join("markdown", output_name)
with open(output_path, 'w') as f:
f.write(result.text_content)
print(f"Converted: {filename} → {output_name}")
# Process all structured data
convert_structured_data("data")
CSV to JSON to Markdown
import pandas as pd
from markitdown import MarkItDown
import json
md = MarkItDown()
# Read CSV
df = pd.read_csv("data.csv")
# Convert to JSON
json_data = df.to_dict(orient='records')
with open("temp.json", "w") as f:
json.dump(json_data, f, indent=2)
# Convert JSON to Markdown
result = md.convert("temp.json")
print(result.text_content)
Database Export to Markdown
from markitdown import MarkItDown
import sqlite3
import csv
md = MarkItDown()
# Export database query to CSV
conn = sqlite3.connect("database.db")
cursor = conn.execute("SELECT * FROM users")
with open("users.csv", "w", newline='') as f:
writer = csv.writer(f)
writer.writerow([description[0] for description in cursor.description])
writer.writerows(cursor.fetchall())
# Convert to Markdown
result = md.convert("users.csv")
print(result.text_content)
Error Handling
CSV Errors
from markitdown import MarkItDown
md = MarkItDown()
try:
result = md.convert("data.csv")
print(result.text_content)
except FileNotFoundError:
print("CSV file not found")
except Exception as e:
print(f"CSV conversion error: {e}")
# Common issues: encoding problems, malformed CSV, delimiter issues
JSON Errors
from markitdown import MarkItDown
md = MarkItDown()
try:
result = md.convert("data.json")
print(result.text_content)
except Exception as e:
print(f"JSON conversion error: {e}")
# Common issues: invalid JSON syntax, encoding issues
XML Errors
from markitdown import MarkItDown
md = MarkItDown()
try:
result = md.convert("data.xml")
print(result.text_content)
except Exception as e:
print(f"XML conversion error: {e}")
# Common issues: malformed XML, encoding problems, namespace issues
Best Practices
CSV Processing
- Check delimiter before conversion
- Verify encoding (UTF-8 recommended)
- Handle large files with streaming if needed
- Preview output for very wide tables
JSON Processing
- Validate JSON before conversion
- Consider pretty-printing complex structures
- Handle circular references appropriately
- Be aware of large array performance
XML Processing
- Validate XML structure first
- Handle namespaces consistently
- Consider XPath for selective extraction
- Be mindful of very deep nesting
Data Quality
- Clean data before conversion when possible
- Handle missing values appropriately
- Verify special character handling
- Test with representative samples
Performance
- Process large files in batches
- Use streaming for very large datasets
- Monitor memory usage
- Cache converted results when appropriate