Document Conversion Examples
Examples for converting PDF, Word, PowerPoint, and Confluence documents to Markdown with markitdown.
Basic Document Conversions
PDF to Markdown
# Simple PDF conversion
markitdown "document.pdf" > output.md
# WSL path example
markitdown "/mnt/c/Users/username/Documents/report.pdf" > report.md
# With explicit output
markitdown "slides.pdf" > "slides.md"
Word Documents to Markdown
# Modern Word document (.docx)
markitdown "document.docx" > output.md
# Legacy Word document (.doc)
markitdown "legacy-doc.doc" > output.md
# Preserve directory structure
markitdown "/path/to/docs/file.docx" > "/path/to/output/file.md"
PowerPoint to Markdown
# Convert presentation
markitdown "presentation.pptx" > slides.md
# WSL path
markitdown "/mnt/c/Users/username/Desktop/slides.pptx" > slides.md
Windows/WSL Path Conversion
Basic Path Conversion Rules
# Windows path
C:\Users\username\Documents\file.doc
# WSL equivalent
/mnt/c/Users/username/Documents/file.doc
Conversion Examples
# Single backslash to forward slash
C:\folder\file.txt
→ /mnt/c/folder/file.txt
# Path with spaces (must use quotes)
C:\Users\John Doe\Documents\report.pdf
→ "/mnt/c/Users/John Doe/Documents/report.pdf"
# OneDrive path
C:\Users\username\OneDrive\Documents\file.doc
→ "/mnt/c/Users/username/OneDrive/Documents/file.doc"
# Different drive letters
D:\Projects\document.docx
→ /mnt/d/Projects/document.docx
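The rules above (lowercase the drive letter, prefix it with /mnt/, and turn backslashes into forward slashes) can be sketched as a small shell function. The name to_wsl_path is hypothetical and is not part of markitdown or of the convert_path.py helper; treat it as a minimal illustration only.

```shell
# Hypothetical helper (not part of markitdown): map a Windows path such as
# C:\dir\file to its /mnt/<drive>/dir/file form under WSL.
to_wsl_path() {
    case $1 in
        [A-Za-z]:\\*) ;;                      # looks like a drive-letter path
        *) printf '%s\n' "$1"; return 0 ;;    # anything else is passed through unchanged
    esac
    # First character is the drive letter; lowercase it.
    drive=$(printf '%s' "$1" | cut -c1 | tr '[:upper:]' '[:lower:]')
    # Everything after "C:" with backslashes turned into forward slashes.
    rest=$(printf '%s' "$1" | cut -c3- | tr '\\' '/')
    printf '/mnt/%s%s\n' "$drive" "$rest"
}

# Usage (single quotes keep the backslashes literal):
#   markitdown "$(to_wsl_path 'C:\Users\John Doe\Documents\report.pdf')" > report.md
```

Inside WSL itself, the bundled wslpath utility (wslpath -u 'C:\...') performs the same translation and also respects non-default mount points, so it is usually the better choice when available.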
Using convert_path.py Helper
# Automatic conversion
python scripts/convert_path.py "C:\Users\username\Downloads\document.doc"
# Output: /mnt/c/Users/username/Downloads/document.doc
# Use in conversion command
wsl_path=$(python scripts/convert_path.py "C:\Users\username\file.docx")
markitdown "$wsl_path" > output.md
Batch Conversions
Convert Multiple Files
# Convert all PDFs in a directory
for pdf in /path/to/pdfs/*.pdf; do
    filename=$(basename "$pdf" .pdf)
    markitdown "$pdf" > "/path/to/output/${filename}.md"
done
# Convert all Word documents
for doc in /path/to/docs/*.docx; do
    filename=$(basename "$doc" .docx)
    markitdown "$doc" > "/path/to/output/${filename}.md"
done
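The two loops above differ only in directory and extension, so they could be folded into one function. The sketch below is one way to do that; convert_dir and the CONVERTER variable are names invented for this example (CONVERTER defaults to markitdown and exists only so the loop can be dry-run with another command).

```shell
# Hypothetical wrapper: convert every *.<ext> file in <src> into <out>/<name>.md.
# CONVERTER is an assumption of this sketch; it defaults to markitdown.
convert_dir() {
    src=$1 ext=$2 out=$3
    failed=0
    mkdir -p "$out" || return 1
    for f in "$src"/*."$ext"; do
        [ -e "$f" ] || continue               # the glob matched nothing
        name=$(basename "$f" ".$ext")
        if ! "${CONVERTER:-markitdown}" "$f" > "$out/$name.md"; then
            echo "Failed: $f" >&2
            failed=1
        fi
    done
    return "$failed"                          # non-zero if any file failed
}

# Usage:
#   convert_dir /path/to/pdfs pdf  /path/to/output
#   convert_dir /path/to/docs docx /path/to/output
```

The [ -e "$f" ] guard matters when the directory has no matching files: without it the loop would run once with the unexpanded pattern as its file name.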
Batch Conversion with Path Conversion
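# Windows batch (PowerShell)
# The input path is read inside WSL, but the output is written by PowerShell,
# so the output file needs a Windows path
Get-ChildItem "C:\Documents\*.pdf" | ForEach-Object {
    $wslPath = "/mnt/c/Documents/$($_.Name)"
    $outFile = "C:\Output\$($_.BaseName).md"
    wsl markitdown $wslPath | Out-File -Encoding utf8 $outFile
}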
Confluence Export Handling
Simple Confluence Export
# Direct conversion for exports without special characters
markitdown "confluence-export.doc" > output.md
Export with Special Characters
For Confluence exports containing special characters:
- Save the .doc file to an accessible location
- Try direct conversion first:
  markitdown "confluence-export.doc" > output.md
- If special characters cause issues:
  - Open in Word and save as .docx
  - Or use LibreOffice to convert:
    libreoffice --headless --convert-to docx export.doc
  - Then convert the .docx file
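The fallback steps above could be scripted roughly as follows. This is a sketch under assumptions: the function name convert_doc_with_fallback is made up for illustration, and it assumes that a failed direct conversion shows up as a non-zero exit status or an empty output file, which may not hold for every problematic export.

```shell
# Hypothetical helper: try markitdown on the .doc directly, then fall back to
# a LibreOffice .doc -> .docx conversion in a scratch directory.
convert_doc_with_fallback() {
    doc=$1 out=$2
    # Step 1: direct conversion.
    if markitdown "$doc" > "$out" 2>/dev/null && [ -s "$out" ]; then
        return 0
    fi
    # Step 2: let LibreOffice write <name>.docx into a temporary directory.
    tmpdir=$(mktemp -d) || return 1
    libreoffice --headless --convert-to docx --outdir "$tmpdir" "$doc" >/dev/null 2>&1
    docx="$tmpdir/$(basename "${doc%.*}").docx"
    # Step 3: convert the intermediate .docx, then clean up.
    status=1
    if [ -f "$docx" ] && markitdown "$docx" > "$out"; then
        status=0
    fi
    rm -rf "$tmpdir"
    return "$status"
}

# Usage:
#   convert_doc_with_fallback confluence-export.doc output.md
```

Writing the intermediate .docx to a scratch directory keeps the original export untouched and leaves nothing behind next to it.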
Handling Encoding Issues
# Check the encoding of the converted Markdown (the .doc itself is binary)
file -i "output.md"
# Convert if needed (using iconv)
iconv -f ISO-8859-1 -t UTF-8 output.md > output-utf8.md
Advanced Conversion Scenarios
Preserving Directory Structure
# Mirror directory structure
src_dir="/mnt/c/Users/username/Documents"
out_dir="/path/to/output"
find "$src_dir" -name "*.docx" | while IFS= read -r file; do
    # Get path relative to src_dir
    rel_path="${file#"$src_dir"/}"
    out_file="$out_dir/${rel_path%.docx}.md"
    # Create output directory
    mkdir -p "$(dirname "$out_file")"
    # Convert
    markitdown "$file" > "$out_file"
done
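The path arithmetic in that loop is the part most likely to go wrong, so it can help to isolate it in a function that can be checked without running any conversion. mirror_path is a hypothetical name introduced here for illustration.

```shell
# Hypothetical helper: given a source root, an output root, a file under the
# source root, and its extension, print the mirrored .md path.
mirror_path() {
    src=${1%/} out=${2%/} file=$3 ext=$4     # tolerate trailing slashes on the roots
    rel=${file#"$src"/}                       # path relative to the source root
    printf '%s/%s.md\n' "$out" "${rel%."$ext"}"
}

# Usage inside the loop:
#   out_file=$(mirror_path "$src_dir" "$out_dir" "$file" docx)
#   mkdir -p "$(dirname "$out_file")"
```

Quoting "$src" inside the expansion makes the prefix match literally, so directory names containing characters such as * or [ are not treated as patterns.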
Conversion with Metadata
# Add frontmatter to converted file
file="document.pdf"
{
    echo "---"
    echo "title: $(basename "$file" .pdf)"
    echo "converted: $(date -I)"
    echo "source: $file"
    echo "---"
    echo ""
    markitdown "$file"
} > output.md
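File names that contain a colon or a double quote would produce invalid YAML if echoed into the frontmatter unquoted. A minimal sketch of a safer header follows; yaml_quote and frontmatter are hypothetical names, and the field set simply mirrors the example above.

```shell
# Hypothetical helpers: emit YAML frontmatter with double-quoted string values.
yaml_quote() {
    # Escape backslashes and double quotes, then wrap the value in double quotes.
    printf '"%s"' "$(printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g')"
}

frontmatter() {
    file=$1
    name=$(basename "$file")
    printf '%s\n' "---"
    printf 'title: %s\n' "$(yaml_quote "${name%.*}")"
    printf 'source: %s\n' "$(yaml_quote "$file")"
    printf 'converted: %s\n' "$(date +%Y-%m-%d)"
    printf '%s\n' "---"
    printf '\n'
}

# Usage:
#   { frontmatter "$file"; markitdown "$file"; } > output.md
```

date +%Y-%m-%d prints the same YYYY-MM-DD form as date -I and also works where the -I option is unavailable.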
Error Recovery
Handling Failed Conversions
# Check if markitdown succeeded
if markitdown "document.pdf" > output.md 2> error.log; then
    echo "Conversion successful"
else
    echo "Conversion failed, check error.log"
fi
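One caveat with the check above: the shell creates output.md before markitdown runs, so a failed conversion still leaves an empty or partial file behind. A common remedy is to write to a temporary file and rename it only on success. The sketch below assumes hypothetical names (convert_atomic, CONVERTER, ERROR_LOG) that are not part of markitdown.

```shell
# Hypothetical helper: convert <in> to <out>, leaving no output file on failure.
# CONVERTER (default: markitdown) and ERROR_LOG (default: error.log) are
# assumptions of this sketch.
convert_atomic() {
    in=$1 out=$2
    tmp=$(mktemp "${out}.XXXXXX") || return 1
    if "${CONVERTER:-markitdown}" "$in" > "$tmp" 2>> "${ERROR_LOG:-error.log}" \
        && [ -s "$tmp" ]; then
        mv "$tmp" "$out"                      # publish only a complete, non-empty result
    else
        rm -f "$tmp"
        return 1
    fi
}

# Usage:
#   convert_atomic document.pdf output.md || echo "Conversion failed, check error.log"
```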
Retry Logic
# Retry failed conversions
# A failed run still leaves an empty output file, so test for non-empty (-s), not existence (-f)
for file in *.pdf; do
    output="${file%.pdf}.md"
    if ! [ -s "$output" ]; then
        echo "Converting $file..."
        markitdown "$file" > "$output" || echo "Failed: $file" >> failed.txt
    fi
done
Quality Verification
Check Conversion Quality
# Count lines in the output (a very low count suggests a failed or partial conversion)
wc -l output.md
# Check for common issues
grep "TODO\|ERROR\|MISSING" output.md
# Preview first/last lines
head -n 20 output.md
tail -n 20 output.md
Validate Output
# Check for empty files
if [ ! -s output.md ]; then
    echo "Warning: Output file is empty"
fi
# Verify markdown syntax
# Use a markdown linter if available
markdownlint output.md
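For batch runs, the empty-file check can be wrapped in a function and extended with a minimum word count, which catches conversions that produced only a few stray characters. check_markdown and its default threshold are assumptions of this sketch, not a markitdown feature.

```shell
# Hypothetical helper: fail if a converted file is missing, empty, or shorter
# than <min_words> words (default 1).
check_markdown() {
    f=$1 min_words=${2:-1}
    if [ ! -s "$f" ]; then
        echo "Warning: $f is missing or empty" >&2
        return 1
    fi
    words=$(( $(wc -w < "$f") ))              # arithmetic expansion strips padding from wc
    if [ "$words" -lt "$min_words" ]; then
        echo "Warning: $f has only $words words" >&2
        return 1
    fi
    return 0
}

# Usage:
#   for md in /path/to/output/*.md; do check_markdown "$md" 50; done
```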
Best Practices
1. Path Handling
- Always quote paths with spaces
- Verify paths exist before conversion
- Use absolute paths for scripts
2. Batch Processing
- Log conversions for audit trail
- Handle errors gracefully
- Preserve original files
3. Output Organization
- Mirror source directory structure
- Use consistent naming conventions
- Separate by document type or date
4. Quality Assurance
- Spot-check random conversions
- Validate critical documents manually
- Keep conversion logs
5. Performance
- Use parallel processing for large batches
- Skip already converted files
- Clean up temporary files
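Two of the performance points above, parallel processing and skipping files that are already converted, can be combined with find and xargs. This is a sketch that assumes an xargs supporting -0 and -P (GNU and BSD both do); convert_parallel and the exported CONVERTER variable are hypothetical names used only here.

```shell
# Hypothetical helper: convert all PDFs under <src> with up to <jobs> parallel
# workers, writing each .md next to its source and skipping existing results.
# CONVERTER (default: markitdown) must be exported if it is overridden.
convert_parallel() {
    src=$1 jobs=${2:-4}
    find "$src" -type f -name '*.pdf' -print0 |
        xargs -0 -n 1 -P "$jobs" sh -c '
            out="${1%.pdf}.md"
            [ -s "$out" ] && exit 0           # already converted: skip
            "${CONVERTER:-markitdown}" "$1" > "$out" || {
                rm -f "$out"                  # do not leave a partial file behind
                echo "Failed: $1" >&2
            }
        ' _
}

# Usage:
#   convert_parallel /path/to/pdfs 8
```

Because failed outputs are removed, re-running the same command retries only the files that did not convert.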
Common Patterns
Pattern: Convert and Review
#!/bin/bash
file="$1"
output="${file%.*}.md"
# Convert
markitdown "$file" > "$output"
# Open in editor for review
${EDITOR:-vim} "$output"
Pattern: Safe Conversion
#!/bin/bash
file="$1"
backup="${file}.backup"
output="${file%.*}.md"
# Backup original
cp "$file" "$backup"
# Convert with error handling
if markitdown "$file" > "$output" 2> conversion.log; then
    echo "Success: $output"
    rm "$backup"
else
    echo "Failed: Check conversion.log"
    mv "$backup" "$file"
fi
Pattern: Metadata Preservation
#!/bin/bash
# Extract and preserve document metadata
file="$1"
output="${file%.*}.md"
# Get file metadata
created=$(stat -c %w "$file" 2>/dev/null || stat -f %SB "$file")
modified=$(stat -c %y "$file" 2>/dev/null || stat -f %Sm "$file")
# Convert with metadata
# date +%Y-%m-%d matches date -I and also works where -I is unavailable
{
    echo "---"
    echo "original_file: $(basename "$file")"
    echo "created: $created"
    echo "modified: $modified"
    echo "converted: $(date +%Y-%m-%d)"
    echo "---"
    echo ""
    markitdown "$file"
} > "$output"