2025-11-29 18:18:23 +08:00

Document Conversion Examples

Examples for converting PDF, Word, PowerPoint, and Confluence-export documents to Markdown with the markitdown CLI.

Basic Document Conversions

PDF to Markdown

# Simple PDF conversion
markitdown "document.pdf" > output.md

# WSL path example
markitdown "/mnt/c/Users/username/Documents/report.pdf" > report.md

# With explicit output file (-o instead of shell redirection)
markitdown "slides.pdf" -o "slides.md"

Word Documents to Markdown

# Modern Word document (.docx)
markitdown "document.docx" > output.md

# Legacy Word document (.doc); if this fails, convert to .docx first
# (see Confluence Export Handling)
markitdown "legacy-doc.doc" > output.md

# Write the output to a separate directory (it must already exist)
markitdown "/path/to/docs/file.docx" > "/path/to/output/file.md"

PowerPoint to Markdown

# Convert presentation
markitdown "presentation.pptx" > slides.md

# WSL path
markitdown "/mnt/c/Users/username/Desktop/slides.pptx" > slides.md

Windows/WSL Path Conversion

Basic Path Conversion Rules

# Windows path
C:\Users\username\Documents\file.doc

# WSL equivalent
/mnt/c/Users/username/Documents/file.doc

Conversion Examples

# Drive letter becomes /mnt/<letter>; backslashes become forward slashes
C:\folder\file.txt
→ /mnt/c/folder/file.txt

# Path with spaces (must use quotes)
C:\Users\John Doe\Documents\report.pdf
→ "/mnt/c/Users/John Doe/Documents/report.pdf"

# OneDrive path
C:\Users\username\OneDrive\Documents\file.doc
→ "/mnt/c/Users/username/OneDrive/Documents/file.doc"

# Different drive letters
D:\Projects\document.docx
→ /mnt/d/Projects/document.docx
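
The rules above are mechanical enough to script. The following is a minimal sketch, not part of markitdown: `to_wsl_path` is a hypothetical helper name, and it assumes drives are mounted under /mnt/<letter>, which is the WSL default. Inside WSL, the built-in `wslpath` utility does the same job and is the better choice when it is available.

```shell
# Hypothetical helper: convert a Windows drive path (C:\dir\file) to /mnt/c/dir/file.
# Assumes the default WSL mount point /mnt/<drive letter>.
to_wsl_path() {
  local drive rest
  case "$1" in
    [A-Za-z]:*) ;;   # must start with a drive letter and a colon
    *) printf 'not a drive-letter path: %s\n' "$1" >&2; return 1 ;;
  esac
  # First character is the drive letter, lowercased for /mnt/<drive>
  drive=$(printf '%s' "$1" | cut -c1 | tr 'A-Z' 'a-z')
  # Drop the "X:" prefix and turn every backslash into a forward slash
  rest=$(printf '%s' "$1" | cut -c3- | tr '\\' '/')
  printf '/mnt/%s%s\n' "$drive" "$rest"
}
```

Pass Windows paths in single quotes so the shell leaves the backslashes alone, and quote the result when using it: `markitdown "$(to_wsl_path 'C:\Users\John Doe\Documents\report.pdf')" > report.md`.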

Using convert_path.py Helper

# Automatic conversion (single quotes keep the backslashes literal)
python scripts/convert_path.py 'C:\Users\username\Downloads\document.doc'
# Output: /mnt/c/Users/username/Downloads/document.doc

# Use in conversion command
wsl_path=$(python scripts/convert_path.py 'C:\Users\username\file.docx')
markitdown "$wsl_path" > output.md

Batch Conversions

Convert Multiple Files

# Convert all PDFs in a directory
for pdf in /path/to/pdfs/*.pdf; do
  filename=$(basename "$pdf" .pdf)
  markitdown "$pdf" > "/path/to/output/${filename}.md"
done

# Convert all Word documents
for doc in /path/to/docs/*.docx; do
  filename=$(basename "$doc" .docx)
  markitdown "$doc" > "/path/to/output/${filename}.md"
done
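
The two loops above differ only in the extension, so they can be folded into one function. This is a sketch under assumptions: `convert_dir` and its three arguments are hypothetical names, and it skips the unmatched-glob case that would otherwise pass a literal `*.pdf` to markitdown.

```shell
# Hypothetical helper: convert every file with the given extension in src_dir
# and write <name>.md files into out_dir.
convert_dir() {
  local src_dir="$1" out_dir="$2" ext="$3"
  local file name
  mkdir -p "$out_dir"
  for file in "$src_dir"/*."$ext"; do
    # When nothing matches, the glob stays literal; skip anything that is not a file
    [ -f "$file" ] || continue
    name=$(basename "$file" ".$ext")
    markitdown "$file" > "$out_dir/$name.md" || printf 'Failed: %s\n' "$file" >&2
  done
}
```

Usage: `convert_dir /path/to/pdfs /path/to/output pdf`, then `convert_dir /path/to/docs /path/to/output docx`.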

Batch Conversion with Path Conversion

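# Windows batch (PowerShell)
# Let markitdown write the file inside WSL with -o; a PowerShell ">" redirect
# would treat /mnt/c/... as a Windows path
Get-ChildItem "C:\Documents\*.pdf" | ForEach-Object {
  $wslPath = "/mnt/c/Documents/$($_.Name)"
  $outFile = "/mnt/c/Output/$($_.BaseName).md"
  wsl markitdown $wslPath -o $outFile
}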

Confluence Export Handling

Simple Confluence Export

# Direct conversion for exports without special characters
markitdown "confluence-export.doc" > output.md

Export with Special Characters

For Confluence exports containing special characters:

  1. Save the .doc file to an accessible location

  2. Try direct conversion first:

    markitdown "confluence-export.doc" > output.md
    
  3. If special characters cause issues:

    • Open in Word and save as .docx
    • Or use LibreOffice to convert: libreoffice --headless --convert-to docx export.doc
    • Then convert the .docx file
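
Steps 2 and 3 above can be sketched as a single function. `convert_doc_with_fallback` is a hypothetical name, and the sketch assumes `libreoffice` is on the PATH; when it is not, the function simply reports failure.

```shell
# Hypothetical helper: try direct conversion, then fall back to a .docx round trip.
convert_doc_with_fallback() {
  local src="$1" out="$2" work docx
  # Step 2: try direct conversion first; an empty result counts as failure
  if markitdown "$src" > "$out" 2>/dev/null && [ -s "$out" ]; then
    return 0
  fi
  # Step 3: convert to .docx in a scratch directory, then convert that file
  work=$(mktemp -d) || return 1
  libreoffice --headless --convert-to docx --outdir "$work" "$src" >/dev/null 2>&1
  docx="$work/$(basename "${src%.*}").docx"
  if [ -f "$docx" ] && markitdown "$docx" > "$out"; then
    rm -rf "$work"
    return 0
  fi
  # Both attempts failed: clean up the scratch directory and the partial output
  rm -rf "$work"
  rm -f "$out"
  return 1
}
```

Usage: `convert_doc_with_fallback "confluence-export.doc" output.md || echo "Conversion failed"`.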

Handling Encoding Issues

# Check the encoding of the converted output
file -i output.md

# Convert if needed (using iconv); write to a new file, never onto the input
iconv -f ISO-8859-1 -t UTF-8 output.md > output-utf8.md

Advanced Conversion Scenarios

Preserving Directory Structure

# Mirror directory structure
src_dir="/mnt/c/Users/username/Documents"
out_dir="/path/to/output"

find "$src_dir" -type f -name "*.docx" | while IFS= read -r file; do
  # Get relative path (quote the prefix so it is matched literally)
  rel_path="${file#"$src_dir"/}"
  out_file="$out_dir/${rel_path%.docx}.md"

  # Create output directory
  mkdir -p "$(dirname "$out_file")"

  # Convert
  markitdown "$file" > "$out_file"
done

Conversion with Metadata

# Add frontmatter to converted file
file="document.pdf"
{
  echo "---"
  echo "title: $(basename "$file" .pdf)"
  echo "converted: $(date -I)"
  echo "source: $file"
  echo "---"
  echo ""
  markitdown "$file"
} > output.md
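
Titles and paths that contain a colon, a double quote, or a backslash can break unquoted YAML frontmatter. A minimal sketch of a quoting helper follows; `yaml_quote` is a hypothetical name, and it covers only single-line values.

```shell
# Hypothetical helper: emit a value as a double-quoted YAML scalar.
# Handles single-line values only.
yaml_quote() {
  # Escape backslashes first, then double quotes, and wrap the result in quotes
  printf '"%s"\n' "$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')"
}
```

Usage inside the frontmatter block: `echo "title: $(yaml_quote "$(basename "$file" .pdf)")"`.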

Error Recovery

Handling Failed Conversions

# Check if markitdown succeeded
if markitdown "document.pdf" > output.md 2> error.log; then
  echo "Conversion successful"
else
  echo "Conversion failed, check error.log"
fi
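
A failed run with plain redirection still leaves an empty or partial output.md behind. One way around that, sketched here under assumptions, is to write to a temporary file and move it into place only on success. `convert_atomic` and its optional log-file argument are hypothetical names.

```shell
# Hypothetical helper: the output file appears only if conversion succeeded.
convert_atomic() {
  local src="$1" out="$2" log="${3:-error.log}" tmp
  # Create the temp file next to the target so the final mv stays on one filesystem
  tmp=$(mktemp "${out}.XXXXXX") || return 1
  if markitdown "$src" > "$tmp" 2>> "$log" && [ -s "$tmp" ]; then
    mv "$tmp" "$out"
  else
    rm -f "$tmp"
    return 1
  fi
}
```

Usage: `convert_atomic document.pdf output.md || echo "Conversion failed, check error.log"`.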

Retry Logic

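# Retry failed conversions: rerun any PDF without a non-empty output file
for file in *.pdf; do
  output="${file%.pdf}.md"
  if [ ! -s "$output" ]; then
    echo "Converting $file..."
    if ! markitdown "$file" > "$output"; then
      rm -f "$output"   # drop partial output so the next run retries this file
      echo "Failed: $file" >> failed.txt
    fi
  fi
done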
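
The failed.txt list written by the loop above can drive a later retry pass. This is a sketch under assumptions: `retry_failed` is a hypothetical name, and it expects one `Failed: <path>` entry per line, as the loop writes them.

```shell
# Hypothetical helper: retry every entry in failed.txt and keep only the ones
# that still fail.
retry_failed() {
  local list="${1:-failed.txt}" remaining file out
  [ -f "$list" ] || return 0
  remaining=$(mktemp) || return 1
  while IFS= read -r file; do
    file="${file#Failed: }"   # strip the prefix written by the retry loop
    [ -n "$file" ] || continue
    out="${file%.*}.md"
    # </dev/null keeps the converter from consuming the list on stdin
    if ! markitdown "$file" > "$out" < /dev/null; then
      rm -f "$out"
      printf 'Failed: %s\n' "$file" >> "$remaining"
    fi
  done < "$list"
  mv "$remaining" "$list"
}
```

After `retry_failed failed.txt`, an empty failed.txt means every earlier failure has now converted.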

Quality Verification

Check Conversion Quality

# Check output length (a very short file may mean text extraction failed)
wc -l output.md

# Check for common issues
grep -nE "TODO|ERROR|MISSING" output.md

# Preview first/last lines
head -n 20 output.md
tail -n 20 output.md

Validate Output

# Check for empty files
if [ ! -s output.md ]; then
  echo "Warning: Output file is empty"
fi

# Verify markdown syntax
# Use a markdown linter if available
markdownlint output.md
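
For a batch, the empty-file check above can be run across a whole directory. A minimal sketch, with `report_missing` as a hypothetical name: it prints every source file whose .md output is missing or empty.

```shell
# Hypothetical helper: list source files that have no non-empty .md counterpart.
report_missing() {
  local src_dir="$1" out_dir="$2" ext="$3"
  local file name
  for file in "$src_dir"/*."$ext"; do
    [ -f "$file" ] || continue   # skip the literal pattern when nothing matches
    name=$(basename "$file" ".$ext")
    # -s is true only for files that exist and are larger than zero bytes
    [ -s "$out_dir/$name.md" ] || printf '%s\n' "$file"
  done
}
```

Usage: `report_missing /path/to/pdfs /path/to/output pdf > unconverted.txt`.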

Best Practices

1. Path Handling

  • Always quote paths with spaces
  • Verify paths exist before conversion
  • Use absolute paths for scripts

2. Batch Processing

  • Log conversions for audit trail
  • Handle errors gracefully
  • Preserve original files

3. Output Organization

  • Mirror source directory structure
  • Use consistent naming conventions
  • Separate by document type or date

4. Quality Assurance

  • Spot-check random conversions
  • Validate critical documents manually
  • Keep conversion logs

5. Performance

  • Use parallel processing for large batches
  • Skip already converted files
  • Clean up temporary files
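
The first two performance points can be combined in one sketch. `convert_parallel` is a hypothetical name, and the sketch assumes `find` and `xargs` support `-print0`, `-0`, and `-P`, as the GNU and BSD versions do. Each PDF is converted next to its source, and files that already have a non-empty .md are skipped.

```shell
# Hypothetical helper: convert all PDFs under a directory, several at a time.
convert_parallel() {
  local dir="$1" jobs="${2:-4}"
  # NUL-delimited names survive spaces and newlines in paths
  find "$dir" -type f -name '*.pdf' -print0 |
    xargs -0 -n 1 -P "$jobs" sh -c '
      [ -n "$1" ] || exit 0                 # nothing to do on empty input
      out="${1%.*}.md"
      [ -s "$out" ] && exit 0               # skip files converted earlier
      markitdown "$1" > "$out" || { rm -f "$out"; exit 1; }
    ' _
}
```

Usage: `convert_parallel /path/to/pdfs 8`. A non-zero exit status means at least one file failed; rerunning the same command retries only those files.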

Common Patterns

Pattern: Convert and Review

#!/bin/bash
file="$1"
output="${file%.*}.md"

# Convert, and open the result for review only if conversion succeeded
if markitdown "$file" > "$output"; then
  ${EDITOR:-vim} "$output"
fi

Pattern: Safe Conversion

#!/bin/bash
file="$1"
backup="${file}.backup"
output="${file%.*}.md"

# Backup original
cp "$file" "$backup"

# Convert with error handling
if markitdown "$file" > "$output" 2> conversion.log; then
  echo "Success: $output"
  rm "$backup"
else
  echo "Failed: Check conversion.log"
  rm -f "$output"   # remove the partial output
  mv "$backup" "$file"
fi

Pattern: Metadata Preservation

#!/bin/bash
# Extract and preserve document metadata

file="$1"
output="${file%.*}.md"

# Get file metadata
created=$(stat -c %w "$file" 2>/dev/null || stat -f %SB "$file")
modified=$(stat -c %y "$file" 2>/dev/null || stat -f %Sm "$file")

# Convert with metadata
{
  echo "---"
  echo "original_file: $(basename "$file")"
  echo "created: $created"
  echo "modified: $modified"
  echo "converted: $(date -I)"
  echo "---"
  echo ""
  markitdown "$file"
} > "$output"