zhongwei/gh-k-dense-ai-claude-scientific-writer

Zhongwei Li 1dd5bee3b4 Initial commit

2025-11-30 08:30:14 +08:00

File Format Support

This document provides detailed information about each file format supported by MarkItDown.

Document Formats

PDF (.pdf)

Capabilities:

Text extraction
Table detection
Metadata extraction
OCR for scanned documents (with dependencies)

Dependencies:

pip install 'markitdown[pdf]'

Best For:

Scientific papers
Reports
Books
Forms

Limitations:

Formatting of complex layouts may not be preserved exactly
Scanned PDFs require OCR setup
Some PDF features (annotations, forms) may not convert

Example:

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("research_paper.pdf")
print(result.text_content)
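
A common next step is to write the converted text to disk. The helper below is a minimal sketch of that step: `save_markdown` and its parameters are hypothetical names, not part of MarkItDown. It relies only on the `convert()` method and the `text_content` attribute shown above, so any object with that shape can be passed in.

```python
from pathlib import Path


def save_markdown(converter, source, out_dir="."):
    """Convert `source` and write the result to `<out_dir>/<stem>.md`.

    `converter` is assumed to expose convert(path) and to return an object
    with a `text_content` attribute, as MarkItDown does in the example above.
    """
    source = Path(source)
    out_path = Path(out_dir) / (source.stem + ".md")
    # Create the output directory if it does not exist yet.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    result = converter.convert(str(source))
    out_path.write_text(result.text_content, encoding="utf-8")
    return out_path
```

With a real converter, `save_markdown(MarkItDown(), "research_paper.pdf", "converted")` would be expected to write `converted/research_paper.md`.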

Enhanced with Azure Document Intelligence:

md = MarkItDown(docintel_endpoint="https://YOUR-ENDPOINT.cognitiveservices.azure.com/")
result = md.convert("complex_layout.pdf")

Microsoft Word (.docx)

Capabilities:

Text extraction
Table conversion
Heading hierarchy
List formatting
Basic text formatting (bold, italic)

Dependencies:

pip install 'markitdown[docx]'

Best For:

Research papers
Reports
Documentation
Manuscripts

Preserved Elements:

Headings (converted to Markdown headers)
Tables (converted to Markdown tables)
Lists (bulleted and numbered)
Basic formatting (bold, italic)
Paragraphs

Example:

result = md.convert("manuscript.docx")

PowerPoint (.pptx)

Capabilities:

Slide content extraction
Speaker notes
Table extraction
Image descriptions (with AI)

Dependencies:

pip install 'markitdown[pptx]'

Best For:

Presentations
Lecture slides
Conference talks

Output Format:

# Slide 1: Title

Content from slide 1...

**Notes**: Speaker notes appear here

---

# Slide 2: Next Topic

...

With AI Image Descriptions:

from openai import OpenAI

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("presentation.pptx")

Excel (.xlsx, .xls)

Capabilities:

Sheet extraction
Table formatting
Data preservation
Formula values (calculated)

Dependencies:

pip install 'markitdown[xlsx]'  # Modern Excel
pip install 'markitdown[xls]'   # Legacy Excel

Best For:

Data tables
Research data
Statistical results
Experimental data

Output Format:

# Sheet: Results

| Sample | Control | Treatment | P-value |
|--------|---------|-----------|---------|
| 1      | 10.2    | 12.5      | 0.023   |
| 2      | 9.8     | 11.9      | 0.031   |

Example:

result = md.convert("experimental_data.xlsx")

Image Formats

Images (.jpg, .jpeg, .png, .gif, .webp)

Capabilities:

EXIF metadata extraction
OCR text extraction
AI-powered image descriptions

Dependencies:

pip install 'markitdown[all]'  # Includes image support

Best For:

Scanned documents
Charts and graphs
Scientific diagrams
Photographs with text

Output Without AI:

![Image](image.jpg)

**EXIF Data**:
- Camera: Canon EOS 5D
- Date: 2024-01-15
- Resolution: 4000x3000

Output With AI:

from openai import OpenAI

client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this scientific diagram in detail"
)
result = md.convert("graph.png")

OCR for Text Extraction: OCR requires Tesseract to be installed:

# macOS
brew install tesseract

# Ubuntu
sudo apt-get install tesseract-ocr

Audio Formats

Audio (.wav, .mp3)

Capabilities:

Metadata extraction
Speech-to-text transcription
Duration and technical info

Dependencies:

pip install 'markitdown[audio-transcription]'

Best For:

Lecture recordings
Interviews
Podcasts
Meeting recordings

Output Format:

# Audio: interview.mp3

**Metadata**:
- Duration: 45:32
- Bitrate: 320kbps
- Sample Rate: 44100Hz

**Transcription**:
[Transcribed text appears here...]

Example:

result = md.convert("lecture.mp3")

Web Formats

HTML (.html, .htm)

Capabilities:

Clean HTML to Markdown conversion
Link preservation
Table conversion
List formatting

Best For:

Web pages
Documentation
Blog posts
Online articles

Output Format: Clean Markdown with preserved links and structure

Example:

result = md.convert("webpage.html")

YouTube URLs

Capabilities:

Fetch video transcriptions
Extract video metadata
Caption download

Dependencies:

pip install 'markitdown[youtube-transcription]'

Best For:

Educational videos
Lectures
Talks
Tutorials

Example:

result = md.convert("https://www.youtube.com/watch?v=VIDEO_ID")

Data Formats

CSV (.csv)

Capabilities:

Automatic table conversion
Delimiter detection
Header preservation

Output Format: Markdown tables

Example:

result = md.convert("data.csv")

Output:

| Column1 | Column2 | Column3 |
|---------|---------|---------|
| Value1  | Value2  | Value3  |
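
To make the idea of delimiter detection and header preservation concrete, here is a small standard-library sketch. It is not MarkItDown's implementation; `csv_to_markdown` is a hypothetical helper that uses `csv.Sniffer` to guess the delimiter and treats the first row as the table header.

```python
import csv
import io


def csv_to_markdown(text):
    """Render CSV text as a Markdown table; the first row becomes the header."""
    if not text.strip():
        return ""
    try:
        # Guess the delimiter from a short list of common candidates.
        dialect = csv.Sniffer().sniff(text, delimiters=",;\t|")
    except csv.Error:
        dialect = csv.excel  # fall back to plain comma-separated
    rows = list(csv.reader(io.StringIO(text), dialect))
    header, body = rows[0], rows[1:]

    def render(cells):
        # Escape pipes so cell text cannot break the table layout.
        return "| " + " | ".join(c.replace("|", "\\|") for c in cells) + " |"

    lines = [render(header), render(["---"] * len(header))]
    for row in body:
        # Pad or trim each row to the header width.
        lines.append(render((row + [""] * len(header))[: len(header)]))
    return "\n".join(lines)
```

The sketch emits a compact separator row (`| --- |`) rather than the padded columns shown above; both render as the same table.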

JSON (.json)

Capabilities:

Structured representation
Pretty formatting
Nested data visualization

Best For:

API responses
Configuration files
Data exports

Example:

result = md.convert("data.json")

XML (.xml)

Capabilities:

Structure preservation
Attribute extraction
Formatted output

Best For:

Configuration files
Data interchange
Structured documents

Example:

result = md.convert("config.xml")

Archive Formats

ZIP (.zip)

Capabilities:

Iterates through archive contents
Converts each file individually
Maintains directory structure in output

Best For:

Document collections
Project archives
Batch conversions

Output Format:

# Archive: documents.zip

## File: document1.pdf
[Content from document1.pdf...]

---

## File: document2.docx
[Content from document2.docx...]

Example:

result = md.convert("archive.zip")
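
The iterate-and-convert behavior described above can be sketched with the standard `zipfile` module. This is an illustration of the output layout only, not MarkItDown's own code; `zip_to_markdown` and the `convert_member` callback are hypothetical names.

```python
import io
import zipfile


def zip_to_markdown(data, archive_name, convert_member):
    """Build one Markdown document from the files inside a ZIP archive.

    `data` is the raw bytes of the archive. `convert_member(name, payload)`
    is a hypothetical callback returning Markdown for one archived file.
    """
    sections = []
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no content
            body = convert_member(info.filename, archive.read(info))
            # info.filename keeps the path inside the archive.
            sections.append(f"## File: {info.filename}\n{body}")
    return f"# Archive: {archive_name}\n\n" + "\n\n---\n\n".join(sections)
```

Passing the conversion step in as a callback keeps the sketch independent of any particular converter.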

E-book Formats

EPUB (.epub)

Capabilities:

Full text extraction
Chapter structure
Metadata extraction

Best For:

E-books
Digital publications
Long-form content

Output Format: Markdown with preserved chapter structure

Example:

result = md.convert("book.epub")

Other Formats

Outlook Messages (.msg)

Capabilities:

Email content extraction
Attachment listing
Metadata (from, to, subject, date)

Dependencies:

pip install 'markitdown[outlook]'

Best For:

Email archives
Communication records

Example:

result = md.convert("message.msg")

Format-Specific Tips

PDF Best Practices

Use Azure Document Intelligence for complex layouts:

md = MarkItDown(docintel_endpoint="endpoint_url")

For scanned PDFs, ensure OCR is set up:
```
brew install tesseract  # macOS
```
Split very large PDFs before conversion for better performance

PowerPoint Best Practices

Use AI for visual content:

# `client` is an OpenAI client, created as in the PowerPoint section above
md = MarkItDown(llm_client=client, llm_model="gpt-4o")

Check speaker notes - they're included in output
Complex animations won't be captured - static content only

Excel Best Practices

Large spreadsheets may take time to convert
Formulas are converted to their calculated values
Multiple sheets are all included in output
Charts become text descriptions (use AI for better descriptions)

Image Best Practices

Use AI for meaningful descriptions:

# `client` is an OpenAI client, created as in the Images section above
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this scientific figure in detail"
)

For text-heavy images, ensure OCR dependencies are installed
High-resolution images may take longer to process

Audio Best Practices

Clear audio produces better transcriptions
Long recordings may take significant time
Consider splitting long audio files for faster processing
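
For WAV input, splitting can be done with Python's standard `wave` module. The sketch below is one possible approach under that assumption; `split_wav` and its parameters are hypothetical names, and MP3 files would need an external tool instead.

```python
import wave


def split_wav(path, chunk_seconds, out_prefix):
    """Split a WAV file into consecutive chunks of at most `chunk_seconds`."""
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    out_paths = []
    with wave.open(path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = int(src.getframerate() * chunk_seconds)
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break  # end of input reached
            out_path = f"{out_prefix}_{index:03d}.wav"
            with wave.open(out_path, "wb") as dst:
                # Copy the format; the frame count is corrected on close.
                dst.setparams(params)
                dst.writeframes(frames)
            out_paths.append(out_path)
            index += 1
    return out_paths
```

Each returned path can then be passed to `md.convert()` separately, and the transcriptions joined afterwards.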

Unsupported Formats

If you need to convert an unsupported format:

Create a custom converter (see api_reference.md)
Search GitHub for the hashtag #markitdown-plugin to find plugins
Pre-convert to supported format (e.g., convert .rtf to .docx)
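
Before a batch run, it can help to separate files whose extensions appear in this document from those that need one of the workarounds above. The sketch below does that; `LISTED_EXTENSIONS` is copied from the section headings of this document, and `partition_by_support` is a hypothetical helper name.

```python
from pathlib import Path

# Extensions taken from the section headings of this document.
LISTED_EXTENSIONS = {
    ".pdf", ".docx", ".pptx", ".xlsx", ".xls",
    ".jpg", ".jpeg", ".png", ".gif", ".webp",
    ".wav", ".mp3", ".html", ".htm",
    ".csv", ".json", ".xml", ".zip", ".epub", ".msg",
}


def partition_by_support(paths):
    """Split paths into (listed, unlisted) by case-insensitive extension."""
    listed, unlisted = [], []
    for path in paths:
        if Path(path).suffix.lower() in LISTED_EXTENSIONS:
            listed.append(path)
        else:
            unlisted.append(path)
    return listed, unlisted
```

For example, a `.rtf` file would land in the second list and could be pre-converted to `.docx` first.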

Format Detection

MarkItDown automatically detects format from:

File extension (primary method)
MIME type (fallback)
File signature (magic bytes, fallback)

Override detection:

# Force specific format
result = md.convert("file_without_extension", file_extension=".pdf")

# With streams
with open("file", "rb") as f:
    result = md.convert_stream(f, file_extension=".pdf")
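
The detection order listed above (extension, then MIME type, then magic bytes) can be illustrated with a short standard-library sketch. It is not MarkItDown's detection code; `guess_extension`, its parameters, and the small `SIGNATURES` table are assumptions made for the example.

```python
import mimetypes
from pathlib import Path

# A few well-known file signatures; illustrative, not exhaustive.
SIGNATURES = {
    b"%PDF-": ".pdf",
    b"\x89PNG\r\n\x1a\n": ".png",
    b"PK\x03\x04": ".zip",  # also the container for .docx, .pptx, .xlsx, .epub
}


def guess_extension(path, mime_type=None, head=b""):
    """Guess an extension: file extension first, then MIME type, then magic bytes."""
    suffix = Path(path).suffix.lower()
    if suffix:
        return suffix
    if mime_type:
        guessed = mimetypes.guess_extension(mime_type)
        if guessed:
            return guessed
    for magic, ext in SIGNATURES.items():
        if head.startswith(magic):
            return ext
    return None  # nothing matched; the caller decides what to do
```

A value guessed this way could be supplied through the `file_extension` argument shown in the override example.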
