
Web Content Extraction Reference

This document provides detailed information about extracting content from HTML, YouTube, EPUB, and other web-based formats.

HTML Conversion

Convert HTML files and web pages to clean Markdown format.

Basic HTML Conversion

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("webpage.html")
print(result.text_content)

HTML Processing Features

What's preserved:

  • Headings (<h1> → #, <h2> → ##, etc.)
  • Paragraphs and text formatting
  • Links (<a> → [text](url))
  • Lists (ordered and unordered)
  • Tables → Markdown tables
  • Code blocks and inline code
  • Emphasis (bold, italic)

What's removed:

  • Scripts and styles
  • Navigation elements
  • Advertising content
  • Boilerplate markup
  • HTML comments
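
A quick round-trip shows these rules in action (a minimal sketch; the sample HTML and file name are arbitrary):

from markitdown import MarkItDown

md = MarkItDown()

# A small fragment exercising headings, links, emphasis, and script removal
html = """
<h1>Title</h1>
<p>See the <a href="https://example.com">docs</a> for <strong>details</strong>.</p>
<script>console.log('stripped');</script>
"""

with open("sample.html", "w", encoding="utf-8") as f:
    f.write(html)

result = md.convert("sample.html")
print(result.text_content)
# Expect a "# Title" heading, a [docs](https://example.com) link,
# **details** in bold, and no trace of the <script> block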

HTML from URLs

Convert web pages directly from URLs:

from markitdown import MarkItDown
import requests

md = MarkItDown()

# Fetch and convert web page
response = requests.get("https://example.com/article")
with open("temp.html", "wb") as f:
    f.write(response.content)

result = md.convert("temp.html")
print(result.text_content)
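
Recent MarkItDown versions also accept an http(s) URL directly, in which case the temporary file is unnecessary (version-dependent behavior; fall back to the temp-file pattern above if direct URL input fails in yours):

from markitdown import MarkItDown

md = MarkItDown()

# Fetching is handled internally; no temp file needed
result = md.convert("https://example.com/article")
print(result.text_content)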

Clean Web Article Extraction

For extracting main content from web articles:

from markitdown import MarkItDown
import requests
from readability import Document  # pip install readability-lxml

md = MarkItDown()

# Fetch page
url = "https://example.com/article"
response = requests.get(url)

# Extract main content
doc = Document(response.content)
html_content = doc.summary()

# Save and convert
with open("article.html", "w") as f:
    f.write(html_content)

result = md.convert("article.html")
print(result.text_content)
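
To avoid leaving stray HTML files behind, the same flow works with a proper temporary file (a sketch using only the standard library; convert_html_string is an illustrative helper name):

import os
import tempfile

from markitdown import MarkItDown

md = MarkItDown()

def convert_html_string(html):
    """Write HTML to a temporary file, convert it, then clean up."""
    tmp = tempfile.NamedTemporaryFile(
        mode="w", suffix=".html", encoding="utf-8", delete=False
    )
    try:
        tmp.write(html)
        tmp.close()  # close before converting so the file is readable on Windows
        return md.convert(tmp.name).text_content
    finally:
        os.unlink(tmp.name)

markdown = convert_html_string("<h1>Hello</h1><p>World</p>")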

HTML with Images

HTML files containing images can be enhanced with LLM descriptions:

from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("page_with_images.html")

YouTube Transcripts

Extract video transcripts from YouTube videos.

Basic YouTube Conversion

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("https://www.youtube.com/watch?v=VIDEO_ID")
print(result.text_content)

YouTube Installation

pip install 'markitdown[youtube-transcription]'

This installs the youtube-transcript-api dependency.

YouTube URL Formats

MarkItDown supports various YouTube URL formats:

  • https://www.youtube.com/watch?v=VIDEO_ID
  • https://youtu.be/VIDEO_ID
  • https://www.youtube.com/embed/VIDEO_ID
  • https://m.youtube.com/watch?v=VIDEO_ID

YouTube Transcript Features

What's included:

  • Full video transcript text
  • Timestamps (optional, depending on availability)
  • Video metadata (title, description)
  • Captions in available languages

Transcript languages:

from markitdown import MarkItDown

md = MarkItDown()

# With no options, markitdown returns the video's default caption
# track (commonly 'en' when English captions exist)
result = md.convert("https://youtube.com/watch?v=VIDEO_ID")
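
Some markitdown releases accept a youtube_transcript_languages keyword on convert() to express a language preference. This is version-dependent, so treat the following as an assumption and verify against your installed version:

from markitdown import MarkItDown

md = MarkItDown()

# youtube_transcript_languages is an assumed, version-dependent option;
# the list expresses preference order ('es' first, then 'en')
result = md.convert(
    "https://youtube.com/watch?v=VIDEO_ID",
    youtube_transcript_languages=["es", "en"],
)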

YouTube Playlist Processing

Process multiple videos from a playlist:

from markitdown import MarkItDown

md = MarkItDown()

video_ids = [
    "VIDEO_ID_1",
    "VIDEO_ID_2",
    "VIDEO_ID_3"
]

transcripts = []
for vid_id in video_ids:
    url = f"https://youtube.com/watch?v={vid_id}"
    result = md.convert(url)
    transcripts.append({
        'video_id': vid_id,
        'transcript': result.text_content
    })

YouTube Use Cases

Content Analysis:

  • Analyze video content without watching
  • Extract key information from tutorials
  • Build searchable transcript databases

Research:

  • Process interview transcripts
  • Extract lecture content
  • Analyze presentation content

Accessibility:

  • Generate text versions of video content
  • Create searchable video archives

YouTube Limitations

  • Requires videos to have captions/transcripts available
  • Auto-generated captions may have transcription errors
  • Some videos may disable transcript access
  • Rate limiting may apply for bulk processing

EPUB Books

Convert EPUB e-books to Markdown format.

Basic EPUB Conversion

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("book.epub")
print(result.text_content)

EPUB Processing Features

What's extracted:

  • Book text content
  • Chapter structure
  • Headings and formatting
  • Table of contents
  • Footnotes and references

What's preserved:

  • Heading hierarchy
  • Text emphasis (bold, italic)
  • Links and references
  • Lists and tables
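
Because chapter headings survive as Markdown headings, the converted text can be split into chapters with plain string handling (a sketch; real books differ in which heading level marks a chapter):

import re

from markitdown import MarkItDown

md = MarkItDown()
text = md.convert("book.epub").text_content

# Split at top-level headings, keeping each heading line with its chapter
chapters = [c for c in re.split(r"(?m)^(?=# )", text) if c.strip()]

for i, chapter in enumerate(chapters, 1):
    title = chapter.splitlines()[0].lstrip("# ").strip()
    print(f"Chapter {i}: {title} ({len(chapter)} chars)")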

EPUB with Images

EPUB files often contain images (covers, diagrams, illustrations):

from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("illustrated_book.epub")

EPUB Use Cases

Research:

  • Convert textbooks to searchable format
  • Extract content for analysis
  • Build digital libraries

Content Processing:

  • Prepare books for LLM training data
  • Convert to different formats
  • Create summaries and extracts

Accessibility:

  • Convert to more accessible formats
  • Extract text for screen readers
  • Process for text-to-speech

RSS Feeds

Process RSS feeds to extract article content.

Basic RSS Processing

from markitdown import MarkItDown
import feedparser

md = MarkItDown()

# Parse RSS feed
feed = feedparser.parse("https://example.com/feed.xml")

# Convert each entry
for entry in feed.entries:
    # Save entry HTML
    with open("temp.html", "w") as f:
        f.write(entry.summary)

    result = md.convert("temp.html")
    print(f"## {entry.title}\n\n{result.text_content}\n\n")

Combined Web Content Workflows

Web Scraping Pipeline

from markitdown import MarkItDown
import requests
from bs4 import BeautifulSoup

md = MarkItDown()

def scrape_and_convert(url):
    """Scrape webpage and convert to Markdown."""
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Extract main content
    main_content = soup.find('article') or soup.find('main')

    if main_content:
        # Save HTML
        with open("temp.html", "w") as f:
            f.write(str(main_content))

        # Convert to Markdown
        result = md.convert("temp.html")
        return result.text_content

    return None

# Use it
markdown = scrape_and_convert("https://example.com/article")
print(markdown)

YouTube Learning Content Extraction

from markitdown import MarkItDown

md = MarkItDown()

# Course videos
course_videos = [
    ("https://youtube.com/watch?v=ID1", "Lesson 1: Introduction"),
    ("https://youtube.com/watch?v=ID2", "Lesson 2: Basics"),
    ("https://youtube.com/watch?v=ID3", "Lesson 3: Advanced")
]

course_content = []
for url, title in course_videos:
    result = md.convert(url)
    course_content.append(f"# {title}\n\n{result.text_content}")

# Combine into course document
full_course = "\n\n---\n\n".join(course_content)
with open("course_transcript.md", "w") as f:
    f.write(full_course)

Documentation Scraping

from markitdown import MarkItDown
import requests
from urllib.parse import urljoin

md = MarkItDown()

def scrape_documentation(base_url, page_urls):
    """Scrape multiple documentation pages."""
    docs = []

    for page_url in page_urls:
        full_url = urljoin(base_url, page_url)

        # Fetch page
        response = requests.get(full_url)
        with open("temp.html", "wb") as f:
            f.write(response.content)

        # Convert
        result = md.convert("temp.html")
        docs.append({
            'url': full_url,
            'content': result.text_content
        })

    return docs

# Example usage
base = "https://docs.example.com/"
pages = ["intro.html", "getting-started.html", "api.html"]
documentation = scrape_documentation(base, pages)
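
The returned list can then be flattened into a single reference file, with each page labeled by its source URL:

with open("documentation.md", "w", encoding="utf-8") as f:
    for doc in documentation:
        f.write(f"# {doc['url']}\n\n{doc['content']}\n\n---\n\n")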

EPUB Library Processing

from markitdown import MarkItDown
import os

md = MarkItDown()

def process_epub_library(library_path, output_path):
    """Convert all EPUB books in a directory."""
    for filename in os.listdir(library_path):
        if filename.endswith('.epub'):
            epub_path = os.path.join(library_path, filename)

            try:
                result = md.convert(epub_path)

                # Save markdown
                output_file = os.path.splitext(filename)[0] + '.md'
                output_full = os.path.join(output_path, output_file)

                with open(output_full, 'w', encoding='utf-8') as f:
                    f.write(result.text_content)

                print(f"Converted: {filename}")
            except Exception as e:
                print(f"Failed to convert {filename}: {e}")

# Process library
process_epub_library("books", "markdown_books")

Error Handling

HTML Conversion Errors

from markitdown import MarkItDown

md = MarkItDown()

try:
    result = md.convert("webpage.html")
    print(result.text_content)
except FileNotFoundError:
    print("HTML file not found")
except Exception as e:
    print(f"Conversion error: {e}")

YouTube Transcript Errors

from markitdown import MarkItDown

md = MarkItDown()

try:
    result = md.convert("https://youtube.com/watch?v=VIDEO_ID")
    print(result.text_content)
except Exception as e:
    print(f"Failed to get transcript: {e}")
    # Common issues: No transcript available, video unavailable, network error

EPUB Conversion Errors

from markitdown import MarkItDown

md = MarkItDown()

try:
    result = md.convert("book.epub")
    print(result.text_content)
except Exception as e:
    print(f"EPUB processing error: {e}")
    # Common issues: Corrupted file, unsupported DRM, invalid format

Best Practices

HTML Processing

  • Clean HTML before conversion for better results
  • Use readability libraries to extract main content
  • Handle different encodings appropriately (see the sketch after this list)
  • Remove unnecessary markup
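
For the encoding point, requests guesses the charset from HTTP headers and can guess wrong when none is declared. A common workaround (a sketch using requests' content-based apparent_encoding):

import requests

response = requests.get("https://example.com/article")

# If the server declared no charset, fall back to content-based detection
if "charset" not in response.headers.get("Content-Type", ""):
    response.encoding = response.apparent_encoding

# Re-encode the decoded text as UTF-8 for a predictable input file
with open("temp.html", "w", encoding="utf-8") as f:
    f.write(response.text)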

YouTube Processing

  • Check transcript availability before batch processing
  • Handle API rate limits gracefully
  • Store transcripts to avoid re-fetching (see the sketch after this list)
  • Respect YouTube's terms of service
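
A sketch combining these practices: cache each transcript to disk, skip videos already fetched, and pause between requests (the directory name and one-second delay are arbitrary choices):

import os
import time

from markitdown import MarkItDown

md = MarkItDown()

def fetch_transcripts(video_ids, cache_dir="transcripts", delay=1.0):
    """Fetch transcripts with a simple file cache and a polite delay."""
    os.makedirs(cache_dir, exist_ok=True)

    for vid_id in video_ids:
        cache_file = os.path.join(cache_dir, f"{vid_id}.md")
        if os.path.exists(cache_file):
            continue  # already fetched; avoid re-downloading

        try:
            result = md.convert(f"https://youtube.com/watch?v={vid_id}")
            with open(cache_file, "w", encoding="utf-8") as f:
                f.write(result.text_content)
        except Exception as e:
            print(f"Skipping {vid_id}: {e}")

        time.sleep(delay)  # space out requests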

EPUB Processing

  • DRM-protected EPUBs cannot be processed
  • Large EPUBs may require more memory
  • Some formatting may not translate perfectly
  • Test with representative samples first

Web Scraping Ethics

  • Respect robots.txt (see the sketch after this list)
  • Add delays between requests
  • Identify your scraper in User-Agent
  • Cache results to minimize requests
  • Follow website terms of service
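
For the robots.txt and User-Agent points, the standard library's urllib.robotparser handles the check (a sketch; the agent string is an example you should replace with your own contact details):

import requests
from urllib.robotparser import RobotFileParser

AGENT = "my-markitdown-scraper/1.0 (contact@example.com)"  # example identity

robots = RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()

url = "https://example.com/article"
if robots.can_fetch(AGENT, url):
    response = requests.get(url, headers={"User-Agent": AGENT})
else:
    print(f"robots.txt disallows fetching {url}")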