# Web Content Extraction Reference

This document provides detailed information about extracting content from HTML, YouTube, EPUB, and other web-based formats.

## HTML Conversion

Convert HTML files and web pages to clean Markdown format.

### Basic HTML Conversion

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("webpage.html")
print(result.text_content)
```

### HTML Processing Features

**What's preserved:**

- Headings (`<h1>` → `#`, `<h2>` → `##`, etc.)
- Paragraphs and text formatting
- Links (`<a>` → `[text](url)`)
- Lists (ordered and unordered)
- Tables → Markdown tables
- Code blocks and inline code
- Emphasis (bold, italic)

**What's removed:**

- Scripts and styles
- Navigation elements
- Advertising content
- Boilerplate markup
- HTML comments

### HTML from URLs

Convert web pages directly from URLs:

```python
from markitdown import MarkItDown
import requests

md = MarkItDown()

# Fetch and convert web page
response = requests.get("https://example.com/article")
with open("temp.html", "wb") as f:
    f.write(response.content)

result = md.convert("temp.html")
print(result.text_content)
```

### Clean Web Article Extraction

For extracting main content from web articles:

```python
from markitdown import MarkItDown
import requests
from readability import Document  # pip install readability-lxml

md = MarkItDown()

# Fetch page
url = "https://example.com/article"
response = requests.get(url)

# Extract main content
doc = Document(response.content)
html_content = doc.summary()

# Save and convert
with open("article.html", "w", encoding="utf-8") as f:
    f.write(html_content)

result = md.convert("article.html")
print(result.text_content)
```

### HTML with Images

HTML files containing images can be enhanced with LLM descriptions:

```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("page_with_images.html")
```

## YouTube Transcripts

Extract video transcripts from YouTube videos.

### Basic YouTube Conversion

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("https://www.youtube.com/watch?v=VIDEO_ID")
print(result.text_content)
```

### YouTube Installation

```bash
pip install 'markitdown[youtube]'
```

This installs the `youtube-transcript-api` dependency.

### YouTube URL Formats

MarkItDown supports various YouTube URL formats:

- `https://www.youtube.com/watch?v=VIDEO_ID`
- `https://youtu.be/VIDEO_ID`
- `https://www.youtube.com/embed/VIDEO_ID`
- `https://m.youtube.com/watch?v=VIDEO_ID`

### YouTube Transcript Features

**What's included:**

- Full video transcript text
- Timestamps (optional, depending on availability)
- Video metadata (title, description)
- Captions in available languages

**Transcript languages:**

```python
from markitdown import MarkItDown

md = MarkItDown()

# Get transcript in specific language (if available)
# Language codes: 'en', 'es', 'fr', 'de', etc.
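# Whether you can choose the language depends on your MarkItDown version.
# Some releases accept a "youtube_transcript_languages" keyword (this is an
# assumption; check your installed version), e.g.:
#   md.convert("https://youtube.com/watch?v=VIDEO_ID",
#              youtube_transcript_languages=("es",))
# Otherwise the first available transcript is returned.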
result = md.convert("https://youtube.com/watch?v=VIDEO_ID") ``` ### YouTube Playlist Processing Process multiple videos from a playlist: ```python from markitdown import MarkItDown md = MarkItDown() video_ids = [ "VIDEO_ID_1", "VIDEO_ID_2", "VIDEO_ID_3" ] transcripts = [] for vid_id in video_ids: url = f"https://youtube.com/watch?v={vid_id}" result = md.convert(url) transcripts.append({ 'video_id': vid_id, 'transcript': result.text_content }) ``` ### YouTube Use Cases **Content Analysis:** - Analyze video content without watching - Extract key information from tutorials - Build searchable transcript databases **Research:** - Process interview transcripts - Extract lecture content - Analyze presentation content **Accessibility:** - Generate text versions of video content - Create searchable video archives ### YouTube Limitations - Requires videos to have captions/transcripts available - Auto-generated captions may have transcription errors - Some videos may disable transcript access - Rate limiting may apply for bulk processing ## EPUB Books Convert EPUB e-books to Markdown format. ### Basic EPUB Conversion ```python from markitdown import MarkItDown md = MarkItDown() result = md.convert("book.epub") print(result.text_content) ``` ### EPUB Processing Features **What's extracted:** - Book text content - Chapter structure - Headings and formatting - Tables of contents - Footnotes and references **What's preserved:** - Heading hierarchy - Text emphasis (bold, italic) - Links and references - Lists and tables ### EPUB with Images EPUB files often contain images (covers, diagrams, illustrations): ```python from markitdown import MarkItDown from openai import OpenAI client = OpenAI() md = MarkItDown(llm_client=client, llm_model="gpt-4o") result = md.convert("illustrated_book.epub") ``` ### EPUB Use Cases **Research:** - Convert textbooks to searchable format - Extract content for analysis - Build digital libraries **Content Processing:** - Prepare books for LLM training data - Convert to different formats - Create summaries and extracts **Accessibility:** - Convert to more accessible formats - Extract text for screen readers - Process for text-to-speech ## RSS Feeds Process RSS feeds to extract article content. 
## RSS Feeds

Process RSS feeds to extract article content.

### Basic RSS Processing

```python
from markitdown import MarkItDown
import feedparser

md = MarkItDown()

# Parse RSS feed
feed = feedparser.parse("https://example.com/feed.xml")

# Convert each entry
for entry in feed.entries:
    # Save entry HTML
    with open("temp.html", "w", encoding="utf-8") as f:
        f.write(entry.summary)

    result = md.convert("temp.html")
    print(f"## {entry.title}\n\n{result.text_content}\n\n")
```

## Combined Web Content Workflows

### Web Scraping Pipeline

```python
from markitdown import MarkItDown
import requests
from bs4 import BeautifulSoup

md = MarkItDown()

def scrape_and_convert(url):
    """Scrape webpage and convert to Markdown."""
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Extract main content
    main_content = soup.find('article') or soup.find('main')
    if main_content:
        # Save HTML
        with open("temp.html", "w", encoding="utf-8") as f:
            f.write(str(main_content))

        # Convert to Markdown
        result = md.convert("temp.html")
        return result.text_content

    return None

# Use it
markdown = scrape_and_convert("https://example.com/article")
print(markdown)
```

### YouTube Learning Content Extraction

```python
from markitdown import MarkItDown

md = MarkItDown()

# Course videos
course_videos = [
    ("https://youtube.com/watch?v=ID1", "Lesson 1: Introduction"),
    ("https://youtube.com/watch?v=ID2", "Lesson 2: Basics"),
    ("https://youtube.com/watch?v=ID3", "Lesson 3: Advanced")
]

course_content = []
for url, title in course_videos:
    result = md.convert(url)
    course_content.append(f"# {title}\n\n{result.text_content}")

# Combine into course document
full_course = "\n\n---\n\n".join(course_content)
with open("course_transcript.md", "w", encoding="utf-8") as f:
    f.write(full_course)
```

### Documentation Scraping

```python
from markitdown import MarkItDown
import requests
from urllib.parse import urljoin

md = MarkItDown()

def scrape_documentation(base_url, page_urls):
    """Scrape multiple documentation pages."""
    docs = []
    for page_url in page_urls:
        full_url = urljoin(base_url, page_url)

        # Fetch page
        response = requests.get(full_url)
        with open("temp.html", "wb") as f:
            f.write(response.content)

        # Convert
        result = md.convert("temp.html")
        docs.append({
            'url': full_url,
            'content': result.text_content
        })

    return docs

# Example usage
base = "https://docs.example.com/"
pages = ["intro.html", "getting-started.html", "api.html"]
documentation = scrape_documentation(base, pages)
```

### EPUB Library Processing

```python
from markitdown import MarkItDown
import os

md = MarkItDown()

def process_epub_library(library_path, output_path):
    """Convert all EPUB books in a directory."""
    os.makedirs(output_path, exist_ok=True)  # ensure the output directory exists
    for filename in os.listdir(library_path):
        if filename.endswith('.epub'):
            epub_path = os.path.join(library_path, filename)
            try:
                result = md.convert(epub_path)

                # Save markdown
                output_file = filename.replace('.epub', '.md')
                output_full = os.path.join(output_path, output_file)
                with open(output_full, 'w', encoding='utf-8') as f:
                    f.write(result.text_content)

                print(f"Converted: {filename}")
            except Exception as e:
                print(f"Failed to convert {filename}: {e}")

# Process library
process_epub_library("books", "markdown_books")
```

## Error Handling

### HTML Conversion Errors

```python
from markitdown import MarkItDown

md = MarkItDown()

try:
    result = md.convert("webpage.html")
    print(result.text_content)
except FileNotFoundError:
    print("HTML file not found")
except Exception as e:
    print(f"Conversion error: {e}")
```

### YouTube Transcript Errors

```python
from markitdown import MarkItDown

md = MarkItDown()

try:
    result = md.convert("https://youtube.com/watch?v=VIDEO_ID")
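    # If this succeeds, result.text_content holds the transcript. Depending
    # on your markitdown version, failures may surface as exceptions from the
    # underlying youtube-transcript-api (exact types vary), so the broad
    # except below keeps this sketch version-agnostic.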
    print(result.text_content)
except Exception as e:
    print(f"Failed to get transcript: {e}")
    # Common issues: No transcript available, video unavailable, network error
```

### EPUB Conversion Errors

```python
from markitdown import MarkItDown

md = MarkItDown()

try:
    result = md.convert("book.epub")
    print(result.text_content)
except Exception as e:
    print(f"EPUB processing error: {e}")
    # Common issues: Corrupted file, unsupported DRM, invalid format
```

## Best Practices

### HTML Processing

- Clean HTML before conversion for better results
- Use readability libraries to extract main content
- Handle different encodings appropriately
- Remove unnecessary markup

### YouTube Processing

- Check transcript availability before batch processing
- Handle API rate limits gracefully
- Store transcripts to avoid re-fetching
- Respect YouTube's terms of service

### EPUB Processing

- DRM-protected EPUBs cannot be processed
- Large EPUBs may require more memory
- Some formatting may not translate perfectly
- Test with representative samples first

### Web Scraping Ethics

- Respect robots.txt (see the sketch below)
- Add delays between requests
- Identify your scraper in User-Agent
- Cache results to minimize requests
- Follow website terms of service
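The first three points can be folded into one fetch helper. A minimal sketch, assuming a hypothetical bot name and contact address (swap in your own) and a flat two-second delay (adjust to each site's tolerance):

```python
from markitdown import MarkItDown
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

import requests

md = MarkItDown()

# Hypothetical identifier; use your real project name and contact address.
USER_AGENT = "example-research-bot/0.1 (contact: you@example.com)"

def polite_convert(urls, delay=2.0):
    """Convert pages to Markdown while honoring robots.txt and pacing requests."""
    results = {}
    robots_cache = {}

    for url in urls:
        parts = urlparse(url)
        root = f"{parts.scheme}://{parts.netloc}"

        # Fetch and cache robots.txt once per site
        if root not in robots_cache:
            parser = RobotFileParser(root + "/robots.txt")
            try:
                parser.read()
            except OSError:
                pass  # unreachable robots.txt: parser stays conservative
            robots_cache[root] = parser

        if not robots_cache[root].can_fetch(USER_AGENT, url):
            print(f"Skipping (disallowed by robots.txt): {url}")
            continue

        # Identify the scraper, then pace the requests
        response = requests.get(url, headers={"User-Agent": USER_AGENT})
        with open("temp.html", "wb") as f:
            f.write(response.content)
        results[url] = md.convert("temp.html").text_content

        time.sleep(delay)

    return results

# Example usage
pages = polite_convert([
    "https://example.com/a",
    "https://example.com/b",
])
```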