# Web Content Extraction Reference
This document provides detailed information about extracting content from HTML, YouTube, EPUB, and other web-based formats.
## HTML Conversion
Convert HTML files and web pages to clean Markdown format.
### Basic HTML Conversion
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("webpage.html")
print(result.text_content)
```
### HTML Processing Features
**What's preserved:**
- Headings (`<h1>` → `#`, `<h2>` → `##`, etc.)
- Paragraphs and text formatting
- Links (`<a>` → `[text](url)`)
- Lists (ordered and unordered)
- Tables → Markdown tables
- Code blocks and inline code
- Emphasis (bold, italic)
**What's removed:**
- Scripts and styles
- Navigation elements
- Advertising content
- Boilerplate markup
- HTML comments
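To see these mappings in practice, here is a minimal sketch; the HTML snippet and the `sample.html` file name are illustrative:
```python
from markitdown import MarkItDown

# Illustrative snippet exercising the mappings listed above
html = """
<h1>Title</h1>
<p>Some <strong>bold</strong> text and a <a href="https://example.com">link</a>.</p>
<ul><li>First item</li><li>Second item</li></ul>
<script>console.log("this gets stripped");</script>
"""
with open("sample.html", "w", encoding="utf-8") as f:
    f.write(html)

md = MarkItDown()
result = md.convert("sample.html")
print(result.text_content)
# Expect roughly: "# Title", "**bold**", "[link](https://example.com)",
# "- First item" / "- Second item", and no trace of the <script> block
```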
### HTML from URLs
Convert web pages directly from URLs:
```python
from markitdown import MarkItDown
import requests
md = MarkItDown()
# Fetch and convert web page
response = requests.get("https://example.com/article")
with open("temp.html", "wb") as f:
f.write(response.content)
result = md.convert("temp.html")
print(result.text_content)
```
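Depending on your MarkItDown version, the temporary file may be unnecessary: `convert()` also accepts `http(s)` URLs directly, and recent releases expose an explicit `convert_url()` method. A minimal sketch:
```python
from markitdown import MarkItDown

md = MarkItDown()
# MarkItDown fetches the page itself; no requests or temp file needed
result = md.convert("https://example.com/article")
print(result.text_content)
```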
### Clean Web Article Extraction
For extracting main content from web articles:
```python
from markitdown import MarkItDown
import requests
from readability import Document # pip install readability-lxml
md = MarkItDown()
# Fetch page
url = "https://example.com/article"
response = requests.get(url)
# Extract main content
doc = Document(response.content)
html_content = doc.summary()
# Save and convert
with open("article.html", "w") as f:
f.write(html_content)
result = md.convert("article.html")
print(result.text_content)
```
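The hardcoded `temp.html` pattern used throughout this guide works but litters the working directory; a small helper built on the standard library's `tempfile` module (the helper name is illustrative) does the same job cleanly:
```python
import os
import tempfile
from markitdown import MarkItDown

md = MarkItDown()

def convert_html_string(html: str) -> str:
    """Convert an in-memory HTML string via a throwaway temp file."""
    # delete=False so the file can be reopened by name on all platforms
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=".html", encoding="utf-8", delete=False
    ) as f:
        f.write(html)
        temp_path = f.name
    try:
        return md.convert(temp_path).text_content
    finally:
        os.unlink(temp_path)
```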
### HTML with Images
HTML files containing images can be enhanced with LLM descriptions:
```python
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("page_with_images.html")
```
## YouTube Transcripts
Extract video transcripts from YouTube videos.
### Basic YouTube Conversion
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("https://www.youtube.com/watch?v=VIDEO_ID")
print(result.text_content)
```
### YouTube Installation
```bash
pip install 'markitdown[youtube]'
```
This installs the `youtube-transcript-api` dependency.
### YouTube URL Formats
MarkItDown supports various YouTube URL formats (a normalizer sketch follows this list):
- `https://www.youtube.com/watch?v=VIDEO_ID`
- `https://youtu.be/VIDEO_ID`
- `https://www.youtube.com/embed/VIDEO_ID`
- `https://m.youtube.com/watch?v=VIDEO_ID`
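All of these forms reduce to the same video ID. For batch code, a small normalizer (an illustrative helper, not part of MarkItDown) keeps inputs uniform:
```python
from urllib.parse import urlparse, parse_qs

def extract_video_id(url: str):
    """Pull the video ID out of the common YouTube URL shapes."""
    parsed = urlparse(url)
    if parsed.hostname == "youtu.be":
        return parsed.path.lstrip("/") or None
    if parsed.hostname and parsed.hostname.endswith("youtube.com"):
        if parsed.path == "/watch":
            return parse_qs(parsed.query).get("v", [None])[0]
        if parsed.path.startswith("/embed/"):
            return parsed.path.split("/embed/", 1)[1] or None
    return None

assert extract_video_id("https://youtu.be/VIDEO_ID") == "VIDEO_ID"
assert extract_video_id("https://m.youtube.com/watch?v=VIDEO_ID") == "VIDEO_ID"
```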
### YouTube Transcript Features
**What's included:**
- Full video transcript text
- Timestamps (optional, depending on availability)
- Video metadata (title, description)
- Captions in available languages
**Transcript languages:**
```python
from markitdown import MarkItDown
md = MarkItDown()
# By default, the converter fetches the best-available transcript
# (typically English when one exists)
result = md.convert("https://youtube.com/watch?v=VIDEO_ID")
```
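Recent MarkItDown releases forward a `youtube_transcript_languages` option to the underlying `youtube-transcript-api`; treat the option name as an assumption and check your installed version if it has no effect:
```python
from markitdown import MarkItDown

md = MarkItDown()
# Assumes a MarkItDown version that forwards this option to
# youtube-transcript-api; tries the language codes in order
result = md.convert(
    "https://youtube.com/watch?v=VIDEO_ID",
    youtube_transcript_languages=["es", "en"],
)
print(result.text_content)
```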
### YouTube Playlist Processing
Process multiple videos from a playlist:
```python
from markitdown import MarkItDown
md = MarkItDown()
video_ids = [
    "VIDEO_ID_1",
    "VIDEO_ID_2",
    "VIDEO_ID_3"
]
transcripts = []
for vid_id in video_ids:
    url = f"https://youtube.com/watch?v={vid_id}"
    result = md.convert(url)
    transcripts.append({
        'video_id': vid_id,
        'transcript': result.text_content
    })
```
### YouTube Use Cases
**Content Analysis:**
- Analyze video content without watching
- Extract key information from tutorials
- Build searchable transcript databases
**Research:**
- Process interview transcripts
- Extract lecture content
- Analyze presentation content
**Accessibility:**
- Generate text versions of video content
- Create searchable video archives
### YouTube Limitations
- Requires videos to have captions/transcripts available
- Auto-generated captions may have transcription errors
- Some videos may disable transcript access
- Rate limiting may apply for bulk processing (see the throttled-batch sketch below)
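For bulk work, a simple throttle plus per-video error capture addresses the last two limitations; this sketch assumes nothing beyond the standard library and the `convert()` call shown earlier:
```python
import time
from markitdown import MarkItDown

md = MarkItDown()

def fetch_transcripts(video_ids, delay_seconds=2.0):
    """Fetch transcripts one at a time, pausing between requests."""
    results = {}
    for vid_id in video_ids:
        url = f"https://youtube.com/watch?v={vid_id}"
        try:
            results[vid_id] = md.convert(url).text_content
        except Exception as e:
            # No transcript, disabled access, rate limit, network error
            results[vid_id] = f"ERROR: {e}"
        time.sleep(delay_seconds)  # space out requests
    return results
```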
## EPUB Books
Convert EPUB e-books to Markdown format.
### Basic EPUB Conversion
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("book.epub")
print(result.text_content)
```
### EPUB Processing Features
**What's extracted:**
- Book text content
- Chapter structure
- Headings and formatting
- Table of contents
- Footnotes and references
**What's preserved:**
- Heading hierarchy
- Text emphasis (bold, italic)
- Links and references
- Lists and tables
### EPUB with Images
EPUB files often contain images (covers, diagrams, illustrations):
```python
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("illustrated_book.epub")
```
### EPUB Use Cases
**Research:**
- Convert textbooks to searchable format
- Extract content for analysis
- Build digital libraries
**Content Processing:**
- Prepare books for LLM training data
- Convert to different formats
- Create summaries and extracts
**Accessibility:**
- Convert to more accessible formats
- Extract text for screen readers
- Process for text-to-speech
## RSS Feeds
Process RSS feeds to extract article content.
### Basic RSS Processing
```python
from markitdown import MarkItDown
import feedparser  # pip install feedparser

md = MarkItDown()
# Parse RSS feed
feed = feedparser.parse("https://example.com/feed.xml")
# Convert each entry
for entry in feed.entries:
    # Save entry HTML
    with open("temp.html", "w", encoding="utf-8") as f:
        f.write(entry.summary)
    result = md.convert("temp.html")
    print(f"## {entry.title}\n\n{result.text_content}\n\n")
```
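Note that `entry.summary` is often only a teaser. To convert full articles, fetch `entry.link` and reuse the readability pipeline from earlier (assumes `readability-lxml` is installed):
```python
from markitdown import MarkItDown
import feedparser
import requests
from readability import Document

md = MarkItDown()
feed = feedparser.parse("https://example.com/feed.xml")
for entry in feed.entries:
    # Fetch the full article and extract its main content
    response = requests.get(entry.link)
    html_content = Document(response.content).summary()
    with open("temp.html", "w", encoding="utf-8") as f:
        f.write(html_content)
    result = md.convert("temp.html")
    print(f"## {entry.title}\n\n{result.text_content}\n\n")
```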
## Combined Web Content Workflows
### Web Scraping Pipeline
```python
from markitdown import MarkItDown
import requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4

md = MarkItDown()

def scrape_and_convert(url):
    """Scrape a webpage and convert its main content to Markdown."""
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    # Extract main content
    main_content = soup.find('article') or soup.find('main')
    if main_content:
        # Save HTML
        with open("temp.html", "w", encoding="utf-8") as f:
            f.write(str(main_content))
        # Convert to Markdown
        result = md.convert("temp.html")
        return result.text_content
    return None

# Use it
markdown = scrape_and_convert("https://example.com/article")
print(markdown)
```
### YouTube Learning Content Extraction
```python
from markitdown import MarkItDown

md = MarkItDown()
# Course videos
course_videos = [
    ("https://youtube.com/watch?v=ID1", "Lesson 1: Introduction"),
    ("https://youtube.com/watch?v=ID2", "Lesson 2: Basics"),
    ("https://youtube.com/watch?v=ID3", "Lesson 3: Advanced")
]
course_content = []
for url, title in course_videos:
    result = md.convert(url)
    course_content.append(f"# {title}\n\n{result.text_content}")

# Combine into course document
full_course = "\n\n---\n\n".join(course_content)
with open("course_transcript.md", "w", encoding="utf-8") as f:
    f.write(full_course)
```
### Documentation Scraping
```python
from markitdown import MarkItDown
import requests
from urllib.parse import urljoin

md = MarkItDown()

def scrape_documentation(base_url, page_urls):
    """Scrape multiple documentation pages."""
    docs = []
    for page_url in page_urls:
        full_url = urljoin(base_url, page_url)
        # Fetch page
        response = requests.get(full_url)
        with open("temp.html", "wb") as f:
            f.write(response.content)
        # Convert
        result = md.convert("temp.html")
        docs.append({
            'url': full_url,
            'content': result.text_content
        })
    return docs

# Example usage
base = "https://docs.example.com/"
pages = ["intro.html", "getting-started.html", "api.html"]
documentation = scrape_documentation(base, pages)
```
### EPUB Library Processing
```python
from markitdown import MarkItDown
import os

md = MarkItDown()

def process_epub_library(library_path, output_path):
    """Convert all EPUB books in a directory."""
    os.makedirs(output_path, exist_ok=True)  # ensure the output directory exists
    for filename in os.listdir(library_path):
        if filename.endswith('.epub'):
            epub_path = os.path.join(library_path, filename)
            try:
                result = md.convert(epub_path)
                # Save markdown
                output_file = filename.replace('.epub', '.md')
                output_full = os.path.join(output_path, output_file)
                with open(output_full, 'w', encoding='utf-8') as f:
                    f.write(result.text_content)
                print(f"Converted: {filename}")
            except Exception as e:
                print(f"Failed to convert {filename}: {e}")

# Process library
process_epub_library("books", "markdown_books")
```
## Error Handling
### HTML Conversion Errors
```python
from markitdown import MarkItDown
md = MarkItDown()
try:
    result = md.convert("webpage.html")
    print(result.text_content)
except FileNotFoundError:
    print("HTML file not found")
except Exception as e:
    print(f"Conversion error: {e}")
```
### YouTube Transcript Errors
```python
from markitdown import MarkItDown
md = MarkItDown()
try:
    result = md.convert("https://youtube.com/watch?v=VIDEO_ID")
    print(result.text_content)
except Exception as e:
    print(f"Failed to get transcript: {e}")
    # Common issues: no transcript available, video unavailable, network error
```
### EPUB Conversion Errors
```python
from markitdown import MarkItDown
md = MarkItDown()
try:
    result = md.convert("book.epub")
    print(result.text_content)
except Exception as e:
    print(f"EPUB processing error: {e}")
    # Common issues: corrupted file, unsupported DRM, invalid format
```
## Best Practices
### HTML Processing
- Clean HTML before conversion for better results
- Use readability libraries to extract main content
- Handle different encodings appropriately (see the encoding sketch after this list)
- Remove unnecessary markup
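On the encoding point: `requests` trusts the server's charset header, which is sometimes missing or wrong. A common workaround (a sketch, independent of MarkItDown) is to re-decode with the sniffed encoding before saving:
```python
import requests

response = requests.get("https://example.com/article")
# Fall back to content-based charset detection when headers are unreliable
response.encoding = response.apparent_encoding
with open("temp.html", "w", encoding="utf-8") as f:
    f.write(response.text)  # decoded with the detected charset
```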
### YouTube Processing
- Check transcript availability before batch processing
- Handle API rate limits gracefully
- Store transcripts to avoid re-fetching (a caching sketch follows this list)
- Respect YouTube's terms of service
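A minimal on-disk cache keyed by video ID covers the storage point; the paths and naming here are illustrative:
```python
import os
from markitdown import MarkItDown

md = MarkItDown()

def get_transcript_cached(video_id, cache_dir="transcripts"):
    """Return a cached transcript, fetching and storing it on a miss."""
    os.makedirs(cache_dir, exist_ok=True)
    cache_path = os.path.join(cache_dir, f"{video_id}.md")
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as f:
            return f.read()
    result = md.convert(f"https://youtube.com/watch?v={video_id}")
    with open(cache_path, "w", encoding="utf-8") as f:
        f.write(result.text_content)
    return result.text_content
```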
### EPUB Processing
- DRM-protected EPUBs cannot be processed
- Large EPUBs may require more memory
- Some formatting may not translate perfectly
- Test with representative samples first
### Web Scraping Ethics
- Respect robots.txt
- Add delays between requests
- Identify your scraper in User-Agent
- Cache results to minimize requests
- Follow website terms of service
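Putting several of these guidelines into practice, a polite fetch helper might look like this (the User-Agent string and delay are illustrative; adjust them to your project):
```python
import time
import requests

session = requests.Session()
# Identify your scraper; many sites block the default requests User-Agent
session.headers["User-Agent"] = "my-research-bot/1.0 (contact@example.com)"

def polite_get(url, delay_seconds=1.0):
    """Fetch a URL with an identified client and a fixed delay between calls."""
    time.sleep(delay_seconds)
    response = session.get(url, timeout=30)
    response.raise_for_status()
    return response
```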