# Web Content Extraction Reference

This document provides detailed information about extracting content from HTML, YouTube, EPUB, and other web-based formats.

## HTML Conversion

Convert HTML files and web pages to clean Markdown format.

### Basic HTML Conversion

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("webpage.html")
print(result.text_content)
```

### HTML Processing Features

**What's preserved:**
- Headings (`<h1>` → `#`, `<h2>` → `##`, etc.)
- Paragraphs and text formatting
- Links (`<a>` → `[text](url)`)
- Lists (ordered and unordered)
- Tables → Markdown tables
- Code blocks and inline code
- Emphasis (bold, italic)

**What's removed:**
- Scripts and styles
- Navigation elements
- Advertising content
- Boilerplate markup
- HTML comments
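
To make the mapping concrete, here is a minimal round-trip sketch: it writes a small HTML snippet to a temporary file and converts it. The exact Markdown MarkItDown emits can vary slightly between versions.

```python
import tempfile
from markitdown import MarkItDown

html = """
<h1>Sample Page</h1>
<p>See the <a href="https://example.com/docs">documentation</a> for details.</p>
<ul><li>First item</li><li>Second item</li></ul>
"""

md = MarkItDown()

# MarkItDown converts files, so stage the snippet in a temporary .html file
with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False, encoding="utf-8") as f:
    f.write(html)
    temp_path = f.name

result = md.convert(temp_path)
print(result.text_content)  # heading, link, and list rendered as Markdown
```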

### HTML from URLs

Convert web pages directly from URLs:

```python
from markitdown import MarkItDown
import requests

md = MarkItDown()

# Fetch and convert web page
response = requests.get("https://example.com/article")
with open("temp.html", "wb") as f:
    f.write(response.content)

result = md.convert("temp.html")
print(result.text_content)
```
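
Writing to a fixed `temp.html` can collide when several conversions run at once. A slightly more defensive variant is sketched below; the temporary-file and HTTP-status handling is ordinary `requests`/`tempfile` usage rather than a MarkItDown feature, and depending on the installed MarkItDown version, `md.convert()` may also accept an `http(s)` URL directly (as it does for YouTube links below).

```python
import tempfile
import requests
from markitdown import MarkItDown

md = MarkItDown()

response = requests.get("https://example.com/article", timeout=30)
response.raise_for_status()  # fail early on 4xx/5xx responses

# Stage the page in a throwaway file to avoid clobbering a shared temp.html
with tempfile.NamedTemporaryFile(suffix=".html", delete=False) as f:
    f.write(response.content)
    temp_path = f.name

result = md.convert(temp_path)
print(result.text_content)
```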

### Clean Web Article Extraction

For extracting main content from web articles:

```python
from markitdown import MarkItDown
import requests
from readability import Document  # pip install readability-lxml

md = MarkItDown()

# Fetch page
url = "https://example.com/article"
response = requests.get(url)

# Extract main content
doc = Document(response.content)
html_content = doc.summary()

# Save and convert
with open("article.html", "w", encoding="utf-8") as f:
    f.write(html_content)

result = md.convert("article.html")
print(result.text_content)
```
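
readability-lxml also exposes the article title it detected, which is handy as a top-level heading. A small, optional extension of the example above (reusing the same `doc` and `result` objects):

```python
# Prepend the detected title to the converted body and save as Markdown
title = doc.title()  # readability-lxml's best guess at the article title
article_markdown = f"# {title}\n\n{result.text_content}"

with open("article.md", "w", encoding="utf-8") as f:
    f.write(article_markdown)
```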

### HTML with Images

HTML files containing images can be enhanced with LLM descriptions:

```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("page_with_images.html")
```

## YouTube Transcripts

Extract video transcripts from YouTube videos.

### Basic YouTube Conversion

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("https://www.youtube.com/watch?v=VIDEO_ID")
print(result.text_content)
```

### YouTube Installation

```bash
pip install 'markitdown[youtube]'
```

This installs the `youtube-transcript-api` dependency.

### YouTube URL Formats

MarkItDown supports various YouTube URL formats:
- `https://www.youtube.com/watch?v=VIDEO_ID`
- `https://youtu.be/VIDEO_ID`
- `https://www.youtube.com/embed/VIDEO_ID`
- `https://m.youtube.com/watch?v=VIDEO_ID`
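
When gathering links from several sources, it helps to normalize them to one canonical form before converting. The helper below is a plain-Python sketch (not part of MarkItDown) that extracts the video ID from the formats listed above:

```python
from urllib.parse import urlparse, parse_qs

def youtube_video_id(url):
    """Extract the video ID from common YouTube URL formats, or return None."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host.endswith("youtu.be"):
        return parsed.path.lstrip("/") or None
    if "youtube.com" in host:
        if parsed.path == "/watch":
            return parse_qs(parsed.query).get("v", [None])[0]
        if parsed.path.startswith("/embed/"):
            return parsed.path.split("/embed/", 1)[1] or None
    return None

# Normalize any supported form to the canonical watch URL
vid = youtube_video_id("https://youtu.be/VIDEO_ID")
if vid:
    canonical_url = f"https://www.youtube.com/watch?v={vid}"
```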

### YouTube Transcript Features

**What's included:**
- Full video transcript text
- Timestamps (optional, depending on availability)
- Video metadata (title, description)
- Captions in available languages

**Transcript languages:**
```python
from markitdown import MarkItDown

md = MarkItDown()

# MarkItDown returns whichever transcript the video exposes; captions may
# exist in several languages ('en', 'es', 'fr', 'de', ...), and language
# selection options depend on the installed MarkItDown version.
result = md.convert("https://youtube.com/watch?v=VIDEO_ID")
```

### YouTube Playlist Processing

Process multiple videos from a playlist:

```python
from markitdown import MarkItDown

md = MarkItDown()

video_ids = [
    "VIDEO_ID_1",
    "VIDEO_ID_2",
    "VIDEO_ID_3"
]

transcripts = []
for vid_id in video_ids:
    url = f"https://youtube.com/watch?v={vid_id}"
    result = md.convert(url)
    transcripts.append({
        'video_id': vid_id,
        'transcript': result.text_content
    })
```

### YouTube Use Cases

**Content Analysis:**
- Analyze video content without watching
- Extract key information from tutorials
- Build searchable transcript databases

**Research:**
- Process interview transcripts
- Extract lecture content
- Analyze presentation content

**Accessibility:**
- Generate text versions of video content
- Create searchable video archives

### YouTube Limitations

- Requires videos to have captions/transcripts available
- Auto-generated captions may have transcription errors
- Some videos may disable transcript access
- Rate limiting may apply for bulk processing
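
The last two points matter most for bulk processing. A defensive loop (a sketch; the delay value is arbitrary) skips videos that fail and paces requests:

```python
import time
from markitdown import MarkItDown

md = MarkItDown()

video_ids = ["VIDEO_ID_1", "VIDEO_ID_2", "VIDEO_ID_3"]
transcripts = {}

for vid_id in video_ids:
    url = f"https://www.youtube.com/watch?v={vid_id}"
    try:
        result = md.convert(url)
        transcripts[vid_id] = result.text_content
    except Exception as e:
        # No transcript, transcript access disabled, or a transient network error
        print(f"Skipping {vid_id}: {e}")
    time.sleep(2)  # arbitrary pause between videos to stay under rate limits
```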

## EPUB Books

Convert EPUB e-books to Markdown format.

### Basic EPUB Conversion

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("book.epub")
print(result.text_content)
```

### EPUB Processing Features

**What's extracted:**
- Book text content
- Chapter structure
- Headings and formatting
- Table of contents
- Footnotes and references

**What's preserved:**
- Heading hierarchy
- Text emphasis (bold, italic)
- Links and references
- Lists and tables

### EPUB with Images

EPUB files often contain images (covers, diagrams, illustrations):

```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("illustrated_book.epub")
```

### EPUB Use Cases

**Research:**
- Convert textbooks to searchable format
- Extract content for analysis
- Build digital libraries

**Content Processing:**
- Prepare books for LLM training data
- Convert to different formats
- Create summaries and extracts

**Accessibility:**
- Convert to more accessible formats
- Extract text for screen readers
- Process for text-to-speech

## RSS Feeds

Process RSS feeds to extract article content.

### Basic RSS Processing

```python
from markitdown import MarkItDown
import feedparser

md = MarkItDown()

# Parse RSS feed
feed = feedparser.parse("https://example.com/feed.xml")

# Convert each entry
for entry in feed.entries:
    # Save entry HTML
    with open("temp.html", "w", encoding="utf-8") as f:
        f.write(entry.summary)

    result = md.convert("temp.html")
    print(f"## {entry.title}\n\n{result.text_content}\n\n")
```
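
Many feeds only publish a short excerpt in `entry.summary`; when a feed also carries the full article body, feedparser exposes it under `entry.content`. A small sketch for preferring the fuller version when it is present (whether it exists depends entirely on the individual feed):

```python
import feedparser

feed = feedparser.parse("https://example.com/feed.xml")

for entry in feed.entries:
    if "content" in entry:
        html = entry.content[0].value  # full article body, when published
    else:
        html = entry.summary           # fall back to the excerpt
    # ...write `html` to a temporary file and convert it as shown above
```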

## Combined Web Content Workflows

### Web Scraping Pipeline

```python
from markitdown import MarkItDown
import requests
from bs4 import BeautifulSoup

md = MarkItDown()

def scrape_and_convert(url):
    """Scrape webpage and convert to Markdown."""
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Extract main content
    main_content = soup.find('article') or soup.find('main')

    if main_content:
        # Save HTML
        with open("temp.html", "w", encoding="utf-8") as f:
            f.write(str(main_content))

        # Convert to Markdown
        result = md.convert("temp.html")
        return result.text_content

    return None

# Use it
markdown = scrape_and_convert("https://example.com/article")
print(markdown)
```

### YouTube Learning Content Extraction

```python
from markitdown import MarkItDown

md = MarkItDown()

# Course videos
course_videos = [
    ("https://youtube.com/watch?v=ID1", "Lesson 1: Introduction"),
    ("https://youtube.com/watch?v=ID2", "Lesson 2: Basics"),
    ("https://youtube.com/watch?v=ID3", "Lesson 3: Advanced")
]

course_content = []
for url, title in course_videos:
    result = md.convert(url)
    course_content.append(f"# {title}\n\n{result.text_content}")

# Combine into course document
full_course = "\n\n---\n\n".join(course_content)
with open("course_transcript.md", "w", encoding="utf-8") as f:
    f.write(full_course)
```

### Documentation Scraping

```python
from markitdown import MarkItDown
import requests
from urllib.parse import urljoin

md = MarkItDown()

def scrape_documentation(base_url, page_urls):
    """Scrape multiple documentation pages."""
    docs = []

    for page_url in page_urls:
        full_url = urljoin(base_url, page_url)

        # Fetch page
        response = requests.get(full_url)
        with open("temp.html", "wb") as f:
            f.write(response.content)

        # Convert
        result = md.convert("temp.html")
        docs.append({
            'url': full_url,
            'content': result.text_content
        })

    return docs

# Example usage
base = "https://docs.example.com/"
pages = ["intro.html", "getting-started.html", "api.html"]
documentation = scrape_documentation(base, pages)
```

### EPUB Library Processing

```python
from markitdown import MarkItDown
import os

md = MarkItDown()

def process_epub_library(library_path, output_path):
    """Convert all EPUB books in a directory."""
    os.makedirs(output_path, exist_ok=True)  # make sure the output directory exists

    for filename in os.listdir(library_path):
        if filename.endswith('.epub'):
            epub_path = os.path.join(library_path, filename)

            try:
                result = md.convert(epub_path)

                # Save markdown
                output_file = filename.replace('.epub', '.md')
                output_full = os.path.join(output_path, output_file)

                with open(output_full, 'w', encoding='utf-8') as f:
                    f.write(result.text_content)

                print(f"Converted: {filename}")
            except Exception as e:
                print(f"Failed to convert {filename}: {e}")

# Process library
process_epub_library("books", "markdown_books")
```

## Error Handling

### HTML Conversion Errors

```python
from markitdown import MarkItDown

md = MarkItDown()

try:
    result = md.convert("webpage.html")
    print(result.text_content)
except FileNotFoundError:
    print("HTML file not found")
except Exception as e:
    print(f"Conversion error: {e}")
```

### YouTube Transcript Errors

```python
from markitdown import MarkItDown

md = MarkItDown()

try:
    result = md.convert("https://youtube.com/watch?v=VIDEO_ID")
    print(result.text_content)
except Exception as e:
    print(f"Failed to get transcript: {e}")
    # Common issues: No transcript available, video unavailable, network error
```

### EPUB Conversion Errors

```python
from markitdown import MarkItDown

md = MarkItDown()

try:
    result = md.convert("book.epub")
    print(result.text_content)
except Exception as e:
    print(f"EPUB processing error: {e}")
    # Common issues: Corrupted file, unsupported DRM, invalid format
```

## Best Practices

### HTML Processing
- Clean HTML before conversion for better results
- Use readability libraries to extract main content
- Handle different encodings appropriately (see the sketch after this list)
- Remove unnecessary markup
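
For the encoding point, a common pattern is to let `requests` pick the declared charset, fall back to content sniffing, and re-save the page as UTF-8 so the conversion step always sees one known encoding. This is a sketch of standard `requests` usage, not a MarkItDown feature:

```python
import requests

response = requests.get("https://example.com/article", timeout=30)

# Use the charset declared in the HTTP headers; fall back to sniffing the body
if not response.encoding:
    response.encoding = response.apparent_encoding

html_text = response.text  # decoded with the encoding chosen above

# Re-save as UTF-8 before handing the file to MarkItDown
with open("temp.html", "w", encoding="utf-8") as f:
    f.write(html_text)
```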

### YouTube Processing
- Check transcript availability before batch processing
- Handle API rate limits gracefully
- Store transcripts to avoid re-fetching (a caching sketch follows this list)
- Respect YouTube's terms of service
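
A simple on-disk cache keyed by video ID avoids re-fetching transcripts across runs; a minimal sketch (the cache directory name is arbitrary):

```python
import os
from markitdown import MarkItDown

md = MarkItDown()
CACHE_DIR = "transcript_cache"  # arbitrary location
os.makedirs(CACHE_DIR, exist_ok=True)

def get_transcript(video_id):
    """Return the transcript for a video, fetching it only on the first call."""
    cache_file = os.path.join(CACHE_DIR, f"{video_id}.md")
    if os.path.exists(cache_file):
        with open(cache_file, encoding="utf-8") as f:
            return f.read()

    result = md.convert(f"https://www.youtube.com/watch?v={video_id}")
    with open(cache_file, "w", encoding="utf-8") as f:
        f.write(result.text_content)
    return result.text_content
```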

### EPUB Processing
- DRM-protected EPUBs cannot be processed
- Large EPUBs may require more memory
- Some formatting may not translate perfectly
- Test with representative samples first

### Web Scraping Ethics
- Respect robots.txt
- Add delays between requests
- Identify your scraper in User-Agent
- Cache results to minimize requests
- Follow website terms of service
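
The first three points can be folded into one small fetch helper. The sketch below uses only the standard library's `urllib.robotparser` together with `requests`; the User-Agent string and delay are placeholders to adapt:

```python
import time
import requests
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

USER_AGENT = "my-markitdown-scraper/0.1 (contact@example.com)"  # placeholder

def polite_get(url, delay=2.0):
    """Fetch a page only if robots.txt allows it, pausing between requests."""
    parts = urlparse(url)
    robots = RobotFileParser()
    robots.set_url(urljoin(f"{parts.scheme}://{parts.netloc}", "/robots.txt"))
    robots.read()

    if not robots.can_fetch(USER_AGENT, url):
        raise PermissionError(f"robots.txt disallows fetching {url}")

    time.sleep(delay)  # crude rate limiting between requests
    return requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=30)
```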