Initial commit

2025-11-30 08:32:01 +08:00
commit e7f6367823
8 changed files with 2219 additions and 0 deletions
--- a/skills/article-extractor/SKILL.md
+++ b/skills/article-extractor/SKILL.md
@@ -0,0 +1,371 @@
+---
+name: article-extractor
+description: Extract clean article content from URLs (blog posts, articles, tutorials) and save as readable text. Use when user wants to download, extract, or save an article/blog post from a URL without ads, navigation, or clutter.
+allowed-tools:
+  - Bash
+  - Write
+---
+
+# Article Extractor
+
+This skill extracts the main content from web articles and blog posts, removing navigation, ads, newsletter signups, and other clutter, and saves the result as clean, readable text.
+
+## When to Use This Skill
+
+Activate when the user:
+- Provides an article/blog URL and wants the text content
+- Asks to "download this article"
+- Wants to "extract the content from [URL]"
+- Asks to "save this blog post as text"
+- Needs clean article text without distractions
+
+## How It Works
+
+### Steps:
+1. **Check which tools are installed** (reader or trafilatura)
+2. **Download and extract the article** using the best available tool
+3. **Clean up the content** (remove extra whitespace, format properly)
+4. **Save to a file** named after the article title
+5. **Confirm the location** and show a preview
+
+## Installation Check
+
+Check for article extraction tools in this order:
+
+### Option 1: reader (Recommended - Mozilla's Readability)
+
+```bash
+command -v reader
+```
+
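+If not installed:
+```bash
+# Package names differ between readability CLIs; after installing,
+# run `command -v reader` again to confirm the package provides a `reader` binary.
+npm install -g @mozilla/readability-cli
+# or
+npm install -g reader-cli
+```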
+
+### Option 2: trafilatura (Python-based, very good)
+
+```bash
+command -v trafilatura
+```
+
+If not installed:
+```bash
+pip3 install trafilatura
+```
+
+### Option 3: Fallback (curl + simple parsing)
+
+If neither tool is available, fall back to curl plus basic text extraction. This is less reliable, but it works without extra dependencies.
+
+## Extraction Methods
+
+### Method 1: Using reader (Best for most articles)
+
+```bash
+# Extract article
+reader "URL" > article.txt
+```
+
+**Pros:**
+- Based on Mozilla's Readability algorithm
+- Excellent at removing clutter
+- Preserves article structure
+
+### Method 2: Using trafilatura (Best for blogs/news)
+
+```bash
+# Extract article
+trafilatura --URL "URL" --output-format txt > article.txt
+
+# Or with more options
+trafilatura --URL "URL" --output-format txt --no-comments --no-tables > article.txt
+```
+
+**Pros:**
+- Very accurate extraction
+- Good with various site structures
+- Handles multiple languages
+
+**Options:**
+- `--no-comments`: Skip comment sections
+- `--no-tables`: Skip data tables
+- `--precision`: Favor precision over recall
+- `--recall`: Extract more content (may include some noise)
+
+### Method 3: Fallback (curl + basic parsing)
+
+```bash
+# Download the page (following redirects) and extract basic content
+curl -sL "URL" | python3 -c "
+from html.parser import HTMLParser
+import sys
+
+class ArticleExtractor(HTMLParser):
+    SKIP_TAGS = {'script', 'style', 'nav', 'header', 'footer', 'aside'}
+    CONTENT_TAGS = {'p', 'article', 'main', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}
+
+    def __init__(self):
+        super().__init__()
+        self.skip_depth = 0     # > 0 while inside a skipped element
+        self.content_depth = 0  # > 0 while inside a content element
+        self.content = []
+
+    def handle_starttag(self, tag, attrs):
+        if tag in self.SKIP_TAGS:
+            self.skip_depth += 1
+        elif tag in self.CONTENT_TAGS:
+            self.content_depth += 1
+
+    def handle_endtag(self, tag):
+        # Leave the element again so later page text is not captured
+        if tag in self.SKIP_TAGS and self.skip_depth:
+            self.skip_depth -= 1
+        elif tag in self.CONTENT_TAGS and self.content_depth:
+            self.content_depth -= 1
+
+    def handle_data(self, data):
+        if self.content_depth and not self.skip_depth and data.strip():
+            self.content.append(data.strip())
+
+    def get_content(self):
+        return '\n\n'.join(self.content)
+
+parser = ArticleExtractor()
+parser.feed(sys.stdin.read())
+print(parser.get_content())
+" > article.txt
+```
+
+**Note:** This fallback only separates text from tags, so it is less reliable than the dedicated tools, but it needs nothing beyond curl and Python 3.
+
+## Getting Article Title
+
+Extract title for filename:
+
+### Using reader:
+```bash
+# reader outputs markdown with title at top
+TITLE=$(reader "URL" | head -n 1 | sed 's/^# //')
+```
+
+### Using trafilatura:
+```bash
+# Get metadata including title
+TITLE=$(trafilatura --URL "URL" --json | python3 -c "import json, sys; print(json.load(sys.stdin).get('title') or 'Article')")
+```
+
+### Using curl (fallback):
+```bash
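+# grep -P (Perl regex) requires GNU grep; the BSD grep shipped with macOS does not support it
+TITLE=$(curl -sL "URL" | grep -oP '<title>\K[^<]+' | head -n 1 | sed 's/ - .*//' | sed 's/ | .*//')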
+```
+
+## Filename Creation
+
+Clean title for filesystem:
+
+```bash
+# Get title
+TITLE="Article Title from Website"
+
+# Clean for filesystem (replace or delete special chars, limit length)
+FILENAME=$(echo "$TITLE" | tr '/:|' '---' | tr -d '?"<>' | cut -c 1-100 | sed 's/ *$//')
+
+# Add extension
+FILENAME="${FILENAME}.txt"
+```
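+
+The cleaning steps above can also be wrapped in a reusable function. This is a minimal sketch rather than part of any extraction tool: the name `sanitize_title` and the `article` default for an empty title are assumptions made for illustration.
+
+```bash
+# sanitize_title (hypothetical helper): turn an article title into a safe filename stem.
+sanitize_title() {
+    local title="$1" cleaned
+    cleaned=$(printf '%s' "$title" \
+        | tr '/:|' '---' \
+        | tr -d '?"<>' \
+        | cut -c 1-100 \
+        | sed 's/^ *//; s/ *$//')
+    # Assumed default so an empty title never yields a bare ".txt"
+    printf '%s\n' "${cleaned:-article}"
+}
+
+FILENAME="$(sanitize_title 'What is AI? A Guide: Part 1/2').txt"
+echo "$FILENAME"
+```
+
+Deleting characters with `tr -d` avoids passing `tr` an empty replacement set, which some implementations reject.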
+
+## Complete Workflow
+
+```bash
+ARTICLE_URL="https://example.com/article"
+
+# Check for tools
+if command -v reader &> /dev/null; then
+    TOOL="reader"
+    echo "Using reader (Mozilla Readability)"
+elif command -v trafilatura &> /dev/null; then
+    TOOL="trafilatura"
+    echo "Using trafilatura"
+else
+    TOOL="fallback"
+    echo "Using fallback method (may be less accurate)"
+fi
+
+# Extract article
+case $TOOL in
+    reader)
+        # Get content
+        reader "$ARTICLE_URL" > temp_article.txt
+
+        # Get title (first line, with the leading markdown "# " stripped)
+        TITLE=$(head -n 1 temp_article.txt | sed 's/^# //')
+        ;;
+
+    trafilatura)
+        # Get title from metadata
+        METADATA=$(trafilatura --URL "$ARTICLE_URL" --json)
+        TITLE=$(echo "$METADATA" | python3 -c "import json, sys; print(json.load(sys.stdin).get('title', 'Article'))")
+
+        # Get clean content
+        trafilatura --URL "$ARTICLE_URL" --output-format txt --no-comments > temp_article.txt
+        ;;
+
+    fallback)
+        # Get title
+        TITLE=$(curl -sL "$ARTICLE_URL" | grep -oP '<title>\K[^<]+' | head -n 1)
+        TITLE=${TITLE%% - *}  # Remove site name
+        TITLE=${TITLE%% | *}  # Remove site name (alternate)
+
+        # Get content (basic extraction, following redirects)
+        curl -sL "$ARTICLE_URL" | python3 -c "
+from html.parser import HTMLParser
+import sys
+
+class ArticleExtractor(HTMLParser):
+    SKIP_TAGS = {'script', 'style', 'nav', 'header', 'footer', 'aside', 'form'}
+    CONTENT_TAGS = {'p', 'article', 'main', 'h1', 'h2', 'h3'}
+
+    def __init__(self):
+        super().__init__()
+        self.skip_depth = 0     # > 0 while inside a skipped element
+        self.content_depth = 0  # > 0 while inside a content element
+        self.content = []
+
+    def handle_starttag(self, tag, attrs):
+        if tag in self.SKIP_TAGS:
+            self.skip_depth += 1
+        elif tag in self.CONTENT_TAGS:
+            self.content_depth += 1
+
+    def handle_endtag(self, tag):
+        # Leave the element again so later page text is not captured
+        if tag in self.SKIP_TAGS and self.skip_depth:
+            self.skip_depth -= 1
+        elif tag in self.CONTENT_TAGS and self.content_depth:
+            self.content_depth -= 1
+
+    def handle_data(self, data):
+        if self.content_depth and not self.skip_depth and data.strip():
+            self.content.append(data.strip())
+
+    def get_content(self):
+        return '\n\n'.join(self.content)
+
+parser = ArticleExtractor()
+parser.feed(sys.stdin.read())
+print(parser.get_content())
+" > temp_article.txt
+        ;;
+esac
+
+# Clean filename
+FILENAME=$(echo "$TITLE" | tr '/:|' '---' | tr -d '?"<>' | cut -c 1-80 | sed 's/ *$//' | sed 's/^ *//')
+FILENAME="${FILENAME:-article}.txt"
+
+# Move to final filename
+mv temp_article.txt "$FILENAME"
+
+# Show result
+echo "✓ Extracted article: $TITLE"
+echo "✓ Saved to: $FILENAME"
+echo ""
+echo "Preview (first 10 lines):"
+head -n 10 "$FILENAME"
+```
+
+## Error Handling
+
+### Common Issues
+
+**1. Tool not installed**
+- Try alternate tool (reader → trafilatura → fallback)
+- Offer to install: "Install reader with: npm install -g reader-cli"
+
+**2. Paywall or login required**
+- Extraction tools may fail
+- Inform user: "This article requires authentication. Cannot extract."
+
+**3. Invalid URL**
+- Check URL format
+- Follow redirects (for example `curl -L`) in case the page has moved
+
+**4. No content extracted**
+- Site may render its content with JavaScript, which these tools typically do not execute
+- Try a different tool or the fallback method
+- Inform user if extraction still fails
+
+**5. Special characters in title**
+- Clean title for filesystem
+- Remove: `/`, `:`, `?`, `"`, `<`, `>`, `|`
+- Replace with `-` or remove
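+
+For issue 3, the URL format can be checked before any tool runs. A minimal sketch, assuming only `http` and `https` URLs should be accepted; `is_http_url` is a hypothetical helper name, and the check is deliberately loose (it does not validate the host).
+
+```bash
+# is_http_url (hypothetical helper): succeed only for strings that start with
+# http:// or https://, have something after the scheme, and contain no whitespace.
+is_http_url() {
+    case "$1" in
+        *[[:space:]]*) return 1 ;;
+        http://?*|https://?*) return 0 ;;
+        *) return 1 ;;
+    esac
+}
+
+ARTICLE_URL="https://example.com/article"
+if ! is_http_url "$ARTICLE_URL"; then
+    echo "Error: not a valid http(s) URL: $ARTICLE_URL" >&2
+fi
+```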
+
+## Output Format
+
+### Saved File Contains:
+- Article title (if available)
+- Author (if available from tool)
+- Main article text
+- Section headings
+- No navigation, ads, or clutter
+
+### What Gets Removed:
+- Navigation menus
+- Ads and promotional content
+- Newsletter signup forms
+- Related articles sidebars
+- Comment sections (optional)
+- Social media buttons
+- Cookie notices
+
+## Tips for Best Results
+
+**1. Use reader for most articles**
+- Best all-around tool
+- Based on Firefox Reader View
+- Works on most news sites and blogs
+
+**2. Use trafilatura for:**
+- Academic articles
+- News sites
+- Blogs with complex layouts
+- Non-English content
+
+**3. Fallback method limitations:**
+- May include some noise
+- Less accurate paragraph detection
+- Better than nothing for simple sites
+
+**4. Check extraction quality:**
+- Always show preview to user
+- Ask if it looks correct
+- Offer to try different tool if needed
+
+## Example Usage
+
+**Simple extraction:**
+```bash
+# User: "Extract https://example.com/article"
+reader "https://example.com/article" > temp.txt
+TITLE=$(head -n 1 temp.txt | sed 's/^# //')
+FILENAME="$(echo "$TITLE" | tr '/' '-').txt"
+mv temp.txt "$FILENAME"
+echo "✓ Saved to: $FILENAME"
+```
+
+**With error handling:**
+```bash
+if ! reader "$URL" > temp.txt 2>/dev/null; then
+    # reader is missing or failed; fall back to trafilatura if it is installed
+    if command -v trafilatura &> /dev/null; then
+        trafilatura --URL "$URL" --output-format txt > temp.txt
+    else
+        echo "Error: Could not extract article. Install reader or trafilatura." >&2
+        exit 1
+    fi
+fi
+```
+
+## Best Practices
+
+- ✅ Always show preview after extraction (first 10 lines)
+- ✅ Verify extraction succeeded before saving
+- ✅ Clean filename for filesystem compatibility
+- ✅ Try fallback method if primary fails
+- ✅ Inform user which tool was used
+- ✅ Keep filename length reasonable (< 100 chars)
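+
+The "verify extraction succeeded" item can be made concrete with a size check before the file is renamed. A minimal sketch: `extraction_ok` is a hypothetical helper, and the 200-byte default is an arbitrary assumed threshold, not a value taken from any of the tools.
+
+```bash
+# extraction_ok (hypothetical helper): succeed if FILE exists and holds
+# at least MIN_BYTES bytes (assumed default: 200).
+extraction_ok() {
+    local file="$1" min_bytes="${2:-200}" size
+    [ -f "$file" ] || return 1
+    # Arithmetic expansion strips the padding some wc implementations add
+    size=$(( $(wc -c < "$file") ))
+    [ "$size" -ge "$min_bytes" ]
+}
+
+if ! extraction_ok temp_article.txt; then
+    echo "Error: extraction produced little or no text; try another tool." >&2
+fi
+```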
+
+## After Extraction
+
+Display to user:
+1. "✓ Extracted: [Article Title]"
+2. "✓ Saved to: [filename]"
+3. Show preview (first 10-15 lines)
+4. File size and location
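+
+The four items above could be produced by a single function. A minimal sketch, assuming the title is already known and the file path is relative to the current directory; `report_extraction` is a hypothetical helper name.
+
+```bash
+# report_extraction (hypothetical helper): print title, location, size, and a short preview.
+# Assumes FILE is a path relative to the current directory.
+report_extraction() {
+    local title="$1" file="$2" size
+    size=$(( $(wc -c < "$file") ))
+    echo "✓ Extracted: $title"
+    echo "✓ Saved to: $PWD/$file"
+    echo "Size: $size bytes"
+    echo ""
+    echo "Preview (first 10 lines):"
+    head -n 10 "$file"
+}
+```
+
+It would be called as `report_extraction "$TITLE" "$FILENAME"` once the file has been moved to its final name.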
+
+Ask if needed:
+- "Would you like me to also create a Ship-Learn-Next plan from this?" (if using ship-learn-next skill)
+- "Should I extract another article?"