
Aggregate Crypto News

Multi-source cryptocurrency news aggregation system with AI-powered sentiment analysis, trend detection, and market impact prediction. Monitors 50+ crypto news sources, social media platforms, and official project announcements in real-time.

Supported Sources: CoinDesk, CoinTelegraph, Decrypt, The Block, Bitcoin Magazine, CryptoSlate, Twitter/X, Reddit, Medium, Telegram, Discord, official project blogs, SEC filings, exchange announcements

When to Use This Command

Use /aggregate-news when you need to:

  • Monitor breaking crypto news across multiple sources in real-time
  • Analyze sentiment shifts before market movements
  • Track specific coins, projects, or topics (DeFi, NFTs, regulation)
  • Identify trending narratives and meme coins early
  • Correlate news events with price movements
  • Research fundamental analysis for trading decisions
  • Monitor regulatory developments and exchange listings
  • Track whale social media activity and influencer sentiment

DON'T use this command for:

  • Single-source news reading - Use direct RSS feeds
  • Historical price analysis - Use /analyze-trends instead
  • On-chain data analysis - Use /analyze-chain instead
  • Technical analysis - Use /generate-signal instead
  • Real-time price tracking - Use /track-price instead

Design Decisions

Why Multi-Source Aggregation? Single sources have bias and delays. Aggregating 50+ sources provides complete market coverage and cross-verification of breaking news.

Why Sentiment Analysis?

  • AI sentiment chosen: News sentiment predicts price movements 15-30 minutes before impact
  • Manual reading rejected: Impossible to read 1000+ articles/day across sources

Why Real-Time vs Hourly Digests?

  • Real-time chosen: Breaking news impacts prices within seconds (exchange hacks, regulations)
  • Hourly rejected: 60-minute delays miss trading opportunities

Why Deduplication?

  • Deduplication chosen: 50+ sources report the same stories, creating roughly 80% redundancy
  • All articles kept rejected: Would generate 5000+ alerts/day

Why Market Impact Scoring?

  • Scoring chosen: Not all news is equally important (0-100 scale)
  • Equal weight rejected: Treats minor updates same as exchange hacks

Prerequisites

Before running this command, ensure you have:

  1. News API Access (at least 3 recommended):

    • NewsAPI.org key (free: 100 requests/day, paid: unlimited)
    • CryptoCompare News API (free tier available)
    • CoinGecko News API (no key required, rate limited)
    • Alternative Data News API (premium)
    • Messari News API (enterprise)
  2. Social Media APIs:

    • Twitter/X API v2 bearer token (Basic tier: $100/month)
    • Reddit API credentials (free with rate limits)
    • Telegram bot token (for channel monitoring)
    • Discord bot token + OAuth2 (for server monitoring)
  3. AI/ML Services (choose one):

    • OpenAI API key (GPT-4 for sentiment: $0.03/1K tokens)
    • Anthropic Claude API key (recommended: $0.015/1K tokens)
    • Google Vertex AI credentials (PaLM 2)
    • Local sentiment model (FinBERT, CryptoBERT)
  4. Database (for historical tracking):

    • PostgreSQL 13+ with pgvector (similarity search)
    • MongoDB 5+ (document storage)
    • Elasticsearch 8+ (full-text search, recommended)
  5. Infrastructure:

    • Linux server (Ubuntu 20.04+) or Docker
    • 8GB RAM minimum (16GB for 50+ sources)
    • 500GB storage (6 months history)
    • Redis (for real-time caching and deduplication)
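
A quick preflight check can confirm the required credentials are in place before wiring up the pipeline. The sketch below is illustrative, assuming the environment variable names used later in this document (ANTHROPIC_API_KEY, TWITTER_BEARER_TOKEN, POSTGRES_PASSWORD); adjust the lists to match your own setup.

#!/usr/bin/env python3
"""Preflight check: verify required credentials are present before starting the aggregator."""
import os
import sys

# Variable names are assumptions based on the configuration used in this document
REQUIRED_ENV = ["ANTHROPIC_API_KEY", "TWITTER_BEARER_TOKEN", "POSTGRES_PASSWORD"]
OPTIONAL_ENV = ["NEWSAPI_KEY", "TELEGRAM_BOT_TOKEN", "DISCORD_BOT_TOKEN"]

def check_env() -> bool:
    missing = [name for name in REQUIRED_ENV if not os.getenv(name)]
    for name in OPTIONAL_ENV:
        if not os.getenv(name):
            print(f"note: optional credential {name} not set")
    if missing:
        print(f"error: missing required credentials: {', '.join(missing)}")
        return False
    return True

if __name__ == "__main__":
    sys.exit(0 if check_env() else 1)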

Implementation Process

Step 1: Configure News Sources

Create config/news_sources.json:

{
  "sources": {
    "mainstream_crypto": [
      {
        "name": "CoinDesk",
        "type": "rss",
        "url": "https://www.coindesk.com/arc/outboundfeeds/rss/",
        "impact_weight": 1.0,
        "credibility": 0.95
      },
      {
        "name": "CoinTelegraph",
        "type": "rss",
        "url": "https://cointelegraph.com/rss",
        "impact_weight": 0.9,
        "credibility": 0.90
      },
      {
        "name": "The Block",
        "type": "rss",
        "url": "https://www.theblock.co/rss.xml",
        "impact_weight": 0.95,
        "credibility": 0.92
      },
      {
        "name": "Decrypt",
        "type": "rss",
        "url": "https://decrypt.co/feed",
        "impact_weight": 0.85,
        "credibility": 0.88
      }
    ],
    "social_media": [
      {
        "name": "CryptoTwitter",
        "type": "twitter_search",
        "keywords": ["#Bitcoin", "#Ethereum", "#Crypto", "$BTC", "$ETH"],
        "min_followers": 10000,
        "verified_only": false,
        "impact_weight": 0.7,
        "credibility": 0.70
      },
      {
        "name": "CryptocurrencySubreddit",
        "type": "reddit",
        "subreddit": "CryptoCurrency",
        "min_upvotes": 100,
        "impact_weight": 0.6,
        "credibility": 0.65
      }
    ],
    "official_sources": [
      {
        "name": "SEC_Filings",
        "type": "sec_rss",
        "url": "https://www.sec.gov/cgi-bin/browse-edgar?action=getcurrent&CIK=&type=&company=bitcoin&dateb=&owner=exclude&start=0&count=40&output=atom",
        "impact_weight": 1.5,
        "credibility": 1.0
      }
    ]
  },
  "fetch_intervals": {
    "high_priority": 60,
    "medium_priority": 300,
    "low_priority": 900
  },
  "content_filters": {
    "min_word_count": 50,
    "exclude_keywords": ["advertisement", "sponsored", "partner content"],
    "language": "en"
  }
}
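
Before the aggregator consumes this file, it helps to validate it once at startup. A minimal sketch of such a loader (field names match the JSON above; the load_sources helper itself is hypothetical):

import json
from pathlib import Path

def load_sources(path: str = "config/news_sources.json") -> dict:
    """Load and sanity-check the news sources configuration."""
    config = json.loads(Path(path).read_text())

    for category, sources in config["sources"].items():
        for source in sources:
            # Every source needs a name, a known type, and a credibility in [0, 1]
            assert source.get("name"), f"unnamed source in {category}"
            assert source.get("type") in {"rss", "twitter_search", "reddit", "sec_rss"}, source
            assert 0.0 <= source.get("credibility", 0.0) <= 1.0, source["name"]

    # Fetch intervals are in seconds; high priority should be the shortest
    intervals = config["fetch_intervals"]
    assert intervals["high_priority"] <= intervals["medium_priority"] <= intervals["low_priority"]
    return config

if __name__ == "__main__":
    cfg = load_sources()
    total = sum(len(s) for s in cfg["sources"].values())
    print(f"Loaded {total} sources across {len(cfg['sources'])} categories")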

Step 2: Set Up Sentiment Analysis

Create config/sentiment_config.json:

{
  "provider": "anthropic",
  "model": "claude-3-5-sonnet-20241022",
  "api_key_env": "ANTHROPIC_API_KEY",

  "sentiment_scale": {
    "very_bearish": -1.0,
    "bearish": -0.5,
    "neutral": 0.0,
    "bullish": 0.5,
    "very_bullish": 1.0
  },

  "entity_extraction": {
    "enabled": true,
    "categories": [
      "cryptocurrencies",
      "exchanges",
      "projects",
      "people",
      "regulations",
      "events"
    ]
  },

  "market_impact": {
    "factors": [
      "source_credibility",
      "sentiment_strength",
      "entity_relevance",
      "social_engagement",
      "breaking_news_indicator"
    ],
    "weights": {
      "source_credibility": 0.3,
      "sentiment_strength": 0.25,
      "entity_relevance": 0.2,
      "social_engagement": 0.15,
      "breaking_news_indicator": 0.1
    }
  },

  "cache_ttl_seconds": 3600,
  "batch_size": 10
}
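
The market_impact.weights above combine five normalized factors into the 0-100 impact score used throughout this command. A minimal sketch of that weighted sum (the factor values and the score_impact helper are illustrative, not the exact production formula):

def score_impact(factors: dict, weights: dict) -> int:
    """Combine normalized factors (each 0.0-1.0) into a 0-100 market impact score."""
    score = sum(factors.get(name, 0.0) * weight for name, weight in weights.items())
    return round(min(max(score, 0.0), 1.0) * 100)

weights = {
    "source_credibility": 0.30,
    "sentiment_strength": 0.25,
    "entity_relevance": 0.20,
    "social_engagement": 0.15,
    "breaking_news_indicator": 0.10,
}

# Example: a highly credible source breaking strongly bullish, highly relevant news
factors = {
    "source_credibility": 0.95,
    "sentiment_strength": 0.92,
    "entity_relevance": 0.90,
    "social_engagement": 0.60,
    "breaking_news_indicator": 1.00,
}

print(score_impact(factors, weights))  # ~88, i.e. in the CRITICAL range (>= 80)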

Step 3: Initialize Database Schema

-- PostgreSQL with pgvector for semantic search
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE news_articles (
    id SERIAL PRIMARY KEY,
    article_id VARCHAR(255) UNIQUE NOT NULL,
    source VARCHAR(100) NOT NULL,
    source_type VARCHAR(50) NOT NULL,
    title TEXT NOT NULL,
    content TEXT,
    summary TEXT,
    url TEXT UNIQUE NOT NULL,
    author VARCHAR(255),
    published_at TIMESTAMP NOT NULL,
    fetched_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    language VARCHAR(10) DEFAULT 'en',

    -- Sentiment analysis
    sentiment_score FLOAT,
    sentiment_label VARCHAR(20),
    confidence FLOAT,

    -- Market impact
    market_impact_score INTEGER,
    impact_category VARCHAR(20),

    -- Engagement metrics
    social_shares INTEGER DEFAULT 0,
    comments_count INTEGER DEFAULT 0,
    engagement_score FLOAT,

    -- Vector embedding for similarity search
    embedding vector(1536)
);

-- PostgreSQL does not support inline INDEX clauses; create indexes separately
CREATE INDEX idx_published_at ON news_articles (published_at DESC);
CREATE INDEX idx_source ON news_articles (source);
CREATE INDEX idx_sentiment ON news_articles (sentiment_score);
CREATE INDEX idx_impact ON news_articles (market_impact_score DESC);

CREATE TABLE article_entities (
    id SERIAL PRIMARY KEY,
    article_id INTEGER REFERENCES news_articles(id) ON DELETE CASCADE,
    entity_type VARCHAR(50) NOT NULL,
    entity_name VARCHAR(255) NOT NULL,
    entity_ticker VARCHAR(20),
    relevance_score FLOAT,
    sentiment_score FLOAT
);

CREATE INDEX idx_entity_name ON article_entities (entity_name);
CREATE INDEX idx_entity_ticker ON article_entities (entity_ticker);

CREATE TABLE trending_topics (
    id SERIAL PRIMARY KEY,
    topic VARCHAR(255) UNIQUE NOT NULL,
    category VARCHAR(50),
    mention_count INTEGER DEFAULT 1,
    avg_sentiment FLOAT,
    market_impact_score INTEGER,
    first_seen TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    last_updated TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    is_trending BOOLEAN DEFAULT FALSE
);

CREATE INDEX idx_trending ON trending_topics (is_trending, market_impact_score DESC);

CREATE TABLE article_duplicates (
    id SERIAL PRIMARY KEY,
    article_id INTEGER REFERENCES news_articles(id),
    duplicate_of INTEGER REFERENCES news_articles(id),
    similarity_score FLOAT,
    detected_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_article ON article_duplicates (article_id);

-- Elasticsearch index mapping (alternative/complement)
-- Run via Elasticsearch REST API
PUT /crypto_news
{
  "mappings": {
    "properties": {
      "title": {"type": "text", "analyzer": "english"},
      "content": {"type": "text", "analyzer": "english"},
      "source": {"type": "keyword"},
      "published_at": {"type": "date"},
      "sentiment_score": {"type": "float"},
      "market_impact_score": {"type": "integer"},
      "entities": {
        "type": "nested",
        "properties": {
          "name": {"type": "keyword"},
          "type": {"type": "keyword"},
          "sentiment": {"type": "float"}
        }
      }
    }
  }
}
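
With the mapping in place, articles can be mirrored into Elasticsearch for full-text search alongside PostgreSQL. A minimal sketch using the official Python client (the index name matches the mapping above; the sample document and id are illustrative):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

doc = {
    "title": "SEC Approves Bitcoin Spot ETF from BlackRock",
    "content": "The U.S. Securities and Exchange Commission has approved...",
    "source": "CoinDesk",
    "published_at": "2025-10-11T14:20:00Z",
    "sentiment_score": 0.92,
    "market_impact_score": 98,
    "entities": [
        {"name": "Bitcoin", "type": "cryptocurrency", "sentiment": 0.95},
        {"name": "SEC", "type": "regulator", "sentiment": 0.8},
    ],
}

# Use a stable id (e.g. the article_id hash) so re-indexing the same article overwrites it
es.index(index="crypto_news", id="abc123", document=doc)

# Full-text search restricted to high-impact articles
hits = es.search(
    index="crypto_news",
    query={
        "bool": {
            "must": [{"match": {"title": "bitcoin etf"}}],
            "filter": [{"range": {"market_impact_score": {"gte": 80}}}],
        }
    },
)
print(hits["hits"]["total"])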

Step 4: Run News Aggregator

Execute the aggregation script:

# Start real-time aggregation (all sources)
python3 news_aggregator.py --sources-config config/news_sources.json \
    --sentiment-config config/sentiment_config.json \
    --min-impact 40 \
    --deduplicate

# Monitor specific topics
python3 news_aggregator.py --topics "Bitcoin,Ethereum,DeFi" \
    --alert-threshold 70

# Export to webhook
python3 news_aggregator.py --webhook-url https://your-api.com/news \
    --format json \
    --interval 300

# Generate daily digest
python3 news_aggregator.py --digest daily \
    --email your@email.com \
    --top-stories 20

Step 5: Set Up Monitoring Dashboard

Create Elasticsearch/Kibana dashboard for visualization:

# docker-compose.yml
version: '3.8'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
    ports:
      - "9200:9200"
    volumes:
      - es_data:/usr/share/elasticsearch/data

  kibana:
    image: docker.elastic.co/kibana/kibana:8.11.0
    ports:
      - "5601:5601"
    depends_on:
      - elasticsearch

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data

volumes:
  es_data:
  redis_data:
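
Before launching the aggregator against this stack, a short readiness check avoids silent failures while Elasticsearch or Redis are still starting. A minimal sketch (hosts and ports match the compose file above):

import sys
import redis
import requests

def services_ready() -> bool:
    """Return True when the Redis and Elasticsearch containers are reachable and healthy."""
    ok = True
    try:
        redis.Redis(host="localhost", port=6379).ping()
    except redis.exceptions.ConnectionError as e:
        print(f"Redis not ready: {e}")
        ok = False
    try:
        health = requests.get("http://localhost:9200/_cluster/health", timeout=5).json()
        if health.get("status") not in ("yellow", "green"):
            print(f"Elasticsearch unhealthy: {health.get('status')}")
            ok = False
    except requests.RequestException as e:
        print(f"Elasticsearch not ready: {e}")
        ok = False
    return ok

if __name__ == "__main__":
    sys.exit(0 if services_ready() else 1)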

Output Format

The command generates 5 output files:

1. news_alerts_YYYYMMDD_HHMMSS.json

Real-time high-impact news alerts:

{
  "alert_id": "alert_1634567890_coindesk_abc123",
  "timestamp": "2025-10-11T14:23:45Z",
  "article": {
    "title": "SEC Approves Bitcoin Spot ETF from BlackRock",
    "source": "CoinDesk",
    "author": "Jamie Crawley",
    "url": "https://www.coindesk.com/...",
    "published_at": "2025-10-11T14:20:00Z",
    "summary": "The U.S. Securities and Exchange Commission has approved BlackRock's application for a spot Bitcoin ETF, marking a historic moment for cryptocurrency adoption."
  },
  "sentiment_analysis": {
    "overall_sentiment": "very_bullish",
    "sentiment_score": 0.92,
    "confidence": 0.95,
    "reasoning": "Major regulatory approval from SEC, institutional adoption milestone, reduces regulatory risk, expected to increase demand significantly."
  },
  "market_impact": {
    "impact_score": 98,
    "impact_category": "CRITICAL",
    "affected_assets": [
      {
        "ticker": "BTC",
        "name": "Bitcoin",
        "expected_impact": "very_positive",
        "reasoning": "Direct positive impact - increased institutional demand, reduced regulatory uncertainty"
      },
      {
        "ticker": "ETH",
        "name": "Ethereum",
        "expected_impact": "positive",
        "reasoning": "Indirect positive - sets precedent for ETH ETF approval"
      }
    ],
    "predicted_price_movement": {
      "direction": "up",
      "magnitude": "high",
      "timeframe": "immediate_to_1_week",
      "confidence": 0.88
    }
  },
  "entities_mentioned": [
    {"name": "SEC", "type": "regulator", "sentiment": 0.8},
    {"name": "BlackRock", "type": "institution", "sentiment": 0.9},
    {"name": "Bitcoin", "type": "cryptocurrency", "sentiment": 0.95}
  ],
  "social_engagement": {
    "twitter_mentions": 15234,
    "reddit_upvotes": 8945,
    "trending_score": 95
  }
}
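
Downstream consumers (bots, dashboards) typically read these alert files and act only on the highest-impact entries. A minimal sketch of such a consumer (the file-name pattern matches the output above; the threshold and print action are placeholders for your own notification logic):

import glob
import json

def load_critical_alerts(pattern: str = "news_alerts_*.json", min_score: int = 80) -> list:
    """Collect alerts at or above the CRITICAL threshold from exported alert files."""
    alerts = []
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            alert = json.load(f)
        if alert["market_impact"]["impact_score"] >= min_score:
            alerts.append(alert)
    return alerts

for alert in load_critical_alerts():
    movement = alert["market_impact"]["predicted_price_movement"]
    print(f'{alert["article"]["title"]} -> {movement["direction"]} ({movement["confidence"]:.0%})')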

2. daily_summary_YYYYMMDD.json

Daily news summary with trends:

{
  "date": "2025-10-11",
  "summary": {
    "total_articles": 1247,
    "unique_sources": 52,
    "avg_sentiment": 0.23,
    "sentiment_distribution": {
      "very_bearish": 89,
      "bearish": 234,
      "neutral": 567,
      "bullish": 289,
      "very_bullish": 68
    }
  },
  "trending_topics": [
    {
      "topic": "Bitcoin ETF Approval",
      "mentions": 342,
      "avg_sentiment": 0.87,
      "impact_score": 95,
      "related_tickers": ["BTC", "ETH"],
      "trend_direction": "surging"
    },
    {
      "topic": "Ethereum Shanghai Upgrade",
      "mentions": 198,
      "avg_sentiment": 0.65,
      "impact_score": 78,
      "related_tickers": ["ETH"],
      "trend_direction": "rising"
    }
  ],
  "sentiment_shifts": [
    {
      "asset": "BTC",
      "previous_24h_sentiment": 0.12,
      "current_sentiment": 0.68,
      "shift": 0.56,
      "significance": "major_positive_shift"
    }
  ],
  "top_sources": [
    {"source": "CoinDesk", "articles": 87, "avg_impact": 72},
    {"source": "The Block", "articles": 65, "avg_impact": 78},
    {"source": "CoinTelegraph", "articles": 134, "avg_impact": 65}
  ]
}

3. Trending narratives (CSV)

Trending narratives and memes:

rank,narrative,mentions_24h,mentions_7d,growth_rate,avg_sentiment,related_coins,sample_headlines
1,"Bitcoin ETF Approval",342,892,283%,0.87,"BTC,ETH","SEC Approves BlackRock Bitcoin ETF; Historic Day for Crypto; ETF Approval Sends BTC to $50K"
2,"AI Crypto Projects",215,456,112%,0.72,"FET,AGIX,RNDR","AI Tokens Surge 40% on ChatGPT Integration; Fetch.ai Partners with Bosch"
3,"Memecoin Season",189,1234,53%,0.45,"DOGE,SHIB,PEPE","New Memecoin PEPE2 Explodes 300%; Dogecoin Whale Activity Spikes"

4. entity_sentiment_tracker.json

Per-entity sentiment tracking:

{
  "timestamp": "2025-10-11T14:23:45Z",
  "entities": {
    "cryptocurrencies": [
      {
        "ticker": "BTC",
        "name": "Bitcoin",
        "mentions_24h": 3421,
        "sentiment_score": 0.68,
        "sentiment_trend": "strongly_improving",
        "market_impact_score": 85,
        "top_keywords": ["ETF", "approval", "institutional", "adoption"]
      },
      {
        "ticker": "ETH",
        "name": "Ethereum",
        "mentions_24h": 1876,
        "sentiment_score": 0.52,
        "sentiment_trend": "improving",
        "market_impact_score": 72,
        "top_keywords": ["Shanghai", "staking", "upgrade", "Layer2"]
      }
    ],
    "exchanges": [
      {
        "name": "Binance",
        "mentions_24h": 567,
        "sentiment_score": -0.23,
        "sentiment_trend": "declining",
        "reason": "Regulatory concerns, DOJ investigation news"
      }
    ],
    "regulations": [
      {
        "topic": "SEC Crypto Policy",
        "mentions_24h": 892,
        "sentiment_score": 0.45,
        "sentiment_trend": "improving",
        "reason": "ETF approval signals more favorable stance"
      }
    ]
  }
}

5. market_moving_events.json

Critical market-moving events detected:

{
  "date": "2025-10-11",
  "critical_events": [
    {
      "event_id": "evt_20251011_001",
      "title": "SEC Approves Bitcoin Spot ETF",
      "category": "regulation",
      "impact_score": 98,
      "detected_at": "2025-10-11T14:20:45Z",
      "price_impact_observed": {
        "BTC": {
          "price_before": 43250,
          "price_15min_after": 47800,
          "change_pct": 10.5,
          "volume_increase_pct": 340
        }
      },
      "news_velocity": {
        "articles_first_hour": 87,
        "sources_reporting": 45,
        "social_mentions_first_hour": 234567
      }
    }
  ]
}

Code Example 1: Core News Aggregator (Python)

#!/usr/bin/env python3
"""
Production-grade cryptocurrency news aggregator with AI sentiment analysis.
Supports 50+ news sources, real-time monitoring, and market impact scoring.
"""

import asyncio
import hashlib
import json
import logging
import os
import re
import time
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Set
from dataclasses import dataclass, asdict
from urllib.parse import urlparse

import aiohttp
import feedparser
import psycopg2
from psycopg2.extras import execute_batch
import redis
from anthropic import AsyncAnthropic
from elasticsearch import AsyncElasticsearch
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)


@dataclass
class NewsArticle:
    """Represents a news article with all metadata."""
    article_id: str
    source: str
    source_type: str
    title: str
    content: str
    summary: str
    url: str
    author: Optional[str]
    published_at: datetime
    fetched_at: datetime

    # Sentiment
    sentiment_score: float = 0.0
    sentiment_label: str = "neutral"
    confidence: float = 0.0

    # Market impact
    market_impact_score: int = 0
    impact_category: str = "low"

    # Engagement
    social_shares: int = 0
    comments_count: int = 0
    engagement_score: float = 0.0

    def to_dict(self) -> Dict:
        """Convert to dictionary for JSON serialization."""
        d = asdict(self)
        d['published_at'] = self.published_at.isoformat()
        d['fetched_at'] = self.fetched_at.isoformat()
        return d


class NewsScraper:
    """Fetch news from multiple sources."""

    def __init__(self, sources_config: Dict):
        self.sources_config = sources_config
        self.session: Optional[aiohttp.ClientSession] = None

    async def __aenter__(self):
        self.session = aiohttp.ClientSession()
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        if self.session:
            await self.session.close()

    async def fetch_all_sources(self) -> List[NewsArticle]:
        """Fetch from all configured sources concurrently."""
        tasks = []

        for category, sources in self.sources_config['sources'].items():
            for source in sources:
                if source['type'] == 'rss':
                    tasks.append(self._fetch_rss(source))
                elif source['type'] == 'twitter_search':
                    tasks.append(self._fetch_twitter(source))
                elif source['type'] == 'reddit':
                    tasks.append(self._fetch_reddit(source))

        results = await asyncio.gather(*tasks, return_exceptions=True)

        articles = []
        for result in results:
            if isinstance(result, Exception):
                logger.error(f"Source fetch failed: {result}")
            elif result:
                articles.extend(result)

        return articles

    async def _fetch_rss(self, source: Dict) -> List[NewsArticle]:
        """Fetch articles from RSS feed."""
        articles = []

        try:
            async with self.session.get(source['url'], timeout=30) as response:
                if response.status != 200:
                    logger.error(f"RSS fetch failed for {source['name']}: {response.status}")
                    return articles

                content = await response.text()
                feed = feedparser.parse(content)

                for entry in feed.entries:
                    article = NewsArticle(
                        article_id=self._generate_article_id(entry.link),
                        source=source['name'],
                        source_type='rss',
                        title=entry.title,
                        content=entry.get('summary', ''),
                        summary=entry.get('summary', '')[:500],
                        url=entry.link,
                        author=entry.get('author'),
                        published_at=self._parse_published_date(entry),
                        fetched_at=datetime.utcnow()
                    )
                    articles.append(article)

        except Exception as e:
            logger.error(f"Error fetching RSS from {source['name']}: {e}")

        return articles

    async def _fetch_twitter(self, source: Dict) -> List[NewsArticle]:
        """Fetch tweets from Twitter API."""
        # Simplified - production would use full Twitter API v2 integration
        articles = []

        try:
            # Twitter API v2 search endpoint
            bearer_token = os.getenv('TWITTER_BEARER_TOKEN')
            headers = {'Authorization': f'Bearer {bearer_token}'}

            # Pass the query via params so it is URL-encoded, and request created_at explicitly
            params = {
                'query': ' OR '.join(source['keywords']),
                'max_results': '100',
                'tweet.fields': 'created_at'
            }
            url = "https://api.twitter.com/2/tweets/search/recent"

            async with self.session.get(url, headers=headers, params=params, timeout=30) as response:
                if response.status != 200:
                    logger.error(f"Twitter fetch failed: {response.status}")
                    return articles

                data = await response.json()

                for tweet in data.get('data', []):
                    article = NewsArticle(
                        article_id=self._generate_article_id(tweet['id']),
                        source=source['name'],
                        source_type='twitter',
                        title=tweet['text'][:100],
                        content=tweet['text'],
                        summary=tweet['text'][:200],
                        url=f"https://twitter.com/i/web/status/{tweet['id']}",
                        author=None,  # Would fetch from includes.users
                        published_at=datetime.fromisoformat(tweet['created_at'].replace('Z', '+00:00')),
                        fetched_at=datetime.utcnow()
                    )
                    articles.append(article)

        except Exception as e:
            logger.error(f"Error fetching Twitter: {e}")

        return articles

    async def _fetch_reddit(self, source: Dict) -> List[NewsArticle]:
        """Fetch posts from Reddit."""
        articles = []

        try:
            # Reddit API
            url = f"https://www.reddit.com/r/{source['subreddit']}/hot.json?limit=100"
            headers = {'User-Agent': 'CryptoNewsAggregator/1.0'}

            async with self.session.get(url, headers=headers, timeout=30) as response:
                if response.status != 200:
                    logger.error(f"Reddit fetch failed: {response.status}")
                    return articles

                data = await response.json()

                for post in data['data']['children']:
                    post_data = post['data']

                    if post_data['ups'] < source.get('min_upvotes', 0):
                        continue

                    article = NewsArticle(
                        article_id=self._generate_article_id(post_data['id']),
                        source=f"r/{source['subreddit']}",
                        source_type='reddit',
                        title=post_data['title'],
                        content=post_data.get('selftext', ''),
                        summary=post_data.get('selftext', '')[:500],
                        url=f"https://reddit.com{post_data['permalink']}",
                        author=post_data['author'],
                        published_at=datetime.utcfromtimestamp(post_data['created_utc']),
                        fetched_at=datetime.utcnow(),
                        social_shares=post_data['ups'],
                        comments_count=post_data['num_comments']
                    )
                    articles.append(article)

        except Exception as e:
            logger.error(f"Error fetching Reddit: {e}")

        return articles

    def _generate_article_id(self, identifier: str) -> str:
        """Generate unique article ID from URL or identifier."""
        return hashlib.md5(identifier.encode()).hexdigest()

    def _parse_published_date(self, entry) -> datetime:
        """Parse published date from feed entry."""
        if hasattr(entry, 'published_parsed') and entry.published_parsed:
            return datetime(*entry.published_parsed[:6])
        return datetime.utcnow()


class SentimentAnalyzer:
    """AI-powered sentiment analysis for crypto news."""

    def __init__(self, config: Dict):
        self.config = config
        self.client = AsyncAnthropic(api_key=os.getenv(config['api_key_env']))
        self.model = config['model']
        self.cache: Dict[str, Dict] = {}

    async def analyze_batch(self, articles: List[NewsArticle]) -> List[NewsArticle]:
        """Analyze sentiment for batch of articles."""
        analyzed_articles: List[NewsArticle] = []
        pending: List[NewsArticle] = []

        for article in articles:
            # Check cache first; cached articles are returned as-is, others are queued for analysis
            cached = self.cache.get(article.article_id)
            if cached and time.time() - cached['timestamp'] < self.config['cache_ttl_seconds']:
                article.sentiment_score = cached['sentiment_score']
                article.sentiment_label = cached['sentiment_label']
                article.confidence = cached['confidence']
                article.market_impact_score = cached.get('market_impact_score', article.market_impact_score)
                analyzed_articles.append(article)
            else:
                pending.append(article)

        tasks = [self._analyze_single(article) for article in pending]
        results = await asyncio.gather(*tasks, return_exceptions=True)

        # Pair results with the articles actually analyzed to avoid index mismatches
        for article, result in zip(pending, results):
            if isinstance(result, Exception):
                logger.error(f"Sentiment analysis failed: {result}")
                analyzed_articles.append(article)
            else:
                analyzed_articles.append(result)

        return analyzed_articles

    async def _analyze_single(self, article: NewsArticle) -> NewsArticle:
        """Analyze sentiment for single article using Claude."""
        prompt = f"""Analyze the sentiment of this cryptocurrency news article and its potential market impact.

Title: {article.title}

Content: {article.content[:2000]}

Provide your analysis in JSON format:
{{
  "sentiment": "very_bearish|bearish|neutral|bullish|very_bullish",
  "sentiment_score": <float between -1.0 and 1.0>,
  "confidence": <float between 0.0 and 1.0>,
  "market_impact_score": <integer 0-100>,
  "reasoning": "<brief explanation>",
  "entities": [
    {{"name": "<entity>", "type": "<type>", "sentiment": <score>}}
  ]
}}

Consider:
- Regulatory developments (high impact)
- Exchange listings/delistings (medium-high impact)
- Major partnerships (medium impact)
- Technical upgrades (medium impact)
- General market commentary (low impact)
"""

        try:
            response = await self.client.messages.create(
                model=self.model,
                max_tokens=1000,
                messages=[{"role": "user", "content": prompt}]
            )

            # Extract JSON from response
            content = response.content[0].text
            json_match = re.search(r'\{.*\}', content, re.DOTALL)
            if json_match:
                analysis = json.loads(json_match.group())

                article.sentiment_score = analysis['sentiment_score']
                article.sentiment_label = analysis['sentiment']
                article.confidence = analysis['confidence']
                article.market_impact_score = analysis['market_impact_score']

                # Cache result (including impact score so cached articles keep their impact)
                self.cache[article.article_id] = {
                    'sentiment_score': article.sentiment_score,
                    'sentiment_label': article.sentiment_label,
                    'confidence': article.confidence,
                    'market_impact_score': article.market_impact_score,
                    'timestamp': time.time()
                }

        except Exception as e:
            logger.error(f"Sentiment analysis failed for {article.article_id}: {e}")

        return article


class Deduplicator:
    """Detect and remove duplicate articles."""

    def __init__(self, redis_client: redis.Redis, similarity_threshold: float = 0.85):
        self.redis = redis_client
        self.similarity_threshold = similarity_threshold

    def deduplicate(self, articles: List[NewsArticle]) -> List[NewsArticle]:
        """Remove duplicate articles based on title similarity."""
        unique_articles = []
        seen_titles = set()

        for article in articles:
            # Check Redis cache for seen URL
            if self.redis.exists(f"article:{article.url}"):
                logger.debug(f"Duplicate URL detected: {article.url}")
                continue

            # Check title similarity
            is_duplicate = False
            normalized_title = self._normalize_title(article.title)

            for seen_title in seen_titles:
                similarity = self._calculate_similarity(normalized_title, seen_title)
                if similarity >= self.similarity_threshold:
                    is_duplicate = True
                    logger.debug(f"Duplicate title detected: {article.title} (similarity: {similarity:.2f})")
                    break

            if not is_duplicate:
                unique_articles.append(article)
                seen_titles.add(normalized_title)

                # Cache URL in Redis (24h TTL)
                self.redis.setex(f"article:{article.url}", 86400, "1")

        logger.info(f"Deduplicated {len(articles)} articles to {len(unique_articles)} unique")
        return unique_articles

    def _normalize_title(self, title: str) -> str:
        """Normalize title for comparison."""
        # Remove special characters, lowercase, remove extra spaces
        title = re.sub(r'[^\w\s]', '', title.lower())
        title = re.sub(r'\s+', ' ', title).strip()
        return title

    def _calculate_similarity(self, title1: str, title2: str) -> float:
        """Calculate Jaccard similarity between titles."""
        words1 = set(title1.split())
        words2 = set(title2.split())

        if not words1 or not words2:
            return 0.0

        intersection = words1.intersection(words2)
        union = words1.union(words2)

        return len(intersection) / len(union)


class CryptoNewsAggregator:
    """Main news aggregation orchestrator."""

    def __init__(self, sources_config_path: str, sentiment_config_path: str):
        with open(sources_config_path) as f:
            self.sources_config = json.load(f)

        with open(sentiment_config_path) as f:
            self.sentiment_config = json.load(f)

        self.redis_client = redis.Redis(host='localhost', port=6379, decode_responses=True)
        self.scraper = NewsScraper(self.sources_config)
        self.sentiment_analyzer = SentimentAnalyzer(self.sentiment_config)
        self.deduplicator = Deduplicator(self.redis_client)

        # Database connection
        self.db_conn = psycopg2.connect(
            host=os.getenv('POSTGRES_HOST', 'localhost'),
            database=os.getenv('POSTGRES_DB', 'crypto_news'),
            user=os.getenv('POSTGRES_USER', 'postgres'),
            password=os.getenv('POSTGRES_PASSWORD')
        )

    async def run_aggregation_cycle(self) -> Dict:
        """Run single aggregation cycle."""
        logger.info("Starting news aggregation cycle")

        # Step 1: Fetch from all sources
        async with self.scraper:
            raw_articles = await self.scraper.fetch_all_sources()

        logger.info(f"Fetched {len(raw_articles)} articles from all sources")

        # Step 2: Deduplicate
        unique_articles = self.deduplicator.deduplicate(raw_articles)

        # Step 3: Sentiment analysis
        analyzed_articles = await self.sentiment_analyzer.analyze_batch(unique_articles)

        # Step 4: Filter by minimum impact score
        min_impact = self.sentiment_config.get('min_impact_score', 40)
        high_impact_articles = [
            a for a in analyzed_articles
            if a.market_impact_score >= min_impact
        ]

        logger.info(f"Found {len(high_impact_articles)} high-impact articles (score >= {min_impact})")

        # Step 5: Store in database
        self._store_articles(analyzed_articles)

        # Step 6: Generate alerts for critical news
        critical_articles = [a for a in analyzed_articles if a.market_impact_score >= 80]
        if critical_articles:
            await self._send_alerts(critical_articles)

        return {
            'total_fetched': len(raw_articles),
            'unique_articles': len(unique_articles),
            'analyzed_articles': len(analyzed_articles),
            'high_impact': len(high_impact_articles),
            'critical_alerts': len(critical_articles)
        }

    def _store_articles(self, articles: List[NewsArticle]) -> None:
        """Store articles in PostgreSQL."""
        with self.db_conn.cursor() as cur:
            for article in articles:
                try:
                    cur.execute("""
                        INSERT INTO news_articles (
                            article_id, source, source_type, title, content, summary, url,
                            author, published_at, fetched_at, sentiment_score, sentiment_label,
                            confidence, market_impact_score, social_shares, comments_count
                        ) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
                        ON CONFLICT (article_id) DO UPDATE SET
                            sentiment_score = EXCLUDED.sentiment_score,
                            market_impact_score = EXCLUDED.market_impact_score
                    """, (
                        article.article_id, article.source, article.source_type,
                        article.title, article.content, article.summary, article.url,
                        article.author, article.published_at, article.fetched_at,
                        article.sentiment_score, article.sentiment_label, article.confidence,
                        article.market_impact_score, article.social_shares, article.comments_count
                    ))
                except Exception as e:
                    logger.error(f"Error storing article {article.article_id}: {e}")

        self.db_conn.commit()
        logger.info(f"Stored {len(articles)} articles in database")

    async def _send_alerts(self, articles: List[NewsArticle]) -> None:
        """Send alerts for critical news."""
        for article in articles:
            logger.warning(f"CRITICAL NEWS ALERT: {article.title} (impact: {article.market_impact_score})")
            # Implement webhook/Slack/Discord alerting here


async def main():
    """Main entry point."""
    import argparse

    parser = argparse.ArgumentParser(description='Aggregate cryptocurrency news with sentiment analysis')
    parser.add_argument('--sources-config', default='config/news_sources.json')
    parser.add_argument('--sentiment-config', default='config/sentiment_config.json')
    parser.add_argument('--interval', type=int, default=300, help='Fetch interval in seconds')

    args = parser.parse_args()

    aggregator = CryptoNewsAggregator(args.sources_config, args.sentiment_config)

    while True:
        try:
            stats = await aggregator.run_aggregation_cycle()
            logger.info(f"Cycle complete: {stats}")

            await asyncio.sleep(args.interval)

        except KeyboardInterrupt:
            logger.info("Shutting down...")
            break
        except Exception as e:
            logger.error(f"Error in aggregation cycle: {e}")
            await asyncio.sleep(60)


if __name__ == '__main__':
    asyncio.run(main())

Error Handling

| Error Type | Detection | Resolution | Prevention |
|---|---|---|---|
| RSS feed timeout | Connection timeout (30s) | Skip source, continue with others | Implement per-source timeout, retry mechanism |
| API rate limiting (Twitter) | HTTP 429 response | Exponential backoff (1min, 5min, 15min) | Track rate limits, stagger requests |
| Sentiment API failure | HTTP 5xx or timeout | Use fallback rule-based sentiment | Implement circuit breaker, local model fallback |
| Database connection lost | psycopg2.OperationalError | Reconnect with exponential backoff | Connection pooling, health checks |
| Duplicate article detection | Identical URL or 85%+ title similarity | Skip article, log duplicate | Redis caching with 24h TTL |
| Invalid RSS/JSON | Parsing exception | Log error, skip source | Validate feed structure before parsing |
| Memory overflow | RSS feed >100MB | Stream parsing, limit entries | Limit feed size, process in batches |
| Stale feed data | Last update >24h old | Alert admin, skip source temporarily | Monitor feed freshness, automated source health checks |
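
The rate-limiting row above calls for exponential backoff on HTTP 429. A minimal sketch of that retry pattern with aiohttp (the 1/5/15-minute delays mirror the table; wrap the fetch calls from Code Example 1 with it as needed):

import asyncio
import logging
import aiohttp

logger = logging.getLogger(__name__)

async def fetch_with_backoff(session: aiohttp.ClientSession, url: str, **kwargs) -> dict:
    """GET a JSON endpoint, backing off 1 min, 5 min, then 15 min on HTTP 429."""
    delays = [60, 300, 900]  # seconds, as in the error-handling table
    for delay in [0] + delays:
        if delay:
            logger.warning(f"Rate limited on {url}, retrying in {delay}s")
            await asyncio.sleep(delay)
        async with session.get(url, **kwargs) as response:
            if response.status == 429:
                continue
            response.raise_for_status()
            return await response.json()
    raise RuntimeError(f"Still rate limited after {len(delays)} retries: {url}")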

Configuration Options

# config/news_aggregator.yml
sources:
  fetch_interval_seconds: 300
  timeout_seconds: 30
  max_articles_per_source: 100

  credibility_weights:
    tier_1: 1.0    # CoinDesk, The Block, Bloomberg
    tier_2: 0.8    # CoinTelegraph, Decrypt
    tier_3: 0.6    # Social media, blogs

sentiment:
  provider: anthropic|openai|local
  batch_size: 10
  cache_ttl: 3600
  min_confidence: 0.5

deduplication:
  enabled: true
  similarity_threshold: 0.85
  cache_ttl_hours: 24

market_impact:
  min_alert_score: 80
  score_components:
    source_credibility: 0.30
    sentiment_strength: 0.25
    entity_relevance: 0.20
    social_engagement: 0.15
    breaking_news: 0.10

alerts:
  channels:
    - slack
    - discord
    - email
  min_impact_score: 80
  rate_limit_per_hour: 20

storage:
  database: postgresql
  retention_days: 180
  enable_elasticsearch: true

performance:
  max_concurrent_fetches: 20
  redis_cache: true
  async_processing: true
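
A small loader can catch misconfigured weights (the score components should sum to 1.0) before the aggregator starts. A sketch assuming PyYAML and the file layout above:

import math
import yaml  # PyYAML

def load_aggregator_config(path: str = "config/news_aggregator.yml") -> dict:
    """Load the YAML config and verify the market impact weights sum to 1.0."""
    with open(path) as f:
        config = yaml.safe_load(f)

    components = config["market_impact"]["score_components"]
    total = sum(components.values())
    if not math.isclose(total, 1.0, abs_tol=1e-6):
        raise ValueError(f"score_components must sum to 1.0, got {total}")

    return config

if __name__ == "__main__":
    cfg = load_aggregator_config()
    print(f"Alert threshold: {cfg['market_impact']['min_alert_score']}")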

Best Practices

DO:

  • Aggregate from 20+ sources minimum for comprehensive coverage
  • Implement deduplication (80% of news is duplicated across sources)
  • Use AI sentiment analysis (>80% accuracy with GPT-4/Claude)
  • Cache sentiment results (1h TTL) to reduce API costs
  • Monitor source reliability and adjust weights dynamically
  • Store historical data for backtesting sentiment signals
  • Implement rate limiting per source to avoid bans
  • Use Redis for real-time caching and deduplication
  • Alert only on high-impact news (score >80) to avoid fatigue
  • Track correlation between sentiment shifts and price movements

DON'T:

  • Rely on single news source - creates blind spots
  • Skip sentiment analysis - raw news requires manual interpretation
  • Process articles synchronously - too slow for real-time
  • Store unlimited history - database bloat (180 days max)
  • Alert on every article - causes notification fatigue
  • Ignore social media - often breaks news before mainstream
  • Trust sentiment without context - consider source credibility
  • Forget deduplication - wastes 80% of processing
  • Use outdated articles - mark stale news (>24h) clearly
  • Hardcode source URLs - sources change frequently

Performance Considerations

  • Fetch Latency: Async concurrent fetching - 50 sources in <30s
  • Sentiment Processing: Batch processing (10 articles) - ~5s per batch with Claude
  • Deduplication: Redis O(1) lookup - <1ms per article
  • Database Writes: Batch inserts (100 articles) - ~200ms
  • Memory Usage: ~4GB for 10K articles in memory + Redis cache
  • API Costs:
    • Claude Sentiment: $0.015/1K tokens = ~$0.002 per article
    • Twitter API: $100/month for 2M tweets
    • NewsAPI: Free tier sufficient for testing

Optimization Tips:

  1. Use Redis for hot cache (recent articles, sentiment results)
  2. Implement connection pooling for database (10-20 connections)
  3. Process sentiment in batches of 10 to maximize throughput
  4. Use Elasticsearch for full-text search instead of PostgreSQL LIKE queries
  5. Archive articles >90 days to cold storage (S3/Glacier)
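
Tips 2 and 3 above also apply to database writes. A minimal sketch of a batched insert with psycopg2's execute_batch, as an alternative to the per-row INSERT in Code Example 1 (the column subset and connection settings are illustrative):

import os
import psycopg2
from psycopg2.extras import execute_batch

def store_articles_batched(articles: list) -> None:
    """Insert articles in pages of 100 rows instead of one round-trip per row."""
    conn = psycopg2.connect(
        host=os.getenv("POSTGRES_HOST", "localhost"),
        dbname=os.getenv("POSTGRES_DB", "crypto_news"),
        user=os.getenv("POSTGRES_USER", "postgres"),
        password=os.getenv("POSTGRES_PASSWORD"),
    )
    sql = """
        INSERT INTO news_articles (article_id, source, title, url, published_at,
                                   sentiment_score, market_impact_score)
        VALUES (%s, %s, %s, %s, %s, %s, %s)
        ON CONFLICT (article_id) DO UPDATE SET
            sentiment_score = EXCLUDED.sentiment_score,
            market_impact_score = EXCLUDED.market_impact_score
    """
    rows = [
        (a.article_id, a.source, a.title, a.url, a.published_at,
         a.sentiment_score, a.market_impact_score)
        for a in articles
    ]
    with conn, conn.cursor() as cur:
        execute_batch(cur, sql, rows, page_size=100)
    conn.close()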

Security Considerations

  • API Key Management: Store in environment variables, rotate quarterly
  • Rate Limiting: Respect source rate limits to avoid bans
  • Input Validation: Sanitize all HTML/text before storage (XSS prevention)
  • URL Validation: Verify URLs before storage, block malicious domains
  • Database Access: Read-only user for analytics, write user for aggregator only
  • Content Filtering: Filter spam, advertisements, malicious links
  • Error Disclosure: Don't expose internal system details in logs
  • Webhook Security: Use HMAC signatures for alert webhooks (see the sketch after this list)
  • Compliance: GDPR-compliant data retention (180 days), user data anonymization
  • Monitoring Access: Require authentication for dashboards
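
A minimal sketch of the HMAC-signed alert webhook mentioned above, using Python's standard hmac module (the header name and the shared-secret environment variable are illustrative):

import hashlib
import hmac
import json
import os
import requests

def send_signed_alert(webhook_url: str, alert: dict) -> None:
    """POST an alert and sign the body so the receiver can verify its origin."""
    secret = os.environ["ALERT_WEBHOOK_SECRET"].encode()  # shared secret; variable name is an assumption
    body = json.dumps(alert, sort_keys=True).encode()
    signature = hmac.new(secret, body, hashlib.sha256).hexdigest()

    requests.post(
        webhook_url,
        data=body,
        headers={
            "Content-Type": "application/json",
            # Receiver recomputes the digest and compares with hmac.compare_digest
            "X-Signature-SHA256": signature,
        },
        timeout=10,
    )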

Related Commands

  • /analyze-sentiment - Deep sentiment analysis for specific topics
  • /monitor-whales - Track whale transactions correlated with news
  • /generate-signal - Trading signals incorporating news sentiment
  • /track-price - Price tracking with news overlay
  • /analyze-chain - On-chain metrics correlation with news events
  • /scan-movers - Market movers detection with news attribution

Version History

  • v1.0.0 (2025-10-11) - Initial release with 50+ sources, AI sentiment
  • v1.1.0 (planned) - ML-based trend prediction, narrative clustering
  • v1.2.0 (planned) - Real-time event extraction, entity relationship mapping
  • v2.0.0 (planned) - Predictive market impact modeling, automated trading integration