# Sequence Models Comparison: Choosing the Right Architecture for Sequential Data

<CRITICAL_CONTEXT>
Sequence modeling has evolved rapidly:

- 2014-2017: LSTM/GRU dominated
- 2017+: Transformers revolutionized the field
- 2018+: TCN emerged as an efficient alternative
- 2020+: Sparse Transformers for very long sequences
- 2021+: State Space Models (S4) for extreme lengths

Don't default to LSTM (outdated) or blindly use Transformers (not always appropriate).
Match the architecture to your sequence characteristics.
</CRITICAL_CONTEXT>

## When to Use This Skill

Use this skill when:

- ✅ Selecting a model for sequential/temporal data
- ✅ Comparing RNN vs LSTM vs Transformer
- ✅ Deciding on a sequence architecture for time series, text, audio
- ✅ Understanding modern alternatives to LSTM
- ✅ Optimizing for sequence length, speed, or accuracy

DO NOT use for:

- ❌ Vision tasks (use cnn-families-and-selection)
- ❌ Graph-structured data (use graph-neural-networks-basics)
- ❌ LLM-specific questions (use llm-specialist pack)

**When in doubt:** If data is sequential/temporal → this skill.

## Selection Framework

### Step 1: Identify Key Characteristics

**Before recommending, ask:**

| Characteristic | Question | Impact |
|----------------|----------|--------|
| **Sequence Length** | Typical length? | Short (< 100) → LSTM/CNN, Medium (100-1k) → Transformer, Long (> 1k) → Sparse Transformer/S4 |
| **Data Type** | Language, time series, audio? | Language → Transformer, Time series → TCN/Transformer, Audio → Specialized |
| **Data Volume** | Training examples? | Small (< 10k) → LSTM/TCN, Large (> 100k) → Transformer |
| **Latency** | Real-time needed? | Yes → TCN/LSTM, No → Transformer |
| **Deployment** | Cloud/edge/mobile? | Edge → TCN/LSTM, Cloud → Any |

### Step 2: Apply Decision Tree

```
START: What's your primary constraint?

┌─ SEQUENCE LENGTH
│  ├─ Short (< 100 steps)
│  │  ├─ Language → BiLSTM or small Transformer
│  │  └─ Time series → TCN or LSTM
│  │
│  ├─ Medium (100-1000 steps)
│  │  ├─ Language → Transformer (BERT-style)
│  │  └─ Time series → Transformer or TCN
│  │
│  ├─ Long (1000-10000 steps)
│  │  ├─ Sparse Transformer (Longformer, BigBird)
│  │  └─ Hierarchical models
│  │
│  └─ Very Long (> 10000 steps)
│     └─ State Space Models (S4)
│
├─ DATA TYPE
│  ├─ Natural Language
│  │  ├─ < 50k examples → BiLSTM or DistilBERT
│  │  └─ > 50k examples → Transformer (BERT, RoBERTa)
│  │
│  ├─ Time Series
│  │  ├─ Fast training → TCN
│  │  ├─ Long sequences → Transformer
│  │  └─ Multivariate → Transformer with cross-series attention
│  │
│  └─ Audio
│     ├─ Waveform → WaveNet (TCN-based)
│     └─ Spectrograms → CNN + Transformer
│
└─ COMPUTATIONAL CONSTRAINT
   ├─ Edge device → TCN or small LSTM
   ├─ Real-time latency → TCN (parallel inference)
   └─ Cloud, no constraint → Transformer
```
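
If it helps to make these rules explicit, the tree can be encoded as a small heuristic. This is a minimal sketch mirroring the diagram above; the thresholds and return strings are illustrative rules of thumb, not hard limits.

```python
def choose_architecture(seq_len, data_type, n_examples,
                        realtime=False, edge=False):
    """Illustrative heuristic mirroring the decision tree above."""
    if edge or realtime:
        return "TCN (or small LSTM)"
    if seq_len > 10_000:
        return "State Space Model (S4)"
    if seq_len > 1_000:
        return "Sparse Transformer (Longformer/BigBird)"
    if data_type == "language":
        return "Transformer (BERT-style)" if n_examples > 50_000 else "BiLSTM or DistilBERT"
    if data_type == "time_series":
        return "Transformer" if n_examples > 50_000 else "TCN"
    if data_type == "audio":
        return "WaveNet-style TCN or CNN + Transformer"
    return "TCN or LSTM baseline"

print(choose_architecture(seq_len=300, data_type="time_series", n_examples=20_000))
# → TCN
```
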

## Architecture Catalog

### 1. RNN (Recurrent Neural Networks) - Legacy Foundation

**Architecture:** Basic recurrent cell with hidden state

**Status:** **OUTDATED** - don't use for new projects

**Why it existed:**

- First neural approach to sequences
- Hidden state captures temporal information
- Theoretically can model any sequence

**Why it failed:**

- Vanishing gradient (can't learn long dependencies)
- Very slow training (sequential processing)
- Superseded in practice by LSTM, which became the standard for sequence tasks by ~2014

**When to mention:**

- Historical context only
- Teaching purposes
- Never recommend for production

**Key Insight:** Proved neural nets could handle sequences, but impractical due to vanishing gradients

### 2. LSTM (Long Short-Term Memory) - Legacy Standard

**Architecture:** Gated recurrent cell (forget, input, output gates)

**Complexity:** O(n) memory, sequential processing

**Strengths:**

- Solves vanishing gradient (gates maintain long-term info)
- Works well for short-medium sequences (< 500 steps)
- Small datasets (< 10k examples)
- Low memory footprint

**Weaknesses:**

- Sequential processing (slow training, can't parallelize)
- Still struggles with very long sequences (> 1000 steps)
- Slow inference (especially bidirectional)
- Superseded by Transformers for most language tasks

**When to Use:**

- ✅ Small datasets (< 10k examples)
- ✅ Short sequences (< 100 steps)
- ✅ Edge deployment (low memory)
- ✅ Baseline comparison

**When NOT to Use:**

- ❌ Large datasets (Transformer better)
- ❌ Long sequences (> 500 steps)
- ❌ Modern NLP (Transformer standard)
- ❌ Fast training needed (TCN better)

**Code Example:**

```python
import torch
import torch.nn as nn

class SeqLSTM(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size,
                            num_layers=2,
                            batch_first=True,
                            bidirectional=True)
        # Bidirectional: outputs from both directions are concatenated
        self.fc = nn.Linear(hidden_size * 2, num_classes)

    def forward(self, x):
        # x: (batch, seq_len, features)
        lstm_out, _ = self.lstm(x)
        # Use the last timestep's output for classification
        out = self.fc(lstm_out[:, -1, :])
        return out
```
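
A quick shape check, continuing the snippet above (sizes are arbitrary):

```python
model = SeqLSTM(input_size=16, hidden_size=64, num_classes=3)
x = torch.randn(8, 50, 16)   # batch of 8 sequences, 50 steps, 16 features
print(model(x).shape)        # torch.Size([8, 3])
```
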

**Status:** Legacy but still useful for specific cases (small data, edge deployment)

### 3. GRU (Gated Recurrent Unit) - Simplified LSTM

**Architecture:** Simplified gating (2 gates instead of 3)

**Advantages over LSTM:**

- Fewer parameters (faster training)
- Similar performance in many tasks
- Lower memory

**Disadvantages:**

- Still sequential (same as LSTM)
- No major advantage over LSTM in practice
- Also superseded by Transformers

**When to Use:**

- Same cases as LSTM, but prefer LSTM for slightly better performance on average
- Use GRU if the parameter/compute savings matter

**Status:** Rarely recommended - if using a recurrent model, prefer LSTM or move to Transformer/TCN
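
For completeness, a GRU drop-in looks almost identical to the LSTM classifier above. A minimal sketch; sizes and layer counts are illustrative:

```python
import torch
import torch.nn as nn

class SeqGRU(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super().__init__()
        # Same interface as SeqLSTM; GRU uses 3 weight matrices per layer vs 4 for LSTM
        self.gru = nn.GRU(input_size, hidden_size,
                          num_layers=2, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_size * 2, num_classes)

    def forward(self, x):                  # x: (batch, seq_len, features)
        gru_out, _ = self.gru(x)
        return self.fc(gru_out[:, -1, :])  # last timestep, both directions
```
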
### 4. Transformer - Modern Standard

**Architecture:** Self-attention mechanism, parallel processing

**Complexity:**

- Memory: O(n²) for sequence length n
- Compute: O(n²d) where d is the embedding dimension

**Strengths:**

- ✅ Parallel processing (fast training)
- ✅ Captures long-range dependencies (better than LSTM)
- ✅ State-of-the-art for language (BERT, GPT)
- ✅ Pre-trained models available
- ✅ Scales with data (more data = better performance)

**Weaknesses:**

- ❌ Quadratic memory (struggles with sequences > 1000)
- ❌ Needs more data than LSTM (> 10k examples)
- ❌ Slower inference than TCN
- ❌ Harder to interpret than RNN

**When to Use:**

- ✅ **Natural language** (current standard)
- ✅ Medium sequences (100-1000 tokens)
- ✅ Large datasets (> 50k examples)
- ✅ Pre-training available (BERT, GPT)
- ✅ Accuracy priority

**When NOT to Use:**

- ❌ Short sequences (< 50 tokens) - LSTM/CNN competitive, simpler
- ❌ Very long sequences (> 2000) - quadratic memory explodes
- ❌ Small datasets (< 10k) - will overfit
- ❌ Edge deployment - large model size

**Memory Analysis:**

```python
# Standard Transformer attention (illustrative shapes, single head)

# For sequence length n=1000, batch_size=32, embedding_dim=512:
#   attention = softmax(Q @ K.transpose(-2, -1) / sqrt(d))   # shape: (32, 1000, 1000)
#   Memory: 32 * 1000 * 1000 * 4 bytes = 128 MB just for the attention weights!

# For n=5000:
#   Memory: 32 * 5000 * 5000 * 4 bytes = 3.2 GB per batch (per layer, single head)!
#   → Impractical on most GPUs
```
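
To generalize that arithmetic, here is a small helper. It is a rough estimate of the attention-weight matrices only (fp32, no activations or parameters), so treat the numbers as lower bounds:

```python
def attention_matrix_bytes(batch, seq_len, num_heads=1, num_layers=1, bytes_per_el=4):
    """Memory held by (batch, heads, n, n) attention weights across layers."""
    return batch * num_heads * num_layers * seq_len**2 * bytes_per_el

print(attention_matrix_bytes(32, 1000) / 1e6, "MB")   # 128.0 MB
print(attention_matrix_bytes(32, 5000) / 1e9, "GB")   # 3.2 GB
```
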

**Code Example:**

```python
import torch.nn as nn
from transformers import BertModel, BertTokenizer

# Pre-trained BERT for text classification
class TransformerClassifier(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        self.classifier = nn.Linear(768, num_classes)   # 768 = BERT-base hidden size

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask)
        # Use the pooled [CLS] token representation
        pooled = outputs.pooler_output
        return self.classifier(pooled)

# Fine-tuning setup
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TransformerClassifier(num_classes=2)
```
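
Typical use is to tokenize a batch and run a forward pass; the sketch below is illustrative only (training loop, optimizer, and labels omitted, `max_length` chosen arbitrarily):

```python
batch = tokenizer(["great product", "terrible service"],
                  padding=True, truncation=True, max_length=128,
                  return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"])   # shape: (2, 2)
```
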

**Status:** **Current standard for NLP**, competitive for time series with large data

### 5. TCN (Temporal Convolutional Network) - Underrated Alternative

**Architecture:** 1D convolutions with dilated causal convolutions

**Complexity:** O(n) memory, fully parallel processing

**Strengths:**

- ✅ **Parallel training** (much faster than LSTM)
- ✅ **Parallel inference** (faster than LSTM/Transformer)
- ✅ Linear memory (no quadratic blow-up)
- ✅ Large receptive field (dilation)
- ✅ Works well for time series
- ✅ Simple architecture

**Weaknesses:**

- ❌ Less popular (fewer pre-trained models)
- ❌ Not standard for language (Transformer dominates)
- ❌ Fixed receptive field (vs adaptive attention)

**When to Use:**

- ✅ **Time series forecasting** (often BETTER than LSTM)
- ✅ **Fast training needed** (2-3x faster than LSTM)
- ✅ **Fast inference** (real-time applications)
- ✅ Long sequences (linear memory)
- ✅ Audio processing (WaveNet is TCN-based)

**When NOT to Use:**

- ❌ Natural language with pre-training available (use Transformer)
- ❌ Need very large receptive field (Transformer better)

**Performance Comparison:**

```
Time series forecasting (1000-step sequences):

Training time (relative, lower is better):
- LSTM:        100% (baseline, sequential)
- TCN:          35% (≈2.8x faster, parallel)
- Transformer:  45% (≈2.2x faster)

Inference time (relative, lower is better):
- LSTM:        100% (sequential)
- TCN:          20% (≈5x faster, parallel)
- Transformer:  60% (≈1.7x faster)

Accuracy (similar across all three):
- LSTM:        baseline
- TCN:         equal or slightly better
- Transformer: equal or slightly better (needs more data)

Conclusion: TCN wins on speed, matches accuracy
```

**Code Example:**

```python
import torch
import torch.nn as nn

class Chomp1d(nn.Module):
    """Trim the extra right-side padding so the convolution stays causal."""
    def __init__(self, chomp_size):
        super().__init__()
        self.chomp_size = chomp_size

    def forward(self, x):
        return x[:, :, :-self.chomp_size]

class TCN(nn.Module):
    def __init__(self, input_channels, num_channels, kernel_size=3):
        super().__init__()

        layers = []
        num_levels = len(num_channels)

        for i in range(num_levels):
            dilation_size = 2 ** i
            in_channels = input_channels if i == 0 else num_channels[i-1]
            out_channels = num_channels[i]
            padding = (kernel_size - 1) * dilation_size

            # Causal dilated convolution: pad, then chomp the trailing padding
            layers.append(
                nn.Conv1d(in_channels, out_channels, kernel_size,
                          padding=padding,
                          dilation=dilation_size)
            )
            layers.append(Chomp1d(padding))
            layers.append(nn.ReLU())
            layers.append(nn.Dropout(0.2))

        self.network = nn.Sequential(*layers)

    def forward(self, x):
        # x: (batch, channels, sequence_length)
        return self.network(x)

# Usage for time series
model = TCN(input_channels=1, num_channels=[64, 128, 256])
```

**Key Insight:** Dilated convolutions grow the receptive field exponentially with depth (≈2^k for k layers) while keeping memory linear in sequence length
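
A quick check of that claim, as a back-of-the-envelope helper (it assumes one causal convolution per level with dilation 2^i, as in the sketch above):

```python
def tcn_receptive_field(num_levels, kernel_size=3):
    # Each level i adds (kernel_size - 1) * 2**i steps of history
    return 1 + sum((kernel_size - 1) * 2**i for i in range(num_levels))

print(tcn_receptive_field(num_levels=3))    # 15 steps for the 3-level model above
print(tcn_receptive_field(num_levels=10))   # 2047 steps with just 10 levels
```
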

**Status:** **Excellent for time series**, underrated, should be considered before LSTM

### 6. Sparse Transformers - Long Sequence Specialists

**Architecture:** Modified attention patterns to reduce complexity

**Variants:**

- **Longformer**: Local + global attention
- **BigBird**: Random + local + global attention
- **Linformer**: Low-rank projection of keys/values
- **Performer**: Kernel approximation of attention

**Complexity:** O(n log n) or O(n) depending on variant

**When to Use:**

- ✅ **Long sequences** (1000-10000 tokens)
- ✅ Document processing (multi-page documents)
- ✅ Long-context language modeling
- ✅ When standard Transformer runs out of memory

**Trade-offs:**

- Slightly lower accuracy than full attention (approximation)
- More complex implementation
- Fewer pre-trained models

**Example Use Cases:**

- Legal document analysis (10k+ tokens)
- Scientific paper understanding
- Long-form text generation
- Time series with thousands of steps

**Status:** Specialized for long sequences, active research area
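
In the Hugging Face ecosystem, using a sparse-attention model looks much like the BERT example earlier. A sketch assuming the public `allenai/longformer-base-4096` checkpoint; the choice of which tokens get global attention is task-dependent:

```python
import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

text = "long document text " * 800   # a multi-page document
inputs = tokenizer(text, truncation=True, max_length=4096, return_tensors="pt")

# Local (windowed) attention everywhere; mark the first token as globally attending
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

outputs = model(**inputs, global_attention_mask=global_attention_mask)
doc_embedding = outputs.last_hidden_state[:, 0]   # (1, 768)
```
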
### 7. State Space Models (S4) - Cutting Edge

**Architecture:** Structured state space with efficient recurrence

**Complexity:** O(n log n) training, O(n) inference
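
The core idea is a learned linear state space unrolled over the sequence: x_{k+1} = A·x_k + B·u_k, y_k = C·x_k + D·u_k. The naive recurrence below is purely a conceptual sketch with toy matrices; real S4 uses structured (HiPPO-initialized) matrices and an FFT-based convolution to reach the complexities above:

```python
import torch

def ssm_scan(A, B, C, D, u):
    """Naive discrete-time state-space recurrence (O(n) sequential scan)."""
    x = torch.zeros(A.shape[0])
    ys = []
    for u_k in u:
        y_k = C @ x + D * u_k     # output from current state
        x = A @ x + B * u_k       # state update
        ys.append(y_k)
    return torch.stack(ys)

# Toy example: 4-dimensional state, scalar input/output, 1000 steps
A = 0.9 * torch.eye(4)
B = torch.ones(4)
C = torch.ones(4) / 4
D = torch.tensor(1.0)
y = ssm_scan(A, B, C, D, torch.randn(1000))   # shape: (1000,)
```
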

**Strengths:**

- ✅ **Very long sequences** (10k-100k steps)
- ✅ Linear inference complexity
- ✅ Strong theoretical foundations
- ✅ Handles continuous-time sequences

**Weaknesses:**

- ❌ Newer (less mature ecosystem)
- ❌ Complex mathematics
- ❌ Fewer pre-trained models
- ❌ Harder to implement

**When to Use:**

- ✅ Extremely long sequences (> 10k steps)
- ✅ Audio (raw waveforms, 16kHz sampling)
- ✅ Medical signals (ECG, EEG)
- ✅ Research applications

**Status:** **Cutting edge** (2022+), promising for very long sequences

## Practical Selection Guide

### Scenario 1: Natural Language Processing

**Short text (< 50 tokens, e.g., tweets, titles):**

```
Small dataset (< 10k):
→ BiLSTM or 1D CNN (simple, effective)

Large dataset (> 10k):
→ DistilBERT (smaller Transformer, 66M params)
→ Or BiLSTM if latency critical
```

**Medium text (50-512 tokens, e.g., reviews, articles):**

```
Standard approach:
→ BERT, RoBERTa, or similar (~110M params)
→ Fine-tune on task-specific data

Small dataset:
→ DistilBERT (66M params, faster, similar accuracy)
```

**Long documents (> 512 tokens):**

```
→ Longformer (4096 tokens max)
→ BigBird (4096 tokens max)
→ Hierarchical: Process in chunks, aggregate
```
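
The hierarchical option can be as simple as encoding fixed-size chunks and pooling their embeddings. A sketch assuming a BERT-like `tokenizer`/`encoder` pair (as in the earlier example); the chunk size and mean-pooling are illustrative choices:

```python
import torch

def encode_long_document(text, tokenizer, encoder, chunk_tokens=510):
    """Split a long document into chunks, encode each, and mean-pool."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    chunk_embeddings = []
    for start in range(0, len(ids), chunk_tokens):
        chunk = tokenizer.decode(ids[start:start + chunk_tokens])
        batch = tokenizer(chunk, truncation=True, max_length=512,
                          return_tensors="pt")
        with torch.no_grad():
            out = encoder(**batch)
        chunk_embeddings.append(out.pooler_output)    # (1, hidden)
    return torch.cat(chunk_embeddings).mean(dim=0)    # (hidden,)
```
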

### Scenario 2: Time Series Forecasting

**Short sequences (< 100 steps):**

```
Fast training:
→ TCN (2-3x faster than LSTM)

Small dataset:
→ LSTM or simple models (ARIMA, Prophet)

Baseline:
→ LSTM (well-tested)
```

**Medium sequences (100-1000 steps):**

```
Best accuracy:
→ Transformer (if data > 50k examples)

Fast training/inference:
→ TCN (parallel processing)

Multivariate:
→ Transformer with cross-series attention
```

**Long sequences (> 1000 steps):**

```
→ Sparse Transformer (Informer for time series)
→ Hierarchical models (chunk + aggregate)
→ State Space Models (S4)
```

### Scenario 3: Audio Processing

**Waveform (raw audio, 16kHz):**

```
→ WaveNet (TCN-based)
→ State Space Models (S4)
```

**Spectrograms (mel-spectrograms):**

```
→ CNN + BiLSTM (traditional)
→ CNN + Transformer (modern)
```

**Speech recognition:**

```
→ Transformer (Wav2Vec 2.0, Whisper)
→ Pre-trained models available
```

## Trade-Off Analysis

### Speed Comparison

**Training time (relative, lower is better), 1000-step sequences:**

```
LSTM:        100% (baseline, sequential)
GRU:          75% (simpler gates)
TCN:          35% (≈2.8x faster, parallel)
Transformer:  45% (≈2.2x faster, parallel)

Conclusion: TCN fastest for training
```

**Inference time (relative, lower is better):**

```
LSTM:        100% (sequential)
BiLSTM:      200% (2x passes)
TCN:          20% (≈5x faster, parallel)
Transformer:  60% (faster, but attention overhead)

Conclusion: TCN fastest for inference
```
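
These percentages are indicative; on your own hardware a quick micro-benchmark settles it. A rough sketch with untrained models, arbitrary sizes, and a non-causal convolution stack standing in for a TCN:

```python
import time
import torch
import torch.nn as nn

x_lstm = torch.randn(32, 1000, 64)            # (batch, seq_len, features)
x_conv = x_lstm.transpose(1, 2)               # (batch, channels, seq_len)

lstm = nn.LSTM(64, 128, num_layers=2, batch_first=True)
tcn = nn.Sequential(*[nn.Conv1d(64 if i == 0 else 128, 128, 3,
                                padding=2**i, dilation=2**i)
                      for i in range(4)])

def bench(fn, n=10):
    with torch.no_grad():
        fn()                                  # warm-up
        start = time.perf_counter()
        for _ in range(n):
            fn()
        return (time.perf_counter() - start) / n

print("LSTM forward:", bench(lambda: lstm(x_lstm)))
print("TCN  forward:", bench(lambda: tcn(x_conv)))
```
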

### Memory Comparison

**Sequence length n=1000, batch=32:**

```
LSTM:                ~500 MB (linear in n)
Transformer:         ~2 GB (quadratic in n)
TCN:                 ~400 MB (linear in n)
Sparse Transformer:  ~800 MB (n log n)

For n=5000:
LSTM:                ~2 GB
Transformer:         OUT OF MEMORY (~50 GB needed!)
TCN:                 ~2 GB
Sparse Transformer:  ~4 GB
```

### Accuracy vs Data Size

**Small dataset (< 10k examples):**

```
LSTM:        ★★★★☆ (works well with little data)
Transformer: ★★☆☆☆ (overfits, needs more data)
TCN:         ★★★★☆ (similar to LSTM)

Winner: LSTM or TCN
```

**Large dataset (> 100k examples):**

```
LSTM:        ★★★☆☆ (good but plateaus)
Transformer: ★★★★★ (best, scales with data)
TCN:         ★★★★☆ (competitive)

Winner: Transformer
```

## Common Pitfalls

### Pitfall 1: Using LSTM in 2025 Without Considering Modern Alternatives

**Symptom:** Defaulting to LSTM for all sequence tasks

**Why it's wrong:** Transformers (language) and TCN (time series) are usually better choices

**Fix:** Consider Transformer for language, TCN for time series; reserve LSTM for small-data or edge cases

### Pitfall 2: Using Standard Transformer for Very Long Sequences

**Symptom:** Running out of memory on sequences > 1000 tokens

**Why it's wrong:** O(n²) attention memory explodes

**Fix:** Use a Sparse Transformer (Longformer, BigBird) or a hierarchical approach

### Pitfall 3: Not Trying TCN for Time Series

**Symptom:** Struggling with slow LSTM training

**Why it's wrong:** TCN trains 2-3x faster and is often at least as accurate

**Fix:** Try TCN before optimizing LSTM

### Pitfall 4: Using Transformer for Small Datasets

**Symptom:** Transformer overfits on < 10k examples

**Why it's wrong:** Transformers need large datasets (or pre-training) to work well

**Fix:** Use LSTM or TCN for small datasets, or fine-tune a pre-trained Transformer

### Pitfall 5: Ignoring Sequence Length Constraints

**Symptom:** Choosing an architecture without considering typical sequence length

**Why it's wrong:** Architecture effectiveness varies dramatically with length

**Fix:** Match architecture to sequence length (short → LSTM/CNN, long → Sparse Transformer)

## Evolution Timeline

**Understanding why architectures evolved:**

```
1980s-2013: Basic RNN
→ Vanishing gradient problem
→ Can't learn long dependencies

1997: LSTM (Hochreiter & Schmidhuber)
→ Gates solve vanishing gradient
→ Became the standard for sequences by ~2014 (seq2seq era)

2014: GRU
→ Simplified LSTM
→ Similar performance, fewer parameters

2017: Transformer (Attention Is All You Need)
→ Self-attention replaces recurrence
→ Parallel processing (fast training)
→ Revolutionized NLP

2018: TCN (Temporal Convolutional Networks)
→ Dilated convolutions for sequences
→ Often better than LSTM for time series
→ Underrated alternative

2019-2020: Sparse Transformers
→ Reduce quadratic complexity
→ Enable longer sequences

2021: State Space Models (S4)
→ Very long sequences (10k-100k)
→ Strong theoretical foundations
→ Cutting-edge research

Current (2025):
- NLP: Transformer standard (BERT, GPT)
- Time Series: TCN or Transformer
- Audio: Specialized (WaveNet, Transformer)
- Edge: LSTM or TCN (low memory)
```

## Decision Checklist

Before choosing a sequence model:

```
☐ Sequence length? (< 100 / 100-1k / > 1k)
☐ Data type? (language / time series / audio / other)
☐ Dataset size? (< 10k / 10k-100k / > 100k)
☐ Latency requirement? (real-time / batch / offline)
☐ Deployment target? (cloud / edge / mobile)
☐ Pre-trained models available? (yes / no)
☐ Training speed critical? (yes / no)

Based on answers:
→ Language + large data → Transformer
→ Language + small data → BiLSTM or DistilBERT
→ Time series + speed → TCN
→ Time series + accuracy + large data → Transformer
→ Very long sequences → Sparse Transformer or S4
→ Edge deployment → TCN or LSTM
→ Real-time latency → TCN
```

## Integration with Other Skills

**For language-specific questions:**
→ `yzmir/llm-specialist/using-llm-specialist`
- LLM-specific Transformers (GPT, BERT variants)
- Fine-tuning strategies
- Prompt engineering

**For Transformer internals:**
→ `yzmir/neural-architectures/transformer-architecture-deepdive`
- Attention mechanisms
- Positional encoding
- Transformer variants

**After selecting architecture:**
→ `yzmir/training-optimization/using-training-optimization`
- Optimizer selection
- Learning rate schedules
- Handling sequence-specific training issues

## Summary

**Quick Reference Table:**

| Use Case | Best Choice | Alternative | Avoid |
|----------|-------------|-------------|-------|
| Short text (< 50 tokens) | BiLSTM, DistilBERT | 1D CNN | Full BERT (overkill) |
| Long text (> 512 tokens) | Longformer, BigBird | Hierarchical | Standard BERT (memory) |
| Time series (< 1k steps) | TCN, Transformer | LSTM | Basic RNN |
| Time series (> 1k steps) | Sparse Transformer, S4 | Hierarchical | Standard Transformer |
| Small dataset (< 10k) | LSTM, TCN | Simple models | Transformer (overfits) |
| Large dataset (> 100k) | Transformer | TCN | LSTM (plateaus) |
| Edge deployment | TCN, LSTM | Quantized Transformer | Large Transformer |
| Real-time inference | TCN | Small LSTM | BiLSTM, Transformer |

**Key Principles:**

1. **Don't default to LSTM** (outdated for most tasks)
2. **Transformer for language** (current standard, if data sufficient)
3. **TCN for time series** (fast, effective, underrated)
4. **Match to sequence length** (short → LSTM/CNN, long → Sparse Transformer)
5. **Consider modern alternatives** (don't stop at LSTM vs Transformer)

**END OF SKILL**