Sequence Models Comparison: Choosing the Right Architecture for Sequential Data
<CRITICAL_CONTEXT> Sequence modeling has evolved rapidly:
- 2014-2017: LSTM/GRU dominated
- 2017+: Transformers revolutionized the field
- 2018+: TCN emerged as efficient alternative
- 2020+: Sparse Transformers for very long sequences
- 2021+: State Space Models (S4) for extreme lengths
Don't default to LSTM (outdated) or blindly use Transformers (not always appropriate). Match architecture to your sequence characteristics. </CRITICAL_CONTEXT>
When to Use This Skill
Use this skill when:
- ✅ Selecting model for sequential/temporal data
- ✅ Comparing RNN vs LSTM vs Transformer
- ✅ Deciding on sequence architecture for time series, text, audio
- ✅ Understanding modern alternatives to LSTM
- ✅ Optimizing for sequence length, speed, or accuracy
DO NOT use for:
- ❌ Vision tasks (use cnn-families-and-selection)
- ❌ Graph-structured data (use graph-neural-networks-basics)
- ❌ LLM-specific questions (use llm-specialist pack)
When in doubt: If data is sequential/temporal → this skill.
Selection Framework
Step 1: Identify Key Characteristics
Before recommending, ask:
| Characteristic | Question | Impact |
|---|---|---|
| Sequence Length | Typical length? | Short (< 100) → LSTM/CNN, Medium (100-1k) → Transformer, Long (> 1k) → Sparse Transformer/S4 |
| Data Type | Language, time series, audio? | Language → Transformer, Time series → TCN/Transformer, Audio → Specialized |
| Data Volume | Training examples? | Small (< 10k) → LSTM/TCN, Large (> 100k) → Transformer |
| Latency | Real-time needed? | Yes → TCN/LSTM, No → Transformer |
| Deployment | Cloud/edge/mobile? | Edge → TCN/LSTM, Cloud → Any |
Step 2: Apply Decision Tree
START: What's your primary constraint?
┌─ SEQUENCE LENGTH
│ ├─ Short (< 100 steps)
│ │ ├─ Language → BiLSTM or small Transformer
│ │ └─ Time series → TCN or LSTM
│ │
│ ├─ Medium (100-1000 steps)
│ │ ├─ Language → Transformer (BERT-style)
│ │ └─ Time series → Transformer or TCN
│ │
│ ├─ Long (1000-10000 steps)
│ │ ├─ Sparse Transformer (Longformer, BigBird)
│ │ └─ Hierarchical models
│ │
│ └─ Very Long (> 10000 steps)
│ └─ State Space Models (S4)
│
├─ DATA TYPE
│ ├─ Natural Language
│ │ ├─ < 50k data → BiLSTM or DistilBERT
│ │ └─ > 50k data → Transformer (BERT, RoBERTa)
│ │
│ ├─ Time Series
│ │ ├─ Fast training → TCN
│ │ ├─ Long sequences → Transformer
│ │ └─ Multivariate → Transformer with cross-series attention
│ │
│ └─ Audio
│ ├─ Waveform → WaveNet (TCN-based)
│ └─ Spectrograms → CNN + Transformer
│
└─ COMPUTATIONAL CONSTRAINT
├─ Edge device → TCN or small LSTM
├─ Real-time latency → TCN (parallel inference)
└─ Cloud, no constraint → Transformer
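The tree above can be codified as a quick triage helper. A minimal sketch; the function name and thresholds are heuristics lifted from this guide, not hard rules:

# Minimal triage helper mirroring the decision tree above.
# Thresholds are heuristics from this guide, not hard rules.
def recommend_architecture(seq_len: int, data_type: str,
                           n_examples: int, edge: bool = False) -> str:
    if edge:
        return "TCN or small LSTM (low memory, fast inference)"
    if seq_len > 10_000:
        return "State Space Model (S4)"
    if seq_len > 1_000:
        return "Sparse Transformer (Longformer/BigBird) or hierarchical model"
    if data_type == "language":
        if n_examples < 50_000:
            return "BiLSTM or DistilBERT"
        return "Transformer (BERT/RoBERTa)"
    if data_type == "time_series":
        if n_examples < 10_000:
            return "TCN or LSTM"
        return "TCN (speed) or Transformer (accuracy, large data)"
    if data_type == "audio":
        return "WaveNet-style TCN (waveform) or CNN + Transformer (spectrogram)"
    return "Start with TCN or Transformer and benchmark"

# Example: 300-step multivariate time series, 80k training windows
print(recommend_architecture(300, "time_series", 80_000))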
Architecture Catalog
1. RNN (Recurrent Neural Networks) - Legacy Foundation
Architecture: Basic recurrent cell with hidden state
Status: OUTDATED - don't use for new projects
Why it existed:
- First neural approach to sequences
- Hidden state captures temporal information
- Theoretically can model any sequence
Why it failed:
- Vanishing gradient (can't learn long dependencies)
- Very slow training (sequential processing)
- Superseded in practice by LSTM (introduced in 1997, dominant by ~2014)
When to mention:
- Historical context only
- Teaching purposes
- Never recommend for production
Key Insight: Proved neural nets could handle sequences, but impractical due to vanishing gradients
2. LSTM (Long Short-Term Memory) - Legacy Standard
Architecture: Gated recurrent cell (forget, input, output gates)
Complexity: O(n) memory, sequential processing
Strengths:
- Solves vanishing gradient (gates maintain long-term info)
- Works well for short-medium sequences (< 500 steps)
- Small datasets (< 10k examples)
- Low memory footprint
Weaknesses:
- Sequential processing (slow training, can't parallelize)
- Still struggles with very long sequences (> 1000 steps)
- Slow inference (especially bidirectional)
- Superseded by Transformers for most language tasks
When to Use:
- ✅ Small datasets (< 10k examples)
- ✅ Short sequences (< 100 steps)
- ✅ Edge deployment (low memory)
- ✅ Baseline comparison
When NOT to Use:
- ❌ Large datasets (Transformer better)
- ❌ Long sequences (> 500 steps)
- ❌ Modern NLP (Transformer standard)
- ❌ Fast training needed (TCN better)
Code Example:
import torch
import torch.nn as nn

class SeqLSTM(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size,
                            num_layers=2,
                            batch_first=True,
                            bidirectional=True)
        self.fc = nn.Linear(hidden_size * 2, num_classes)

    def forward(self, x):
        # x: (batch, seq_len, features)
        _, (h_n, _) = self.lstm(x)
        # h_n: (num_layers * 2, batch, hidden_size)
        # Concatenate the final forward and backward hidden states of the
        # top layer (more reliable than lstm_out[:, -1, :], whose backward
        # half has only seen the last timestep).
        out = torch.cat([h_n[-2], h_n[-1]], dim=1)
        return self.fc(out)
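Usage (sizes below are hypothetical, for a univariate two-class task):

# Hypothetical sizes: 1 input feature, 128 hidden units, 2 classes
model = SeqLSTM(input_size=1, hidden_size=128, num_classes=2)
logits = model(torch.randn(32, 80, 1))  # (batch=32, seq_len=80, features=1)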
Status: Legacy but still useful for specific cases (small data, edge deployment)
3. GRU (Gated Recurrent Unit) - Simplified LSTM
Architecture: Simplified gating (2 gates instead of 3)
Advantages over LSTM:
- Fewer parameters (faster training)
- Similar performance in many tasks
- Lower memory
Disadvantages:
- Still sequential (same as LSTM)
- No major advantage over LSTM in practice
- Also superseded by Transformers
When to Use:
- Same situations as LSTM; in practice LSTM often performs marginally better
- Choose GRU when the parameter and compute savings matter
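Code Example (a minimal sketch; GRU is essentially a drop-in replacement for the SeqLSTM above, and the class name SeqGRU is illustrative):

import torch
import torch.nn as nn

# nn.GRU returns (output, h_n) rather than (output, (h_n, c_n))
class SeqGRU(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super().__init__()
        self.gru = nn.GRU(input_size, hidden_size, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_size * 2, num_classes)

    def forward(self, x):
        _, h_n = self.gru(x)
        # Final forward and backward hidden states of the top layer
        return self.fc(torch.cat([h_n[-2], h_n[-1]], dim=1))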
Status: Rarely recommended - if using recurrent, prefer LSTM or move to Transformer/TCN
4. Transformer - Modern Standard
Architecture: Self-attention mechanism, parallel processing
Complexity:
- Memory: O(n²) for sequence length n
- Compute: O(n²d) where d is embedding dimension
Strengths:
- ✅ Parallel processing (fast training)
- ✅ Captures long-range dependencies (better than LSTM)
- ✅ State-of-the-art for language (BERT, GPT)
- ✅ Pre-trained models available
- ✅ Scales with data (more data = better performance)
Weaknesses:
- ❌ Quadratic memory (struggles with sequences > 1000)
- ❌ Needs more data than LSTM (> 10k examples)
- ❌ Slower inference than TCN
- ❌ Harder to interpret than RNN
When to Use:
- ✅ Natural language (current standard)
- ✅ Medium sequences (100-1000 tokens)
- ✅ Large datasets (> 50k examples)
- ✅ Pre-training available (BERT, GPT)
- ✅ Accuracy priority
When NOT to Use:
- ❌ Short sequences (< 50 tokens) - LSTM/CNN competitive, simpler
- ❌ Very long sequences (> 2000) - quadratic memory explodes
- ❌ Small datasets (< 10k) - will overfit
- ❌ Edge deployment - large model size
Memory Analysis:
# Standard Transformer attention (fp32, single head, single layer)
# For sequence length n=1000, batch_size=32, embedding_dim=512:
attention_weights = softmax(Q @ K.T / sqrt(d))  # Shape: (32, 1000, 1000)
# Memory: 32 * 1000 * 1000 * 4 bytes = 128 MB for one attention matrix alone
# For n=5000:
# Memory: 32 * 5000 * 5000 * 4 bytes = 3.2 GB per attention matrix
# → multiplied across heads and layers, impossible on most GPUs
Code Example:
import torch.nn as nn
from transformers import BertModel, BertTokenizer

# Pre-trained BERT for text classification
class TransformerClassifier(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        self.classifier = nn.Linear(768, num_classes)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask)
        # Use the pooled [CLS] token representation
        pooled = outputs.pooler_output
        return self.classifier(pooled)

# Fine-tuning setup
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TransformerClassifier(num_classes=2)
Status: Current standard for NLP, competitive for time series with large data
5. TCN (Temporal Convolutional Network) - Underrated Alternative
Architecture: 1D convolutions with dilated causal convolutions
Complexity: O(n) memory, fully parallel processing
Strengths:
- ✅ Parallel training (much faster than LSTM)
- ✅ Parallel inference (faster than LSTM/Transformer)
- ✅ Linear memory (no quadratic blow-up)
- ✅ Large receptive field (dilation)
- ✅ Works well for time series
- ✅ Simple architecture
Weaknesses:
- ❌ Less popular (fewer pre-trained models)
- ❌ Not standard for language (Transformer dominates)
- ❌ Fixed receptive field (vs adaptive attention)
When to Use:
- ✅ Time series forecasting (often BETTER than LSTM)
- ✅ Fast training needed (2-3x faster than LSTM)
- ✅ Fast inference (real-time applications)
- ✅ Long sequences (linear memory)
- ✅ Audio processing (WaveNet is TCN-based)
When NOT to Use:
- ❌ Natural language with pre-training available (use Transformer)
- ❌ Need very large receptive field (Transformer better)
Performance Comparison:
Time series forecasting, 1000-step sequences (numbers are wall-clock time relative to LSTM; lower is faster):
Training speed:
- LSTM: 100% (baseline, sequential)
- TCN: 35% (2.8x faster, parallel)
- Transformer: 45% (2.2x faster)
Inference speed:
- LSTM: 100% (sequential)
- TCN: 20% (5x faster, parallel)
- Transformer: 60% (1.7x faster)
Accuracy (similar across all three):
- LSTM: Baseline
- TCN: Equal or slightly better
- Transformer: Equal or slightly better (needs more data)
Conclusion: TCN wins on speed, matches accuracy
Code Example:
import torch
import torch.nn as nn

class Chomp1d(nn.Module):
    """Trim trailing padding so each convolution stays causal."""
    def __init__(self, chomp_size):
        super().__init__()
        self.chomp_size = chomp_size

    def forward(self, x):
        return x[:, :, :-self.chomp_size]

class TCN(nn.Module):
    def __init__(self, input_channels, num_channels, kernel_size=3):
        super().__init__()
        layers = []
        for i in range(len(num_channels)):
            dilation_size = 2 ** i
            in_channels = input_channels if i == 0 else num_channels[i - 1]
            out_channels = num_channels[i]
            padding = (kernel_size - 1) * dilation_size
            # Causal dilated convolution: pad, then chomp the extra
            # right-side outputs so no timestep sees the future
            layers += [
                nn.Conv1d(in_channels, out_channels, kernel_size,
                          padding=padding, dilation=dilation_size),
                Chomp1d(padding),
                nn.ReLU(),
                nn.Dropout(0.2),
            ]
        self.network = nn.Sequential(*layers)

    def forward(self, x):
        # x: (batch, channels, sequence_length)
        return self.network(x)

# Usage for time series
# Input: (batch, channels, sequence_length)
model = TCN(input_channels=1, num_channels=[64, 128, 256])
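The speed figures above vary by hardware; a rough micro-benchmark sketch to check them yourself (assumes the SeqLSTM and TCN classes defined earlier; sizes are arbitrary):

import time
import torch

def time_forward(model, x, n_iters=20):
    # Crude wall-clock timing of repeated forward passes (CPU example;
    # on GPU, add torch.cuda.synchronize() before reading the clock)
    model.eval()
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(n_iters):
            model(x)
    return (time.perf_counter() - start) / n_iters

seq = torch.randn(32, 1000, 1)  # (batch, seq_len, features) for the LSTM
lstm_time = time_forward(SeqLSTM(1, 64, 2), seq)
tcn_time = time_forward(TCN(1, [64, 64, 64]), seq.transpose(1, 2))  # (batch, channels, len)
print(f"LSTM {lstm_time * 1e3:.1f} ms vs TCN {tcn_time * 1e3:.1f} ms per batch")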
Key Insight: Dilated convolutions give a receptive field that grows exponentially with depth (≈2^k after k layers) while keeping memory linear in sequence length
Status: Excellent for time series, underrated, should be considered before LSTM
6. Sparse Transformers - Long Sequence Specialists
Architecture: Modified attention patterns to reduce complexity
Variants:
- Longformer: Local + global attention
- BigBird: Random + local + global attention
- Linformer: Low-rank projection of keys/values
- Performer: Kernel approximation of attention
Complexity: O(n log n) or O(n) depending on variant
When to Use:
- ✅ Long sequences (1000-10000 tokens)
- ✅ Document processing (multi-page documents)
- ✅ Long-context language modeling
- ✅ When standard Transformer runs out of memory
Trade-offs:
- Slightly lower accuracy than full attention (approximation)
- More complex implementation
- Fewer pre-trained models
Example Use Cases:
- Legal document analysis (10k+ tokens)
- Scientific paper understanding
- Long-form text generation
- Time series with thousands of steps
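Code Example (a minimal sketch using Hugging Face's pre-trained Longformer; putting global attention only on the [CLS] token is a common but task-dependent choice, and long_document is a placeholder):

import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')
model = LongformerModel.from_pretrained('allenai/longformer-base-4096')

long_document = "..."  # placeholder: any multi-page text, up to 4096 tokens
inputs = tokenizer(long_document, return_tensors='pt',
                   truncation=True, max_length=4096)

# Sliding-window (local) attention everywhere; global attention on [CLS]
global_attention_mask = torch.zeros_like(inputs['input_ids'])
global_attention_mask[:, 0] = 1

outputs = model(**inputs, global_attention_mask=global_attention_mask)
doc_embedding = outputs.last_hidden_state[:, 0]  # [CLS] representation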
Status: Specialized for long sequences, active research area
7. State Space Models (S4) - Cutting Edge
Architecture: Structured state space with efficient recurrence
Complexity: O(n log n) training, O(n) inference
Strengths:
- ✅ Very long sequences (10k-100k steps)
- ✅ Linear inference complexity
- ✅ Strong theoretical foundations
- ✅ Handles continuous-time sequences
Weaknesses:
- ❌ Newer (less mature ecosystem)
- ❌ Complex mathematics
- ❌ Fewer pre-trained models
- ❌ Harder to implement
When to Use:
- ✅ Extremely long sequences (> 10k steps)
- ✅ Audio (raw waveforms, 16kHz sampling)
- ✅ Medical signals (ECG, EEG)
- ✅ Research applications
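The core idea can be illustrated without the S4-specific parameterization: a discretized linear state space model is just a learned recurrence. A toy sketch; the random matrices stand in for S4's structured, learned A, B, C:

import torch

# Toy discretized state space model: x_k = A x_{k-1} + B u_k, y_k = C x_k
# (S4's contribution is a structured A plus an FFT-based convolutional view
#  that makes training O(n log n); this loop is the O(n) inference view.)
state_dim, seq_len = 16, 10_000
A = 0.99 * torch.eye(state_dim)     # stand-in for the structured HiPPO-style A
B = torch.randn(state_dim, 1) * 0.01
C = torch.randn(1, state_dim)

x = torch.zeros(state_dim, 1)
u = torch.randn(seq_len)            # input signal
ys = []
for k in range(seq_len):
    x = A @ x + B * u[k]            # constant memory per step
    ys.append((C @ x).item())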
Status: Cutting edge (2021+), promising for very long sequences
Practical Selection Guide
Scenario 1: Natural Language Processing
Short text (< 50 tokens, e.g., tweets, titles):
Small dataset (< 10k):
→ BiLSTM or 1D CNN (simple, effective)
Large dataset (> 10k):
→ DistilBERT (smaller Transformer, ~66M params)
→ Or BiLSTM if latency critical
Medium text (50-512 tokens, e.g., reviews, articles):
Standard approach:
→ BERT, RoBERTa, or similar (110M params)
→ Fine-tune on task-specific data
Small dataset:
→ DistilBERT (66M params, faster, similar accuracy)
Long documents (> 512 tokens):
→ Longformer (4096 tokens max)
→ BigBird (4096 tokens max)
→ Hierarchical: Process in chunks, aggregate
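For the DistilBERT route, a minimal inference sketch with the Hugging Face Auto classes (fine-tuning loop omitted; the example sentences are illustrative):

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
model = AutoModelForSequenceClassification.from_pretrained(
    'distilbert-base-uncased', num_labels=2)

batch = tokenizer(["great movie", "terrible plot"],
                  padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    logits = model(**batch).logits   # (2, num_labels)
preds = logits.argmax(dim=-1)
# Fine-tune as usual with cross-entropy on your labeled data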
Scenario 2: Time Series Forecasting
Short sequences (< 100 steps):
Fast training:
→ TCN (2-3x faster than LSTM)
Small dataset:
→ LSTM or simple models (ARIMA, Prophet)
Baseline:
→ LSTM (well-tested)
Medium sequences (100-1000 steps):
Best accuracy:
→ Transformer (if data > 50k examples)
Fast training/inference:
→ TCN (parallel processing)
Multivariate:
→ Transformer with cross-series attention
Long sequences (> 1000 steps):
→ Sparse Transformer (Informer for time series)
→ Hierarchical models (chunk + aggregate)
→ State Space Models (S4)
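A minimal encoder-only Transformer forecaster sketch built from torch.nn blocks (layer sizes and the learned positional embedding are illustrative choices, not a prescribed recipe):

import torch
import torch.nn as nn

class TSTransformer(nn.Module):
    def __init__(self, n_features, d_model=64, n_heads=4,
                 n_layers=2, max_len=1000):
        super().__init__()
        self.input_proj = nn.Linear(n_features, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)    # learned positions
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=128,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, 1)                # one-step-ahead forecast

    def forward(self, x):
        # x: (batch, seq_len, n_features)
        pos = torch.arange(x.size(1), device=x.device)
        h = self.input_proj(x) + self.pos_emb(pos)
        h = self.encoder(h)
        return self.head(h[:, -1])                       # predict from last step

model = TSTransformer(n_features=3)
y_hat = model(torch.randn(8, 200, 3))                    # (8, 1)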
Scenario 3: Audio Processing
Waveform (raw audio, 16kHz):
→ WaveNet (TCN-based)
→ State Space Models (S4)
Spectrograms (mel-spectrograms):
→ CNN + BiLSTM (traditional)
→ CNN + Transformer (modern)
Speech recognition:
→ Transformer (Wav2Vec 2.0, Whisper)
→ Pre-trained models available
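For the spectrogram route, a minimal torchaudio front-end sketch (parameter values are typical but illustrative):

import torch
import torchaudio

# 16 kHz waveform -> 80-bin log-mel spectrogram, ready for a CNN/Transformer
mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000,
                                           n_fft=400, hop_length=160,
                                           n_mels=80)
waveform = torch.randn(1, 16000)        # 1 second of (synthetic) audio
spec = torch.log(mel(waveform) + 1e-6)  # (1, 80, ~101 frames)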
Trade-Off Analysis
Speed Comparison
Training speed, 1000-step sequences (wall-clock time relative to LSTM; lower is faster):
LSTM: 100% (baseline, sequential)
GRU: 75% (simpler gates)
TCN: 35% (2.8x faster, parallel)
Transformer: 45% (2.2x faster, parallel)
Conclusion: TCN fastest for training
Inference speed (relative to LSTM; lower is faster):
LSTM: 100% (sequential)
BiLSTM: 200% (2x passes)
TCN: 20% (5x faster, parallel)
Transformer: 60% (faster, but attention overhead)
Conclusion: TCN fastest for inference
Memory Comparison
Approximate peak memory (illustrative), sequence length n=1000, batch=32:
LSTM: ~500 MB (linear in n)
Transformer: ~2 GB (quadratic in n)
TCN: ~400 MB (linear in n)
Sparse Transformer: ~800 MB (n log n)
For n=5000:
LSTM: ~2 GB
Transformer: OUT OF MEMORY on typical GPUs (tens of GB once heads and layers are counted)
TCN: ~2 GB
Sparse Transformer: ~4 GB
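These figures are rough, implementation-dependent estimates; the attention term alone can be sanity-checked with a small helper (fp32 by default, counting only the stored attention matrices):

def attn_matrix_gb(batch, seq_len, n_heads=1, n_layers=1, bytes_per_el=4):
    """Rough size of stored attention matrices in GB."""
    return batch * n_heads * n_layers * seq_len ** 2 * bytes_per_el / 1e9

print(attn_matrix_gb(32, 1000))                           # ~0.13 GB, single head/layer
print(attn_matrix_gb(32, 5000, n_heads=12, n_layers=12))  # blows up fast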
Accuracy vs Data Size
Small dataset (< 10k examples):
LSTM: ★★★★☆ (works well with little data)
Transformer: ★★☆☆☆ (overfits, needs more data)
TCN: ★★★★☆ (similar to LSTM)
Winner: LSTM or TCN
Large dataset (> 100k examples):
LSTM: ★★★☆☆ (good but plateaus)
Transformer: ★★★★★ (best, scales with data)
TCN: ★★★★☆ (competitive)
Winner: Transformer
Common Pitfalls
Pitfall 1: Using LSTM in 2025 Without Considering Modern Alternatives
Symptom: Defaulting to LSTM for all sequence tasks
Why it's wrong: Transformers (language) and TCN (time series) often better
Fix: Consider Transformer for language, TCN for time series, LSTM for small data/edge only
Pitfall 2: Using Standard Transformer for Very Long Sequences
Symptom: Running out of memory on sequences > 1000 tokens
Why it's wrong: O(n²) memory explodes
Fix: Use Sparse Transformer (Longformer, BigBird) or hierarchical approach
Pitfall 3: Not Trying TCN for Time Series
Symptom: Struggling with slow LSTM training
Why it's wrong: TCN is 2-3x faster, often more accurate
Fix: Try TCN before optimizing LSTM
Pitfall 4: Using Transformer for Small Datasets
Symptom: Transformer overfits on < 10k examples
Why it's wrong: Transformers need large datasets to work well
Fix: Use LSTM or TCN for small datasets, or use pre-trained Transformer
Pitfall 5: Ignoring Sequence Length Constraints
Symptom: Choosing architecture without considering typical sequence length
Why it's wrong: Architecture effectiveness varies dramatically with length
Fix: Match architecture to sequence length (short → LSTM/CNN, long → Sparse Transformer)
Evolution Timeline
Understanding why architectures evolved:
2010-2013: Basic RNN
→ Vanishing gradient problem
→ Can't learn long dependencies
2014: LSTM becomes the standard (introduced by Hochreiter & Schmidhuber, 1997)
→ Gates solve vanishing gradient
→ Became standard for sequences
2014: GRU
→ Simplified LSTM
→ Similar performance, fewer parameters
2017: Transformer (Attention Is All You Need)
→ Self-attention replaces recurrence
→ Parallel processing (fast training)
→ Revolutionized NLP
2018: TCN (Temporal Convolutional Networks)
→ Dilated convolutions for sequences
→ Often better than LSTM for time series
→ Underrated alternative
2020: Sparse Transformers
→ Reduce quadratic complexity
→ Enable longer sequences
2021: State Space Models (S4)
→ Very long sequences (10k-100k)
→ Theoretical foundations
→ Cutting edge research
Current (2025):
- NLP: Transformer standard (BERT, GPT)
- Time Series: TCN or Transformer
- Audio: Specialized (WaveNet, Transformer)
- Edge: LSTM or TCN (low memory)
Decision Checklist
Before choosing sequence model:
☐ Sequence length? (< 100 / 100-1k / > 1k)
☐ Data type? (language / time series / audio / other)
☐ Dataset size? (< 10k / 10k-100k / > 100k)
☐ Latency requirement? (real-time / batch / offline)
☐ Deployment target? (cloud / edge / mobile)
☐ Pre-trained models available? (yes / no)
☐ Training speed critical? (yes / no)
Based on answers:
→ Language + large data → Transformer
→ Language + small data → BiLSTM or DistilBERT
→ Time series + speed → TCN
→ Time series + accuracy + large data → Transformer
→ Very long sequences → Sparse Transformer or S4
→ Edge deployment → TCN or LSTM
→ Real-time latency → TCN
Integration with Other Skills
For language-specific questions:
→ yzmir/llm-specialist/using-llm-specialist
- LLM-specific Transformers (GPT, BERT variants)
- Fine-tuning strategies
- Prompt engineering
For Transformer internals:
→ yzmir/neural-architectures/transformer-architecture-deepdive
- Attention mechanisms
- Positional encoding
- Transformer variants
After selecting architecture:
→ yzmir/training-optimization/using-training-optimization
- Optimizer selection
- Learning rate schedules
- Handling sequence-specific training issues
Summary
Quick Reference Table:
| Use Case | Best Choice | Alternative | Avoid |
|---|---|---|---|
| Short text (< 50 tokens) | BiLSTM, DistilBERT | 1D CNN | Full BERT (overkill) |
| Long text (> 512 tokens) | Longformer, BigBird | Hierarchical | Standard BERT (memory) |
| Time series (< 1k steps) | TCN, Transformer | LSTM | Basic RNN |
| Time series (> 1k steps) | Sparse Transformer, S4 | Hierarchical | Standard Transformer |
| Small dataset (< 10k) | LSTM, TCN | Simple models | Transformer (overfits) |
| Large dataset (> 100k) | Transformer | TCN | LSTM (plateaus) |
| Edge deployment | TCN, LSTM | Quantized Transformer | Large Transformer |
| Real-time inference | TCN | Small LSTM | BiLSTM, Transformer |
Key Principles:
- Don't default to LSTM (outdated for most tasks)
- Transformer for language (current standard, if data sufficient)
- TCN for time series (fast, effective, underrated)
- Match to sequence length (short → LSTM/CNN, long → Sparse Transformer)
- Consider modern alternatives (don't stop at LSTM vs Transformer)
END OF SKILL