# Sequence Models Comparison: Choosing the Right Architecture for Sequential Data

<CRITICAL_CONTEXT>
Sequence modeling has evolved rapidly:

- 2014-2017: LSTM/GRU dominated
- 2017+: Transformers revolutionized the field
- 2018+: TCN emerged as an efficient alternative
- 2020+: Sparse Transformers for very long sequences
- 2021+: State Space Models (S4) for extreme lengths

Don't default to LSTM (outdated) or blindly use Transformers (not always appropriate).
Match the architecture to your sequence characteristics.
</CRITICAL_CONTEXT>

## When to Use This Skill

Use this skill when:

- ✅ Selecting a model for sequential/temporal data
- ✅ Comparing RNN vs LSTM vs Transformer
- ✅ Deciding on a sequence architecture for time series, text, audio
- ✅ Understanding modern alternatives to LSTM
- ✅ Optimizing for sequence length, speed, or accuracy

DO NOT use for:

- ❌ Vision tasks (use cnn-families-and-selection)
- ❌ Graph-structured data (use graph-neural-networks-basics)
- ❌ LLM-specific questions (use llm-specialist pack)

**When in doubt:** If data is sequential/temporal → this skill.

## Selection Framework

### Step 1: Identify Key Characteristics

**Before recommending, ask:**

| Characteristic | Question | Impact |
|----------------|----------|--------|
| **Sequence Length** | Typical length? | Short (< 100) → LSTM/CNN, Medium (100-1k) → Transformer, Long (> 1k) → Sparse Transformer/S4 |
| **Data Type** | Language, time series, audio? | Language → Transformer, Time series → TCN/Transformer, Audio → Specialized |
| **Data Volume** | Training examples? | Small (< 10k) → LSTM/TCN, Large (> 100k) → Transformer |
| **Latency** | Real-time needed? | Yes → TCN/LSTM, No → Transformer |
| **Deployment** | Cloud/edge/mobile? | Edge → TCN/LSTM, Cloud → Any |

### Step 2: Apply Decision Tree

```
START: What's your primary constraint?

┌─ SEQUENCE LENGTH
│  ├─ Short (< 100 steps)
│  │  ├─ Language → BiLSTM or small Transformer
│  │  └─ Time series → TCN or LSTM
│  │
│  ├─ Medium (100-1000 steps)
│  │  ├─ Language → Transformer (BERT-style)
│  │  └─ Time series → Transformer or TCN
│  │
│  ├─ Long (1000-10000 steps)
│  │  ├─ Sparse Transformer (Longformer, BigBird)
│  │  └─ Hierarchical models
│  │
│  └─ Very Long (> 10000 steps)
│     └─ State Space Models (S4)
│
├─ DATA TYPE
│  ├─ Natural Language
│  │  ├─ < 50k examples → BiLSTM or DistilBERT
│  │  └─ > 50k examples → Transformer (BERT, RoBERTa)
│  │
│  ├─ Time Series
│  │  ├─ Fast training → TCN
│  │  ├─ Long sequences → Transformer
│  │  └─ Multivariate → Transformer with cross-series attention
│  │
│  └─ Audio
│     ├─ Waveform → WaveNet (TCN-based)
│     └─ Spectrograms → CNN + Transformer
│
└─ COMPUTATIONAL CONSTRAINT
   ├─ Edge device → TCN or small LSTM
   ├─ Real-time latency → TCN (parallel inference)
   └─ Cloud, no constraint → Transformer
```
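
If it helps to make these rules explicit, the tree can be encoded as a small heuristic. This is a minimal sketch mirroring the diagram above; the thresholds and return strings are illustrative rules of thumb, not hard limits.

```python
def choose_architecture(seq_len, data_type, n_examples,
                        realtime=False, edge=False):
    """Illustrative heuristic mirroring the decision tree above."""
    if edge or realtime:
        return "TCN (or small LSTM)"
    if seq_len > 10_000:
        return "State Space Model (S4)"
    if seq_len > 1_000:
        return "Sparse Transformer (Longformer/BigBird)"
    if data_type == "language":
        return "Transformer (BERT-style)" if n_examples > 50_000 else "BiLSTM or DistilBERT"
    if data_type == "time_series":
        return "Transformer" if n_examples > 50_000 else "TCN"
    if data_type == "audio":
        return "WaveNet-style TCN or CNN + Transformer"
    return "TCN or LSTM baseline"

print(choose_architecture(seq_len=300, data_type="time_series", n_examples=20_000))
# → TCN
```
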

## Architecture Catalog

### 1. RNN (Recurrent Neural Networks) - Legacy Foundation

**Architecture:** Basic recurrent cell with hidden state

**Status:** **OUTDATED** - don't use for new projects

**Why it existed:**

- First neural approach to sequences
- Hidden state captures temporal information
- Theoretically can model any sequence

**Why it failed:**

- Vanishing gradient (can't learn long dependencies)
- Very slow training (sequential processing)
- Superseded in practice by LSTM, which became the standard for sequence tasks by ~2014

**When to mention:**

- Historical context only
- Teaching purposes
- Never recommend for production

**Key Insight:** Proved neural nets could handle sequences, but impractical due to vanishing gradients

### 2. LSTM (Long Short-Term Memory) - Legacy Standard

**Architecture:** Gated recurrent cell (forget, input, output gates)

**Complexity:** O(n) memory, sequential processing

**Strengths:**

- Solves vanishing gradient (gates maintain long-term info)
- Works well for short-medium sequences (< 500 steps)
- Small datasets (< 10k examples)
- Low memory footprint

**Weaknesses:**

- Sequential processing (slow training, can't parallelize)
- Still struggles with very long sequences (> 1000 steps)
- Slow inference (especially bidirectional)
- Superseded by Transformers for most language tasks

**When to Use:**

- ✅ Small datasets (< 10k examples)
- ✅ Short sequences (< 100 steps)
- ✅ Edge deployment (low memory)
- ✅ Baseline comparison

**When NOT to Use:**

- ❌ Large datasets (Transformer better)
- ❌ Long sequences (> 500 steps)
- ❌ Modern NLP (Transformer standard)
- ❌ Fast training needed (TCN better)

**Code Example:**

```python
import torch
import torch.nn as nn

class SeqLSTM(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size,
                            num_layers=2,
                            batch_first=True,
                            bidirectional=True)
        # Bidirectional: outputs from both directions are concatenated
        self.fc = nn.Linear(hidden_size * 2, num_classes)

    def forward(self, x):
        # x: (batch, seq_len, features)
        lstm_out, _ = self.lstm(x)
        # Use the last timestep's output for classification
        out = self.fc(lstm_out[:, -1, :])
        return out
```
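
A quick shape check, continuing the snippet above (sizes are arbitrary):

```python
model = SeqLSTM(input_size=16, hidden_size=64, num_classes=3)
x = torch.randn(8, 50, 16)   # batch of 8 sequences, 50 steps, 16 features
print(model(x).shape)        # torch.Size([8, 3])
```
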

**Status:** Legacy but still useful for specific cases (small data, edge deployment)

### 3. GRU (Gated Recurrent Unit) - Simplified LSTM

**Architecture:** Simplified gating (2 gates instead of 3)

**Advantages over LSTM:**

- Fewer parameters (faster training)
- Similar performance in many tasks
- Lower memory

**Disadvantages:**

- Still sequential (same as LSTM)
- No major advantage over LSTM in practice
- Also superseded by Transformers

**When to Use:**

- Same cases as LSTM, but prefer LSTM for slightly better performance on average
- Use GRU if the parameter/compute savings matter

**Status:** Rarely recommended - if using a recurrent model, prefer LSTM or move to Transformer/TCN
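
For completeness, a GRU drop-in looks almost identical to the LSTM classifier above. A minimal sketch; sizes and layer counts are illustrative:

```python
import torch
import torch.nn as nn

class SeqGRU(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super().__init__()
        # Same interface as SeqLSTM; GRU uses 3 weight matrices per layer vs 4 for LSTM
        self.gru = nn.GRU(input_size, hidden_size,
                          num_layers=2, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_size * 2, num_classes)

    def forward(self, x):                  # x: (batch, seq_len, features)
        gru_out, _ = self.gru(x)
        return self.fc(gru_out[:, -1, :])  # last timestep, both directions
```
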
### 4. Transformer - Modern Standard

**Architecture:** Self-attention mechanism, parallel processing

**Complexity:**

- Memory: O(n²) for sequence length n
- Compute: O(n²d) where d is the embedding dimension

**Strengths:**

- ✅ Parallel processing (fast training)
- ✅ Captures long-range dependencies (better than LSTM)
- ✅ State-of-the-art for language (BERT, GPT)
- ✅ Pre-trained models available
- ✅ Scales with data (more data = better performance)

**Weaknesses:**

- ❌ Quadratic memory (struggles with sequences > 1000)
- ❌ Needs more data than LSTM (> 10k examples)
- ❌ Slower inference than TCN
- ❌ Harder to interpret than RNN

**When to Use:**

- ✅ **Natural language** (current standard)
- ✅ Medium sequences (100-1000 tokens)
- ✅ Large datasets (> 50k examples)
- ✅ Pre-training available (BERT, GPT)
- ✅ Accuracy priority

**When NOT to Use:**

- ❌ Short sequences (< 50 tokens) - LSTM/CNN competitive, simpler
- ❌ Very long sequences (> 2000) - quadratic memory explodes
- ❌ Small datasets (< 10k) - will overfit
- ❌ Edge deployment - large model size

**Memory Analysis:**

```python
# Standard Transformer attention (illustrative shapes, single head)

# For sequence length n=1000, batch_size=32, embedding_dim=512:
#   attention = softmax(Q @ K.transpose(-2, -1) / sqrt(d))   # shape: (32, 1000, 1000)
#   Memory: 32 * 1000 * 1000 * 4 bytes = 128 MB just for the attention weights!

# For n=5000:
#   Memory: 32 * 5000 * 5000 * 4 bytes = 3.2 GB per batch (per layer, single head)!
#   → Impractical on most GPUs
```
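
To generalize that arithmetic, here is a small helper. It is a rough estimate of the attention-weight matrices only (fp32, no activations or parameters), so treat the numbers as lower bounds:

```python
def attention_matrix_bytes(batch, seq_len, num_heads=1, num_layers=1, bytes_per_el=4):
    """Memory held by (batch, heads, n, n) attention weights across layers."""
    return batch * num_heads * num_layers * seq_len**2 * bytes_per_el

print(attention_matrix_bytes(32, 1000) / 1e6, "MB")   # 128.0 MB
print(attention_matrix_bytes(32, 5000) / 1e9, "GB")   # 3.2 GB
```
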

**Code Example:**

```python
import torch.nn as nn
from transformers import BertModel, BertTokenizer

# Pre-trained BERT for text classification
class TransformerClassifier(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        self.classifier = nn.Linear(768, num_classes)   # 768 = BERT-base hidden size

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask)
        # Use the pooled [CLS] token representation
        pooled = outputs.pooler_output
        return self.classifier(pooled)

# Fine-tuning setup
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TransformerClassifier(num_classes=2)
```
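
Typical use is to tokenize a batch and run a forward pass; the sketch below is illustrative only (training loop, optimizer, and labels omitted, `max_length` chosen arbitrarily):

```python
batch = tokenizer(["great product", "terrible service"],
                  padding=True, truncation=True, max_length=128,
                  return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"])   # shape: (2, 2)
```
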

**Status:** **Current standard for NLP**, competitive for time series with large data

### 5. TCN (Temporal Convolutional Network) - Underrated Alternative

**Architecture:** 1D convolutions with dilated causal convolutions

**Complexity:** O(n) memory, fully parallel processing

**Strengths:**

- ✅ **Parallel training** (much faster than LSTM)
- ✅ **Parallel inference** (faster than LSTM/Transformer)
- ✅ Linear memory (no quadratic blow-up)
- ✅ Large receptive field (dilation)
- ✅ Works well for time series
- ✅ Simple architecture

**Weaknesses:**

- ❌ Less popular (fewer pre-trained models)
- ❌ Not standard for language (Transformer dominates)
- ❌ Fixed receptive field (vs adaptive attention)

**When to Use:**

- ✅ **Time series forecasting** (often BETTER than LSTM)
- ✅ **Fast training needed** (2-3x faster than LSTM)
- ✅ **Fast inference** (real-time applications)
- ✅ Long sequences (linear memory)
- ✅ Audio processing (WaveNet is TCN-based)

**When NOT to Use:**

- ❌ Natural language with pre-training available (use Transformer)
- ❌ Need very large receptive field (Transformer better)

**Performance Comparison:**

```
Time series forecasting (1000-step sequences):

Training time (relative, lower is better):
- LSTM:        100% (baseline, sequential)
- TCN:          35% (≈2.8x faster, parallel)
- Transformer:  45% (≈2.2x faster)

Inference time (relative, lower is better):
- LSTM:        100% (sequential)
- TCN:          20% (≈5x faster, parallel)
- Transformer:  60% (≈1.7x faster)

Accuracy (similar across all three):
- LSTM:        baseline
- TCN:         equal or slightly better
- Transformer: equal or slightly better (needs more data)

Conclusion: TCN wins on speed, matches accuracy
```

**Code Example:**

```python
import torch
import torch.nn as nn

class Chomp1d(nn.Module):
    """Trim the extra right-side padding so the convolution stays causal."""
    def __init__(self, chomp_size):
        super().__init__()
        self.chomp_size = chomp_size

    def forward(self, x):
        return x[:, :, :-self.chomp_size]

class TCN(nn.Module):
    def __init__(self, input_channels, num_channels, kernel_size=3):
        super().__init__()

        layers = []
        num_levels = len(num_channels)

        for i in range(num_levels):
            dilation_size = 2 ** i
            in_channels = input_channels if i == 0 else num_channels[i-1]
            out_channels = num_channels[i]
            padding = (kernel_size - 1) * dilation_size

            # Causal dilated convolution: pad, then chomp the trailing padding
            layers.append(
                nn.Conv1d(in_channels, out_channels, kernel_size,
                          padding=padding,
                          dilation=dilation_size)
            )
            layers.append(Chomp1d(padding))
            layers.append(nn.ReLU())
            layers.append(nn.Dropout(0.2))

        self.network = nn.Sequential(*layers)

    def forward(self, x):
        # x: (batch, channels, sequence_length)
        return self.network(x)

# Usage for time series
model = TCN(input_channels=1, num_channels=[64, 128, 256])
```

**Key Insight:** Dilated convolutions grow the receptive field exponentially with depth (≈2^k for k layers) while keeping memory linear in sequence length
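
A quick check of that claim, as a back-of-the-envelope helper (it assumes one causal convolution per level with dilation 2^i, as in the sketch above):

```python
def tcn_receptive_field(num_levels, kernel_size=3):
    # Each level i adds (kernel_size - 1) * 2**i steps of history
    return 1 + sum((kernel_size - 1) * 2**i for i in range(num_levels))

print(tcn_receptive_field(num_levels=3))    # 15 steps for the 3-level model above
print(tcn_receptive_field(num_levels=10))   # 2047 steps with just 10 levels
```
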

**Status:** **Excellent for time series**, underrated, should be considered before LSTM

### 6. Sparse Transformers - Long Sequence Specialists

**Architecture:** Modified attention patterns to reduce complexity

**Variants:**

- **Longformer**: Local + global attention
- **BigBird**: Random + local + global attention
- **Linformer**: Low-rank projection of keys/values
- **Performer**: Kernel approximation of attention

**Complexity:** O(n log n) or O(n) depending on variant

**When to Use:**

- ✅ **Long sequences** (1000-10000 tokens)
- ✅ Document processing (multi-page documents)
- ✅ Long-context language modeling
- ✅ When standard Transformer runs out of memory

**Trade-offs:**

- Slightly lower accuracy than full attention (approximation)
- More complex implementation
- Fewer pre-trained models

**Example Use Cases:**

- Legal document analysis (10k+ tokens)
- Scientific paper understanding
- Long-form text generation
- Time series with thousands of steps

**Status:** Specialized for long sequences, active research area
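
In the Hugging Face ecosystem, using a sparse-attention model looks much like the BERT example earlier. A sketch assuming the public `allenai/longformer-base-4096` checkpoint; the choice of which tokens get global attention is task-dependent:

```python
import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

text = "long document text " * 800   # a multi-page document
inputs = tokenizer(text, truncation=True, max_length=4096, return_tensors="pt")

# Local (windowed) attention everywhere; mark the first token as globally attending
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

outputs = model(**inputs, global_attention_mask=global_attention_mask)
doc_embedding = outputs.last_hidden_state[:, 0]   # (1, 768)
```
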
### 7. State Space Models (S4) - Cutting Edge

**Architecture:** Structured state space with efficient recurrence

**Complexity:** O(n log n) training, O(n) inference
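
The core idea is a learned linear state space unrolled over the sequence: x_{k+1} = A·x_k + B·u_k, y_k = C·x_k + D·u_k. The naive recurrence below is purely a conceptual sketch with toy matrices; real S4 uses structured (HiPPO-initialized) matrices and an FFT-based convolution to reach the complexities above:

```python
import torch

def ssm_scan(A, B, C, D, u):
    """Naive discrete-time state-space recurrence (O(n) sequential scan)."""
    x = torch.zeros(A.shape[0])
    ys = []
    for u_k in u:
        y_k = C @ x + D * u_k     # output from current state
        x = A @ x + B * u_k       # state update
        ys.append(y_k)
    return torch.stack(ys)

# Toy example: 4-dimensional state, scalar input/output, 1000 steps
A = 0.9 * torch.eye(4)
B = torch.ones(4)
C = torch.ones(4) / 4
D = torch.tensor(1.0)
y = ssm_scan(A, B, C, D, torch.randn(1000))   # shape: (1000,)
```
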

**Strengths:**

- ✅ **Very long sequences** (10k-100k steps)
- ✅ Linear inference complexity
- ✅ Strong theoretical foundations
- ✅ Handles continuous-time sequences

**Weaknesses:**

- ❌ Newer (less mature ecosystem)
- ❌ Complex mathematics
- ❌ Fewer pre-trained models
- ❌ Harder to implement

**When to Use:**

- ✅ Extremely long sequences (> 10k steps)
- ✅ Audio (raw waveforms, 16kHz sampling)
- ✅ Medical signals (ECG, EEG)
- ✅ Research applications

**Status:** **Cutting edge** (2022+), promising for very long sequences

## Practical Selection Guide

### Scenario 1: Natural Language Processing

**Short text (< 50 tokens, e.g., tweets, titles):**

```
Small dataset (< 10k):
→ BiLSTM or 1D CNN (simple, effective)

Large dataset (> 10k):
→ DistilBERT (smaller Transformer, 66M params)
→ Or BiLSTM if latency critical
```

**Medium text (50-512 tokens, e.g., reviews, articles):**

```
Standard approach:
→ BERT, RoBERTa, or similar (~110M params)
→ Fine-tune on task-specific data

Small dataset:
→ DistilBERT (66M params, faster, similar accuracy)
```

**Long documents (> 512 tokens):**

```
→ Longformer (4096 tokens max)
→ BigBird (4096 tokens max)
→ Hierarchical: Process in chunks, aggregate
```
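
The hierarchical option can be as simple as encoding fixed-size chunks and pooling their embeddings. A sketch assuming a BERT-like `tokenizer`/`encoder` pair (as in the earlier example); the chunk size and mean-pooling are illustrative choices:

```python
import torch

def encode_long_document(text, tokenizer, encoder, chunk_tokens=510):
    """Split a long document into chunks, encode each, and mean-pool."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    chunk_embeddings = []
    for start in range(0, len(ids), chunk_tokens):
        chunk = tokenizer.decode(ids[start:start + chunk_tokens])
        batch = tokenizer(chunk, truncation=True, max_length=512,
                          return_tensors="pt")
        with torch.no_grad():
            out = encoder(**batch)
        chunk_embeddings.append(out.pooler_output)    # (1, hidden)
    return torch.cat(chunk_embeddings).mean(dim=0)    # (hidden,)
```
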

### Scenario 2: Time Series Forecasting

**Short sequences (< 100 steps):**

```
Fast training:
→ TCN (2-3x faster than LSTM)

Small dataset:
→ LSTM or simple models (ARIMA, Prophet)

Baseline:
→ LSTM (well-tested)
```

**Medium sequences (100-1000 steps):**

```
Best accuracy:
→ Transformer (if data > 50k examples)

Fast training/inference:
→ TCN (parallel processing)

Multivariate:
→ Transformer with cross-series attention
```

**Long sequences (> 1000 steps):**

```
→ Sparse Transformer (Informer for time series)
→ Hierarchical models (chunk + aggregate)
→ State Space Models (S4)
```

### Scenario 3: Audio Processing

**Waveform (raw audio, 16kHz):**

```
→ WaveNet (TCN-based)
→ State Space Models (S4)
```

**Spectrograms (mel-spectrograms):**

```
→ CNN + BiLSTM (traditional)
→ CNN + Transformer (modern)
```

**Speech recognition:**

```
→ Transformer (Wav2Vec 2.0, Whisper)
→ Pre-trained models available
```

## Trade-Off Analysis

### Speed Comparison

**Training time (relative, lower is better), 1000-step sequences:**

```
LSTM:        100% (baseline, sequential)
GRU:          75% (simpler gates)
TCN:          35% (≈2.8x faster, parallel)
Transformer:  45% (≈2.2x faster, parallel)

Conclusion: TCN fastest for training
```

**Inference time (relative, lower is better):**

```
LSTM:        100% (sequential)
BiLSTM:      200% (2x passes)
TCN:          20% (≈5x faster, parallel)
Transformer:  60% (faster, but attention overhead)

Conclusion: TCN fastest for inference
```
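
These percentages are indicative; on your own hardware a quick micro-benchmark settles it. A rough sketch with untrained models, arbitrary sizes, and a non-causal convolution stack standing in for a TCN:

```python
import time
import torch
import torch.nn as nn

x_lstm = torch.randn(32, 1000, 64)            # (batch, seq_len, features)
x_conv = x_lstm.transpose(1, 2)               # (batch, channels, seq_len)

lstm = nn.LSTM(64, 128, num_layers=2, batch_first=True)
tcn = nn.Sequential(*[nn.Conv1d(64 if i == 0 else 128, 128, 3,
                                padding=2**i, dilation=2**i)
                      for i in range(4)])

def bench(fn, n=10):
    with torch.no_grad():
        fn()                                  # warm-up
        start = time.perf_counter()
        for _ in range(n):
            fn()
        return (time.perf_counter() - start) / n

print("LSTM forward:", bench(lambda: lstm(x_lstm)))
print("TCN  forward:", bench(lambda: tcn(x_conv)))
```
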

### Memory Comparison

**Sequence length n=1000, batch=32:**

```
LSTM:                ~500 MB (linear in n)
Transformer:         ~2 GB (quadratic in n)
TCN:                 ~400 MB (linear in n)
Sparse Transformer:  ~800 MB (n log n)

For n=5000:
LSTM:                ~2 GB
Transformer:         OUT OF MEMORY (~50 GB needed!)
TCN:                 ~2 GB
Sparse Transformer:  ~4 GB
```

### Accuracy vs Data Size

**Small dataset (< 10k examples):**

```
LSTM:        ★★★★☆ (works well with little data)
Transformer: ★★☆☆☆ (overfits, needs more data)
TCN:         ★★★★☆ (similar to LSTM)

Winner: LSTM or TCN
```

**Large dataset (> 100k examples):**

```
LSTM:        ★★★☆☆ (good but plateaus)
Transformer: ★★★★★ (best, scales with data)
TCN:         ★★★★☆ (competitive)

Winner: Transformer
```

## Common Pitfalls

### Pitfall 1: Using LSTM in 2025 Without Considering Modern Alternatives

**Symptom:** Defaulting to LSTM for all sequence tasks

**Why it's wrong:** Transformers (language) and TCN (time series) are usually better choices

**Fix:** Consider Transformer for language, TCN for time series; reserve LSTM for small-data or edge cases

### Pitfall 2: Using Standard Transformer for Very Long Sequences

**Symptom:** Running out of memory on sequences > 1000 tokens

**Why it's wrong:** O(n²) attention memory explodes

**Fix:** Use a Sparse Transformer (Longformer, BigBird) or a hierarchical approach

### Pitfall 3: Not Trying TCN for Time Series

**Symptom:** Struggling with slow LSTM training

**Why it's wrong:** TCN trains 2-3x faster and is often at least as accurate

**Fix:** Try TCN before optimizing LSTM

### Pitfall 4: Using Transformer for Small Datasets

**Symptom:** Transformer overfits on < 10k examples

**Why it's wrong:** Transformers need large datasets (or pre-training) to work well

**Fix:** Use LSTM or TCN for small datasets, or fine-tune a pre-trained Transformer

### Pitfall 5: Ignoring Sequence Length Constraints

**Symptom:** Choosing an architecture without considering typical sequence length

**Why it's wrong:** Architecture effectiveness varies dramatically with length

**Fix:** Match architecture to sequence length (short → LSTM/CNN, long → Sparse Transformer)

## Evolution Timeline

**Understanding why architectures evolved:**

```
1980s-2013: Basic RNN
→ Vanishing gradient problem
→ Can't learn long dependencies

1997: LSTM (Hochreiter & Schmidhuber)
→ Gates solve vanishing gradient
→ Became the standard for sequences by ~2014 (seq2seq era)

2014: GRU
→ Simplified LSTM
→ Similar performance, fewer parameters

2017: Transformer (Attention Is All You Need)
→ Self-attention replaces recurrence
→ Parallel processing (fast training)
→ Revolutionized NLP

2018: TCN (Temporal Convolutional Networks)
→ Dilated convolutions for sequences
→ Often better than LSTM for time series
→ Underrated alternative

2019-2020: Sparse Transformers
→ Reduce quadratic complexity
→ Enable longer sequences

2021: State Space Models (S4)
→ Very long sequences (10k-100k)
→ Strong theoretical foundations
→ Cutting-edge research

Current (2025):
- NLP: Transformer standard (BERT, GPT)
- Time Series: TCN or Transformer
- Audio: Specialized (WaveNet, Transformer)
- Edge: LSTM or TCN (low memory)
```

## Decision Checklist

Before choosing a sequence model:

```
☐ Sequence length? (< 100 / 100-1k / > 1k)
☐ Data type? (language / time series / audio / other)
☐ Dataset size? (< 10k / 10k-100k / > 100k)
☐ Latency requirement? (real-time / batch / offline)
☐ Deployment target? (cloud / edge / mobile)
☐ Pre-trained models available? (yes / no)
☐ Training speed critical? (yes / no)

Based on answers:
→ Language + large data → Transformer
→ Language + small data → BiLSTM or DistilBERT
→ Time series + speed → TCN
→ Time series + accuracy + large data → Transformer
→ Very long sequences → Sparse Transformer or S4
→ Edge deployment → TCN or LSTM
→ Real-time latency → TCN
```

## Integration with Other Skills

**For language-specific questions:**
→ `yzmir/llm-specialist/using-llm-specialist`
- LLM-specific Transformers (GPT, BERT variants)
- Fine-tuning strategies
- Prompt engineering

**For Transformer internals:**
→ `yzmir/neural-architectures/transformer-architecture-deepdive`
- Attention mechanisms
- Positional encoding
- Transformer variants

**After selecting architecture:**
→ `yzmir/training-optimization/using-training-optimization`
- Optimizer selection
- Learning rate schedules
- Handling sequence-specific training issues

## Summary

**Quick Reference Table:**

| Use Case | Best Choice | Alternative | Avoid |
|----------|-------------|-------------|-------|
| Short text (< 50 tokens) | BiLSTM, DistilBERT | 1D CNN | Full BERT (overkill) |
| Long text (> 512 tokens) | Longformer, BigBird | Hierarchical | Standard BERT (memory) |
| Time series (< 1k steps) | TCN, Transformer | LSTM | Basic RNN |
| Time series (> 1k steps) | Sparse Transformer, S4 | Hierarchical | Standard Transformer |
| Small dataset (< 10k) | LSTM, TCN | Simple models | Transformer (overfits) |
| Large dataset (> 100k) | Transformer | TCN | LSTM (plateaus) |
| Edge deployment | TCN, LSTM | Quantized Transformer | Large Transformer |
| Real-time inference | TCN | Small LSTM | BiLSTM, Transformer |

**Key Principles:**

1. **Don't default to LSTM** (outdated for most tasks)
2. **Transformer for language** (current standard, if data sufficient)
3. **TCN for time series** (fast, effective, underrated)
4. **Match to sequence length** (short → LSTM/CNN, long → Sparse Transformer)
5. **Consider modern alternatives** (don't stop at LSTM vs Transformer)

**END OF SKILL**