# CNN Families and Selection: Choosing the Right Convolutional Network
<CRITICAL_CONTEXT>
CNNs are the foundation of computer vision. Different families have vastly different trade-offs:
- Accuracy vs Speed vs Size
- Dataset size requirements
- Deployment target (cloud vs edge vs mobile)
- Task type (classification vs detection vs segmentation)
This skill helps you choose the RIGHT CNN for YOUR constraints.
</CRITICAL_CONTEXT>
## When to Use This Skill
Use this skill when:
- ✅ Selecting CNN for vision task (classification, detection, segmentation)
- ✅ Comparing CNN families (ResNet vs EfficientNet vs MobileNet)
- ✅ Optimizing for specific constraints (latency, size, accuracy)
- ✅ Understanding CNN evolution (why newer architectures exist)
- ✅ Deployment-specific selection (cloud, edge, mobile)
DO NOT use for:
- ❌ Non-vision tasks (use sequence-models-comparison or other skills)
- ❌ Training optimization (use training-optimization pack)
- ❌ Implementation details (use pytorch-engineering pack)
**When in doubt:** If choosing WHICH CNN → this skill. If implementing/training CNN → other skills.
## Selection Framework
### Step 1: Identify Constraints
**Before recommending ANY architecture, ask:**
| Constraint | Question | Impact |
|------------|----------|--------|
| **Deployment** | Where will model run? | Cloud → Any, Edge → MobileNet/EfficientNet-Lite, Mobile → MobileNetV3 |
| **Latency** | Speed requirement? | Real-time (< 10ms) → MobileNet, Batch (> 100ms) → Any |
| **Model Size** | Parameter/memory budget? | < 10M params → MobileNet, < 50M → ResNet/EfficientNet, Any → Large models OK |
| **Dataset Size** | Training images? | < 10k → Small models, 10k-100k → Medium, > 100k → Large |
| **Accuracy** | Required accuracy? | Competitive → EfficientNet-B4+, Production → ResNet-50/EfficientNet-B2 |
| **Task Type** | Classification/detection/segmentation? | Detection → FPN-compatible, Segmentation → Multi-scale |
**Critical:** Get answers to these BEFORE recommending architecture.
### Step 2: Apply Decision Tree
```
START: What's your primary constraint?
┌─ DEPLOYMENT TARGET
│ ├─ Cloud / Server
│ │ └─ Dataset size?
│ │ ├─ Small (< 10k) → ResNet-18, EfficientNet-B0
│ │ ├─ Medium (10k-100k) → ResNet-50, EfficientNet-B2
│ │ └─ Large (> 100k) → ResNet-101, EfficientNet-B4, ViT
│ │
│ ├─ Edge Device (Jetson, Coral)
│ │ └─ Latency requirement?
│ │ ├─ Real-time (< 10ms) → MobileNetV3-Small, EfficientNet-Lite0
│ │ ├─ Medium (10-50ms) → MobileNetV3-Large, EfficientNet-Lite2
│ │ └─ Relaxed (> 50ms) → EfficientNet-B0, ResNet-18
│ │
│ └─ Mobile (iOS/Android)
│ └─ MobileNetV3-Small (fastest), MobileNetV3-Large (balanced)
│ + INT8 quantization (route to ml-production)
├─ ACCURACY PRIORITY (cloud deployment assumed)
│ ├─ Maximum accuracy → EfficientNet-B7, ResNet-152, ViT-Large
│ ├─ Balanced → EfficientNet-B2/B3, ResNet-50
│ └─ Fast training → ResNet-18, EfficientNet-B0
├─ EFFICIENCY PRIORITY
│ └─ Best accuracy per FLOP → EfficientNet family (B0-B7)
│ (EfficientNet dominates ResNet on Pareto frontier)
└─ TASK TYPE
├─ Classification → Any CNN (use constraint-based selection above)
├─ Object Detection → ResNet + FPN, EfficientDet, YOLOv8 (CSPDarknet)
└─ Segmentation → ResNet + U-Net, EfficientNet + DeepLabV3
```
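The same logic, expressed as a small helper (a minimal sketch; the function name, arguments, and thresholds are illustrative and simply mirror the tree above, not any library API):

```python
def select_cnn(deployment: str, dataset_size: int = 0, latency_ms: float = 100.0) -> str:
    """Illustrative encoding of the decision tree above (not a library API)."""
    if deployment == "mobile":
        return "mobilenet_v3_small" if latency_ms < 10 else "mobilenet_v3_large"
    if deployment == "edge":
        if latency_ms < 10:
            return "mobilenet_v3_small"   # or efficientnet_lite0
        if latency_ms <= 50:
            return "mobilenet_v3_large"   # or efficientnet_lite2
        return "efficientnet_b0"          # or resnet18
    # cloud / server: scale model capacity to dataset size
    if dataset_size < 10_000:
        return "efficientnet_b0"          # or resnet18
    if dataset_size <= 100_000:
        return "efficientnet_b2"          # or resnet50
    return "efficientnet_b4"              # or resnet101 / ViT


print(select_cnn("edge", latency_ms=30))          # mobilenet_v3_large
print(select_cnn("cloud", dataset_size=50_000))   # efficientnet_b2
```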
## CNN Family Catalog
### 1. ResNet Family (2015) - The Standard Baseline
**Architecture:** Residual connections (skip connections) enable very deep networks
**Variants:**
- ResNet-18: 11M params, 1.8 GFLOPs, 69.8% ImageNet
- ResNet-34: 22M params, 3.7 GFLOPs, 73.3% ImageNet
- ResNet-50: 25M params, 4.1 GFLOPs, 76.1% ImageNet
- ResNet-101: 44M params, 7.8 GFLOPs, 77.4% ImageNet
- ResNet-152: 60M params, 11.6 GFLOPs, 78.3% ImageNet
**When to Use:**
- **Baseline choice**: Well-tested, widely supported
- **Transfer learning**: Excellent pre-trained weights available
- **Object detection**: Standard backbone for Faster R-CNN, Mask R-CNN
- **Interpretability**: Simple architecture, easy to understand
**When NOT to Use:**
- **Edge/mobile deployment**: Too large and slow
- **Efficiency priority**: EfficientNet beats ResNet on accuracy/FLOP
- **Small datasets (< 10k)**: Use ResNet-18, not ResNet-50+
**Key Insight:** Skip connections solve vanishing gradient, enable depth
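A minimal sketch of the residual idea (simplified to equal channel counts and stride 1, so no projection shortcut; not the exact torchvision block):

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Simplified ResNet basic block: output = ReLU(F(x) + x)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)  # identity path lets gradients bypass the conv stack

x = torch.randn(1, 64, 56, 56)
print(BasicResidualBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```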
**Code Example:**
```python
import torchvision.models as models
# For cloud/server (good dataset)
model = models.resnet50(pretrained=True)
# For small dataset or faster training
model = models.resnet18(pretrained=True)
# For maximum accuracy (cloud only)
model = models.resnet101(pretrained=True)
```
### 2. EfficientNet Family (2019) - Best Efficiency
**Architecture:** Compound scaling (depth + width + resolution) optimized via neural architecture search
**Variants:**
- EfficientNet-B0: 5M params, 0.4 GFLOPs, 77.3% ImageNet
- EfficientNet-B1: 8M params, 0.7 GFLOPs, 79.2% ImageNet
- EfficientNet-B2: 9M params, 1.0 GFLOPs, 80.3% ImageNet
- EfficientNet-B3: 12M params, 1.8 GFLOPs, 81.7% ImageNet
- EfficientNet-B4: 19M params, 4.2 GFLOPs, 82.9% ImageNet
- EfficientNet-B7: 66M params, 37 GFLOPs, 84.4% ImageNet
**When to Use:**
- **Efficiency matters**: Best accuracy per FLOP/parameter
- **Cloud deployment**: B2-B4 sweet spot for production
- **Limited compute**: B0 matches ResNet-50 accuracy at 10x fewer FLOPs
- **Scaling needs**: Want to scale model up/down systematically
**When NOT to Use:**
- **Real-time mobile**: Use MobileNet (EfficientNet has more layers)
- **Very small datasets**: Can overfit despite efficiency
- **Simplicity needed**: More complex than ResNet
**Key Insight:** Compound scaling balances depth, width, and resolution optimally
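A rough sketch of the scaling rule (coefficients α ≈ 1.2, β ≈ 1.1, γ ≈ 1.15 from the EfficientNet paper, chosen so FLOPs roughly double per scaling step φ; the published B1-B7 configs round these multipliers, so treat this as the rule rather than the exact configs):

```python
# depth ~ alpha^phi, width ~ beta^phi, resolution ~ gamma^phi
# constraint: alpha * beta^2 * gamma^2 ~= 2, so each +1 in phi ~doubles FLOPs
alpha, beta, gamma = 1.2, 1.1, 1.15

for phi in range(4):  # phi = 0 corresponds roughly to B0
    print(f"phi={phi}: depth x{alpha**phi:.2f}, width x{beta**phi:.2f}, "
          f"resolution x{gamma**phi:.2f}, "
          f"FLOPs x{(alpha * beta**2 * gamma**2)**phi:.1f}")
```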
**Efficiency Comparison:**
```
Same accuracy as ResNet-50 (76%):
- ResNet-50: 25M params, 4.1 GFLOPs
- EfficientNet-B0: 5M params, 0.4 GFLOPs (10x more efficient!)
Better accuracy (82.9%):
- ResNet-152: 60M params, 11.6 GFLOPs → 78.3% ImageNet
- EfficientNet-B4: 19M params, 4.2 GFLOPs → 82.9% ImageNet
(Better accuracy with 3x fewer params and 3x less compute)
```
**Code Example:**
```python
import timm # PyTorch Image Models library
# Balanced choice (production)
model = timm.create_model('efficientnet_b2', pretrained=True)
# Efficiency priority (edge)
model = timm.create_model('efficientnet_b0', pretrained=True)
# Accuracy priority (research)
model = timm.create_model('efficientnet_b4', pretrained=True)
```
### 3. MobileNet Family (2017-2019) - Mobile Optimized
**Architecture:** Depthwise separable convolutions (drastically reduce compute)
**Variants:**
- MobileNetV1: 4.2M params, 0.6 GFLOPs, 70.6% ImageNet
- MobileNetV2: 3.5M params, 0.3 GFLOPs, 72.0% ImageNet
- MobileNetV3-Small: 2.5M params, 0.06 GFLOPs, 67.4% ImageNet
- MobileNetV3-Large: 5.4M params, 0.2 GFLOPs, 75.2% ImageNet
**When to Use:**
- **Mobile deployment**: iOS/Android apps
- **Edge devices**: Raspberry Pi, Jetson Nano
- **Real-time inference**: < 100ms latency
- **Extreme efficiency**: < 10M parameter budget
**When NOT to Use:**
- **Cloud deployment with no constraints**: EfficientNet or ResNet give better accuracy
- **Accuracy priority**: Sacrifices accuracy for speed
- **Large datasets with compute**: Can afford better models
**Key Insight:** Depthwise separable convolutions = standard conv split into depthwise + pointwise (9x fewer operations)
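A minimal sketch of the substitution (channel sizes are arbitrary examples; op counts are multiply-adds per output pixel):

```python
import torch
import torch.nn as nn

in_ch, out_ch, k = 64, 128, 3

# Standard convolution: one k x k x in_ch filter per output channel
standard = nn.Conv2d(in_ch, out_ch, k, padding=1)

# Depthwise separable: per-channel k x k conv, then 1x1 pointwise conv to mix channels
depthwise_separable = nn.Sequential(
    nn.Conv2d(in_ch, in_ch, k, padding=1, groups=in_ch),  # depthwise
    nn.Conv2d(in_ch, out_ch, 1),                          # pointwise
)

x = torch.randn(1, in_ch, 56, 56)
print(standard(x).shape, depthwise_separable(x).shape)    # same output shape

std_ops = k * k * in_ch * out_ch           # standard conv
sep_ops = k * k * in_ch + in_ch * out_ch   # depthwise + pointwise
print(f"op reduction: ~{std_ops / sep_ops:.1f}x")         # ~8x for these sizes
```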
**Deployment Performance:**
```
Raspberry Pi 4 inference (224×224 image):
- ResNet-50: ~2000ms (unusable)
- ResNet-18: ~600ms (slow)
- MobileNetV2: ~150ms (acceptable)
- MobileNetV3-Large: ~80ms (good)
- MobileNetV3-Small: ~40ms (fast)
With INT8 quantization:
- MobileNetV3-Large: ~30ms (production-ready)
- MobileNetV3-Small: ~15ms (real-time)
```
**Code Example:**
```python
import torchvision.models as models
# For mobile deployment
model = models.mobilenet_v3_large(pretrained=True)
# For ultra-low latency (sacrifice accuracy)
model = models.mobilenet_v3_small(pretrained=True)
# Quantization for mobile (route to ml-production skill for details)
# Achieves 2-4x speedup with minimal accuracy loss
```
### 4. Inception Family (2014-2016) - Multi-Scale Features
**Architecture:** Multi-scale convolutions in parallel (inception modules)
**Variants:**
- InceptionV3: 24M params, 5.7 GFLOPs, 77.5% ImageNet
- InceptionV4: 42M params, 12.3 GFLOPs, 80.0% ImageNet
- Inception-ResNet: Hybrid with residual connections
**When to Use:**
-**Multi-scale features**: Objects at different sizes
-**Object detection**: Good backbone for detection
-**Historical interest**: Understanding multi-scale approaches
**When NOT to Use:**
-**Simplicity needed**: Complex architecture, hard to modify
-**Efficiency priority**: EfficientNet better
-**Modern projects**: Largely superseded by ResNet/EfficientNet
**Key Insight:** Parallel multi-scale convolutions (1×1, 3×3, 5×5) capture different receptive fields
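A simplified sketch of the parallel-branch idea (real Inception modules add 1×1 bottlenecks and a pooling branch; this only shows the multi-scale concatenation):

```python
import torch
import torch.nn as nn

class TinyInceptionBlock(nn.Module):
    """Simplified multi-scale block: parallel 1x1 / 3x3 / 5x5 branches, concatenated."""
    def __init__(self, in_ch: int, branch_ch: int):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, branch_ch, 1)
        self.b3 = nn.Conv2d(in_ch, branch_ch, 3, padding=1)
        self.b5 = nn.Conv2d(in_ch, branch_ch, 5, padding=2)

    def forward(self, x):
        # each branch sees a different receptive field; outputs stack along channels
        return torch.cat([self.b1(x), self.b3(x), self.b5(x)], dim=1)

x = torch.randn(1, 64, 28, 28)
print(TinyInceptionBlock(64, 32)(x).shape)  # torch.Size([1, 96, 28, 28])
```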
**Status:** Mostly historical - ResNet and EfficientNet have replaced Inception in practice
### 5. DenseNet Family (2017) - Dense Connections
**Architecture:** Every layer connects to every other layer (dense connections)
**Variants:**
- DenseNet-121: 8M params, 2.9 GFLOPs, 74.4% ImageNet
- DenseNet-169: 14M params, 3.4 GFLOPs, 75.6% ImageNet
- DenseNet-201: 20M params, 4.3 GFLOPs, 76.9% ImageNet
**When to Use:**
- **Parameter efficiency**: Good accuracy with few parameters
- **Feature reuse**: Dense connections enable feature reuse
- **Small datasets**: Better gradient flow helps with limited data
**When NOT to Use:**
- **Inference speed priority**: Dense connections slow (high memory bandwidth)
- **Training speed**: Slower to train than ResNet
- **Production deployment**: Less mature ecosystem than ResNet
**Key Insight:** Dense connections improve gradient flow, enable feature reuse, but slow inference
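A simplified sketch of a dense block (channel sizes are illustrative; real DenseNet layers add BN-ReLU and a 1×1 bottleneck before each 3×3 conv):

```python
import torch
import torch.nn as nn

class TinyDenseBlock(nn.Module):
    """Simplified dense block: each layer sees the concatenation of all earlier feature maps."""
    def __init__(self, in_ch: int, growth: int, num_layers: int):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Conv2d(in_ch + i * growth, growth, 3, padding=1) for i in range(num_layers)]
        )

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            # concatenation (vs. addition in ResNet) is what makes inference memory-heavy
            features.append(torch.relu(layer(torch.cat(features, dim=1))))
        return torch.cat(features, dim=1)

x = torch.randn(1, 64, 28, 28)
print(TinyDenseBlock(64, growth=32, num_layers=4)(x).shape)  # torch.Size([1, 192, 28, 28])
```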
**Status:** Theoretically elegant, but ResNet/EfficientNet more practical
### 6. VGG Family (2014) - Historical Baseline
**Architecture:** Very deep (16-19 layers), small 3×3 convolutions, many parameters
**Variants:**
- VGG-16: 138M params, 15.5 GFLOPs, 71.5% ImageNet
- VGG-19: 144M params, 19.6 GFLOPs, 71.1% ImageNet
**When to Use:**
- **DON'T use VGG for new projects**
- Historical understanding only
**Why NOT to Use:**
- Massive parameter count (138M vs ResNet-50's 25M)
- Poor accuracy for size
- Superseded by ResNet (2015)
**Key Insight:** Proved that depth matters, but skip connections (ResNet) are better
**Status:** **Obsolete** - use ResNet or EfficientNet instead
## Practical Selection Guide
### Scenario 1: Cloud/Server Deployment
**Goal:** Best accuracy, no compute constraints
**Recommendation:**
```
Small dataset (< 10k images):
→ EfficientNet-B0 or ResNet-18
(Avoid overfitting with smaller model)
Medium dataset (10k-100k images):
→ EfficientNet-B2 or ResNet-50
(Balanced accuracy and efficiency)
Large dataset (> 100k images):
→ EfficientNet-B4 or ResNet-101
(Can afford larger model)
Maximum accuracy (research):
→ EfficientNet-B7 or Vision Transformer
(If dataset > 1M images and compute unlimited)
```
### Scenario 2: Edge Deployment (Jetson, Coral TPU)
**Goal:** Optimize for edge hardware latency
**Recommendation:**
```
Real-time requirement (< 10ms):
→ MobileNetV3-Small or EfficientNet-Lite0
+ INT8 quantization
Medium latency (10-50ms):
→ MobileNetV3-Large or EfficientNet-Lite2
Relaxed latency (> 50ms):
→ EfficientNet-B0 or ResNet-18
```
**Critical:** Profile on actual edge hardware. Quantization is mandatory (route to ml-production).
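A minimal latency-profiling sketch to run on the target device itself (desktop numbers do not transfer to Jetson/Coral; wrap the timed loop with `torch.cuda.synchronize()` if the device runs CUDA):

```python
import time
import torch
import torchvision.models as models

model = models.mobilenet_v3_large(pretrained=True).eval()
x = torch.randn(1, 3, 224, 224)   # single image, typical edge batch size

with torch.no_grad():
    for _ in range(10):           # warm-up (caches, lazy initialization)
        model(x)
    runs = 50
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    latency_ms = (time.perf_counter() - start) / runs * 1000

print(f"mean latency: {latency_ms:.1f} ms")
```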
### Scenario 3: Mobile Deployment (iOS/Android)
**Goal:** On-device inference, minimal battery drain
**Recommendation:**
```
All mobile deployments:
→ MobileNetV3-Large (balanced)
→ MobileNetV3-Small (fastest, less accurate)
Always use:
- INT8 quantization (2-4x speedup)
- CoreML (iOS) or TFLite (Android) optimization
- Benchmark on target device before deploying
```
**Expected latency (iPhone 12, INT8 quantized):**
- MobileNetV3-Small: 5-10ms
- MobileNetV3-Large: 15-25ms
### Scenario 4: Object Detection
**Goal:** Select backbone for detection framework
**Recommendation:**
```
Faster R-CNN:
→ ResNet-50 + FPN (standard)
→ ResNet-101 + FPN (more accuracy)
YOLOv8:
→ CSPDarknet (built-in, optimized)
EfficientDet:
→ EfficientNet + BiFPN (best efficiency)
Custom detection:
→ ResNet or EfficientNet as backbone
→ Add Feature Pyramid Network (FPN) for multi-scale
```
**Note:** Detection adds significant compute on top of the backbone. Choose an efficient backbone.
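A minimal sketch using torchvision's built-in detectors, which already pair the backbone with an FPN (weights download on first use):

```python
import torch
from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn,
    fasterrcnn_mobilenet_v3_large_fpn,
)

# Standard server-side detector: ResNet-50 backbone + FPN neck
model = fasterrcnn_resnet50_fpn(pretrained=True).eval()

# Lighter option for constrained hardware: MobileNetV3-Large backbone + FPN
# model = fasterrcnn_mobilenet_v3_large_fpn(pretrained=True).eval()

images = [torch.rand(3, 480, 640)]  # list of CHW tensors scaled to [0, 1]
with torch.no_grad():
    predictions = model(images)
print(predictions[0].keys())        # dict_keys(['boxes', 'labels', 'scores'])
```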
### Scenario 5: Semantic Segmentation
**Goal:** Dense pixel-wise prediction
**Recommendation:**
```
U-Net style:
→ ResNet-18/34 as encoder (fast)
→ EfficientNet-B0 as encoder (efficient)
DeepLabV3:
→ ResNet-50 (standard)
→ MobileNetV3 (mobile deployment)
Key: Segmentation requires multi-scale features
→ Ensure backbone has skip connections or FPN
```
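A minimal sketch using torchvision's built-in segmentation heads (the `"out"` key holds per-pixel class logits; weights download on first use):

```python
import torch
from torchvision.models.segmentation import deeplabv3_resnet50, deeplabv3_mobilenet_v3_large

model = deeplabv3_resnet50(pretrained=True).eval()             # standard server choice
# model = deeplabv3_mobilenet_v3_large(pretrained=True).eval() # mobile/edge variant

x = torch.rand(1, 3, 512, 512)
with torch.no_grad():
    logits = model(x)["out"]   # dense per-pixel class scores
print(logits.shape)            # torch.Size([1, 21, 512, 512]) with the VOC-label weights
```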
## Trade-Off Analysis
### Accuracy vs Efficiency (Pareto Frontier)
**ImageNet Top-1 Accuracy vs FLOPs:**
```
Efficiency Winners (best accuracy per FLOP):
1. EfficientNet-B0: 77.3% @ 0.4 GFLOPs (best efficiency)
2. EfficientNet-B2: 80.3% @ 1.0 GFLOPs
3. EfficientNet-B4: 82.9% @ 4.2 GFLOPs
Accuracy Winners (best absolute accuracy):
1. EfficientNet-B7: 84.4% @ 37 GFLOPs
2. ViT-Large: 85.2% @ 190 GFLOPs (requires huge dataset)
3. ResNet-152: 78.3% @ 11.6 GFLOPs (dominated by EfficientNet)
Speed Winners (lowest latency):
1. MobileNetV3-Small: 67.4% @ 0.06 GFLOPs (50ms on mobile)
2. MobileNetV3-Large: 75.2% @ 0.2 GFLOPs (100ms on mobile)
3. EfficientNet-Lite0: 75.0% @ 0.4 GFLOPs
```
**Key Takeaway:** EfficientNet dominates ResNet on Pareto frontier (better accuracy at same compute).
### Parameters vs Accuracy
**For same ~75% ImageNet accuracy:**
```
VGG-16: 138M params (❌ terrible efficiency)
ResNet-50: 25M params
EfficientNet-B0: 5M params (✅ 5x fewer parameters!)
MobileNetV3-Large: 5M params (fast inference)
```
**Conclusion:** Modern architectures (EfficientNet, MobileNet) achieve same accuracy with far fewer parameters.
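To verify parameter counts for any candidate yourself (a quick sketch using standard torchvision constructors):

```python
import torchvision.models as models

def params_millions(model) -> float:
    return sum(p.numel() for p in model.parameters()) / 1e6

for name, ctor in [
    ("vgg16", models.vgg16),
    ("resnet50", models.resnet50),
    ("mobilenet_v3_large", models.mobilenet_v3_large),
]:
    print(f"{name}: {params_millions(ctor()):.1f}M params")
# Expect roughly 138M, 26M, and 5.5M respectively
```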
## Common Pitfalls
### Pitfall 1: Defaulting to ResNet-50
**Symptom:** Using ResNet-50 without considering alternatives
**Why it's wrong:** EfficientNet-B0 matches ResNet-50 accuracy with 10x less compute
**Fix:** Consider EfficientNet family first (better efficiency)
### Pitfall 2: Choosing Large Model for Small Dataset
**Symptom:** Using ResNet-101 with < 10k images
**Why it's wrong:** Model will overfit (too many parameters for data)
**Fix:**
- < 10k images → ResNet-18 or EfficientNet-B0
- 10k-100k → ResNet-50 or EfficientNet-B2
- > 100k → Can use larger models
### Pitfall 3: Using Desktop Model on Mobile
**Symptom:** Trying to run ResNet-50 on mobile device
**Why it's wrong:** 2000ms inference time is unusable
**Fix:** Use MobileNetV3 + quantization for mobile (15-30ms)
### Pitfall 4: Ignoring Task Type
**Symptom:** Using standard CNN for object detection without FPN
**Why it's wrong:** Detection needs multi-scale features
**Fix:** Use detection-specific frameworks (YOLOv8, Faster R-CNN) with appropriate backbone
### Pitfall 5: Believing "Bigger = Better"
**Symptom:** Choosing ResNet-152 over ResNet-50 without justification
**Why it's wrong:** Diminishing returns - roughly 3x the compute for about 2% more accuracy (76.1% → 78.3%), plus overfitting risk on small datasets
**Fix:** Match model capacity to dataset size, consider efficiency
## Evolution and Historical Context
**Why CNNs evolved the way they did:**
```
2012: AlexNet
→ Proved deep learning works for vision
→ 8 layers, 60M params
2014: VGG
→ Deeper is better (16-19 layers)
→ But: 138M params (too many)
2014: Inception/GoogLeNet
→ Multi-scale convolutions
→ More efficient than VGG
2015: ResNet ★
→ Skip connections enable very deep networks (152 layers)
→ Solved vanishing gradient problem
→ Became standard baseline
2017: MobileNet
→ Mobile deployment needs
→ Depthwise separable convolutions (9x fewer ops)
2017: DenseNet
→ Dense connections for feature reuse
→ Parameter efficient but slow inference
2019: EfficientNet ★
→ Compound scaling (depth + width + resolution)
→ Neural architecture search
→ Dominates Pareto frontier (best accuracy per FLOP)
→ New standard for efficiency
2020: Vision Transformer
→ Attention-based (no convolutions)
→ Requires very large datasets (> 1M images)
→ For research/large-scale applications
```
**Current Recommendations (2025):**
- Cloud: **EfficientNet** (best efficiency) or ResNet (simplicity)
- Edge: **EfficientNet-Lite** or MobileNetV3
- Mobile: **MobileNetV3** + quantization
- Detection: **EfficientDet** or YOLOv8
- Baseline: **ResNet** (simple, well-tested)
## Decision Checklist
Before choosing CNN, answer these:
```
☐ Deployment target? (cloud/edge/mobile)
☐ Latency requirement? (< 10ms / 10-100ms / > 100ms)
☐ Model size budget? (< 10M / 10-50M / unlimited params)
☐ Dataset size? (< 10k / 10k-100k / > 100k images)
☐ Accuracy priority? (maximum / production / fast iteration)
☐ Task type? (classification / detection / segmentation)
☐ Efficiency matters? (yes → EfficientNet, no → flexibility)
Based on answers:
→ Mobile → MobileNetV3
→ Edge → EfficientNet-Lite or MobileNetV3
→ Cloud + efficiency → EfficientNet
→ Cloud + simplicity → ResNet
→ Maximum accuracy → EfficientNet-B7 or ViT
→ Small dataset → Small models (ResNet-18, EfficientNet-B0)
```
## Integration with Other Skills
**After selecting CNN architecture:**
**Training the model:**
`yzmir/training-optimization/using-training-optimization`
- Optimizer selection (Adam, SGD, AdamW)
- Learning rate schedules
- Data augmentation
**Implementing in PyTorch:**
`yzmir/pytorch-engineering/using-pytorch-engineering`
- Custom modifications to pre-trained models
- Multi-GPU training
- Performance optimization
**Deploying to production:**
`yzmir/ml-production/using-ml-production`
- Quantization (INT8, FP16)
- Model serving (TorchServe, ONNX)
- Optimization for edge/mobile (TFLite, CoreML)
**If architecture is unstable (very deep):**
`yzmir/neural-architectures/normalization-techniques`
- Normalization layers (BatchNorm, LayerNorm)
- Skip connections
- Initialization strategies
## Summary
**CNN Selection in One Table:**
| Scenario | Recommendation | Why |
|----------|----------------|-----|
| Cloud, balanced | EfficientNet-B2 | Best efficiency, 80% accuracy |
| Cloud, max accuracy | EfficientNet-B4 | 83% accuracy, reasonable compute |
| Cloud, simple baseline | ResNet-50 | Well-tested, widely used |
| Edge device | MobileNetV3-Large | Optimized for edge, 75% accuracy |
| Mobile app | MobileNetV3-Small + quantization | < 20ms inference |
| Small dataset (< 10k) | ResNet-18 or EfficientNet-B0 | Avoid overfitting |
| Object detection | ResNet-50 + FPN, EfficientDet | Multi-scale features |
| Segmentation | ResNet + U-Net, DeepLabV3 | Dense prediction |
**Key Principles:**
1. **Match model capacity to dataset size** (small data → small model)
2. **EfficientNet dominates ResNet on efficiency** (same accuracy, less compute)
3. **Mobile needs mobile-specific architectures** (MobileNet, quantization)
4. **Task type matters** (detection/segmentation need multi-scale features)
5. **Bigger ≠ always better** (diminishing returns, overfitting risk)
**When in doubt:** Start with **EfficientNet-B2** (cloud) or **MobileNetV3-Large** (edge/mobile).
**END OF SKILL**