
CNN Families and Selection: Choosing the Right Convolutional Network

<CRITICAL_CONTEXT> CNNs are the foundation of computer vision. Different families have vastly different trade-offs:

  • Accuracy vs Speed vs Size
  • Dataset size requirements
  • Deployment target (cloud vs edge vs mobile)
  • Task type (classification vs detection vs segmentation)

This skill helps you choose the RIGHT CNN for YOUR constraints. </CRITICAL_CONTEXT>

When to Use This Skill

Use this skill when:

  • Selecting CNN for vision task (classification, detection, segmentation)
  • Comparing CNN families (ResNet vs EfficientNet vs MobileNet)
  • Optimizing for specific constraints (latency, size, accuracy)
  • Understanding CNN evolution (why newer architectures exist)
  • Deployment-specific selection (cloud, edge, mobile)

DO NOT use for:

  • Non-vision tasks (use sequence-models-comparison or other skills)
  • Training optimization (use training-optimization pack)
  • Implementation details (use pytorch-engineering pack)

When in doubt: If choosing WHICH CNN → this skill. If implementing/training CNN → other skills.

Selection Framework

Step 1: Identify Constraints

Before recommending ANY architecture, ask:

| Constraint | Question | Impact |
|---|---|---|
| Deployment | Where will the model run? | Cloud → any; Edge → MobileNet/EfficientNet-Lite; Mobile → MobileNetV3 |
| Latency | Speed requirement? | Real-time (< 10ms) → MobileNet; Batch (> 100ms) → any |
| Model Size | Parameter/memory budget? | < 10M params → MobileNet; < 50M → ResNet/EfficientNet; unconstrained → large models OK |
| Dataset Size | How many training images? | < 10k → small models; 10k-100k → medium; > 100k → large |
| Accuracy | Required accuracy? | Competitive → EfficientNet-B4+; Production → ResNet-50/EfficientNet-B2 |
| Task Type | Classification/detection/segmentation? | Detection → FPN-compatible backbone; Segmentation → multi-scale features |

Critical: Get answers to these BEFORE recommending architecture.

Step 2: Apply Decision Tree

START: What's your primary constraint?

┌─ DEPLOYMENT TARGET
│  ├─ Cloud / Server
│  │  └─ Dataset size?
│  │     ├─ Small (< 10k) → ResNet-18, EfficientNet-B0
│  │     ├─ Medium (10k-100k) → ResNet-50, EfficientNet-B2
│  │     └─ Large (> 100k) → ResNet-101, EfficientNet-B4, ViT
│  │
│  ├─ Edge Device (Jetson, Coral)
│  │  └─ Latency requirement?
│  │     ├─ Real-time (< 10ms) → MobileNetV3-Small, EfficientNet-Lite0
│  │     ├─ Medium (10-50ms) → MobileNetV3-Large, EfficientNet-Lite2
│  │     └─ Relaxed (> 50ms) → EfficientNet-B0, ResNet-18
│  │
│  └─ Mobile (iOS/Android)
│     └─ MobileNetV3-Small (fastest), MobileNetV3-Large (balanced)
│        + INT8 quantization (route to ml-production)
│
├─ ACCURACY PRIORITY (cloud deployment assumed)
│  ├─ Maximum accuracy → EfficientNet-B7, ResNet-152, ViT-Large
│  ├─ Balanced → EfficientNet-B2/B3, ResNet-50
│  └─ Fast training → ResNet-18, EfficientNet-B0
│
├─ EFFICIENCY PRIORITY
│  └─ Best accuracy per FLOP → EfficientNet family (B0-B7)
│     (EfficientNet dominates ResNet on Pareto frontier)
│
└─ TASK TYPE
   ├─ Classification → Any CNN (use constraint-based selection above)
   ├─ Object Detection → ResNet + FPN, EfficientDet, YOLOv8 (CSPDarknet)
   └─ Segmentation → ResNet + U-Net, EfficientNet + DeepLabV3
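
For quick triage, the same tree can be sketched as a small helper function. This is illustrative only; the thresholds and names mirror the tree above, not any published API:

def recommend_cnn(deployment, dataset_size=0, latency_ms=None):
    """Rough triage mirroring the decision tree above (illustrative, not exhaustive)."""
    if deployment == "mobile":
        return ["MobileNetV3-Large", "MobileNetV3-Small"]  # + INT8 quantization
    if deployment == "edge":
        if latency_ms is not None and latency_ms < 10:
            return ["MobileNetV3-Small", "EfficientNet-Lite0"]
        if latency_ms is not None and latency_ms <= 50:
            return ["MobileNetV3-Large", "EfficientNet-Lite2"]
        return ["EfficientNet-B0", "ResNet-18"]
    # Cloud / server: choose by dataset size
    if dataset_size < 10_000:
        return ["ResNet-18", "EfficientNet-B0"]
    if dataset_size <= 100_000:
        return ["ResNet-50", "EfficientNet-B2"]
    return ["ResNet-101", "EfficientNet-B4", "ViT"]

print(recommend_cnn("edge", latency_ms=30))  # ['MobileNetV3-Large', 'EfficientNet-Lite2']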

CNN Family Catalog

1. ResNet Family (2015) - The Standard Baseline

Architecture: Residual connections (skip connections) enable very deep networks

Variants:

  • ResNet-18: 11M params, 1.8 GFLOPs, 69.8% ImageNet
  • ResNet-34: 22M params, 3.7 GFLOPs, 73.3% ImageNet
  • ResNet-50: 25M params, 4.1 GFLOPs, 76.1% ImageNet
  • ResNet-101: 44M params, 7.8 GFLOPs, 77.4% ImageNet
  • ResNet-152: 60M params, 11.6 GFLOPs, 78.3% ImageNet

When to Use:

  • Baseline choice: Well-tested, widely supported
  • Transfer learning: Excellent pre-trained weights available
  • Object detection: Standard backbone for Faster R-CNN, Mask R-CNN
  • Interpretability: Simple architecture, easy to understand

When NOT to Use:

  • Edge/mobile deployment: Too large and slow
  • Efficiency priority: EfficientNet beats ResNet on accuracy/FLOP
  • Small datasets (< 10k): Use ResNet-18, not ResNet-50+

Key Insight: Skip connections solve vanishing gradient, enable depth
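
To make the insight concrete, here is a minimal residual block sketch in PyTorch (simplified from torchvision's BasicBlock; it assumes input and output shapes match, so the identity shortcut applies directly):

import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: output = ReLU(F(x) + x)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # skip connection: gradients flow directly through "+ x"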

Code Example:

import torchvision.models as models

# For cloud/server (good dataset)
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

# For small datasets or faster training
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# For maximum accuracy (cloud only)
model = models.resnet101(weights=models.ResNet101_Weights.DEFAULT)

Note: weights=... replaces the deprecated pretrained=True argument (torchvision >= 0.13).

2. EfficientNet Family (2019) - Best Efficiency

Architecture: Compound scaling (depth + width + resolution) optimized via neural architecture search

Variants:

  • EfficientNet-B0: 5M params, 0.4 GFLOPs, 77.3% ImageNet
  • EfficientNet-B1: 8M params, 0.7 GFLOPs, 79.2% ImageNet
  • EfficientNet-B2: 9M params, 1.0 GFLOPs, 80.3% ImageNet
  • EfficientNet-B3: 12M params, 1.8 GFLOPs, 81.7% ImageNet
  • EfficientNet-B4: 19M params, 4.2 GFLOPs, 82.9% ImageNet
  • EfficientNet-B7: 66M params, 37 GFLOPs, 84.4% ImageNet

When to Use:

  • Efficiency matters: Best accuracy per FLOP/parameter
  • Cloud deployment: B2-B4 sweet spot for production
  • Limited compute: B0 matches ResNet-50 accuracy at 10x fewer FLOPs
  • Scaling needs: Want to scale model up/down systematically

When NOT to Use:

  • Real-time mobile: Use MobileNet (EfficientNet has more layers)
  • Very small datasets: Can overfit despite efficiency
  • Simplicity needed: More complex than ResNet

Key Insight: Compound scaling balances depth, width, and resolution optimally
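
As a worked example: the EfficientNet paper fixes α ≈ 1.2 (depth), β ≈ 1.1 (width), γ ≈ 1.15 (resolution) with α·β²·γ² ≈ 2, then scales all three by a shared exponent φ. The sketch below shows the scaling arithmetic only; the mapping of φ to B-variants is approximate, since the released models also hand-tune resolutions:

# Compound scaling sketch: depth, width, and resolution grow together.
alpha, beta, gamma = 1.2, 1.1, 1.15  # alpha * beta**2 * gamma**2 ~= 2, so FLOPs ~ 2**phi

def scale(phi, base_res=224):
    depth = alpha ** phi                 # multiplier on number of layers
    width = beta ** phi                  # multiplier on channels per layer
    res = round(base_res * gamma ** phi) # input image resolution
    return depth, width, res

for phi in (0, 1, 2, 3, 4):  # roughly B0 -> B4
    d, w, r = scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution ~{r}px")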

Efficiency Comparison:

Same accuracy as ResNet-50 (76%):
- ResNet-50: 25M params, 4.1 GFLOPs
- EfficientNet-B0: 5M params, 0.4 GFLOPs (10x more efficient!)

Better accuracy (82.9%):
- ResNet-152: 60M params, 11.6 GFLOPs → 78.3% ImageNet
- EfficientNet-B4: 19M params, 4.2 GFLOPs → 82.9% ImageNet
  (Better accuracy with 3x fewer params and 3x less compute)

Code Example:

import timm  # PyTorch Image Models library

# Balanced choice (production)
model = timm.create_model('efficientnet_b2', pretrained=True)

# Efficiency priority (edge)
model = timm.create_model('efficientnet_b0', pretrained=True)

# Accuracy priority (research)
model = timm.create_model('efficientnet_b4', pretrained=True)

3. MobileNet Family (2017-2019) - Mobile Optimized

Architecture: Depthwise separable convolutions (drastically reduce compute)

Variants:

  • MobileNetV1: 4.2M params, 0.6 GFLOPs, 70.6% ImageNet
  • MobileNetV2: 3.5M params, 0.3 GFLOPs, 72.0% ImageNet
  • MobileNetV3-Small: 2.5M params, 0.06 GFLOPs, 67.4% ImageNet
  • MobileNetV3-Large: 5.4M params, 0.2 GFLOPs, 75.2% ImageNet

When to Use:

  • Mobile deployment: iOS/Android apps
  • Edge devices: Raspberry Pi, Jetson Nano
  • Real-time inference: < 100ms latency
  • Extreme efficiency: < 10M parameters budget

When NOT to Use:

  • Cloud deployment with no constraints: EfficientNet or ResNet better accuracy
  • Accuracy priority: Sacrifices accuracy for speed
  • Large datasets with compute: Can afford better models

Key Insight: Depthwise separable convolutions = standard conv split into depthwise + pointwise (9x fewer operations)
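
A minimal sketch of that factorization in PyTorch: the standard 3×3 convolution is replaced by a 3×3 depthwise convolution (groups=in_channels) followed by a 1×1 pointwise convolution:

import torch.nn as nn

in_ch, out_ch = 64, 128

# Standard 3x3 convolution
standard = nn.Conv2d(in_ch, out_ch, 3, padding=1)

# Depthwise separable: depthwise 3x3 (one filter per channel) + pointwise 1x1
separable = nn.Sequential(
    nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch),  # depthwise
    nn.Conv2d(in_ch, out_ch, 1),                          # pointwise
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard), count(separable))  # ~73.9k vs ~9.0k parameters (~8x fewer)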

Deployment Performance:

Raspberry Pi 4 inference (224×224 image):
- ResNet-50: ~2000ms (unusable)
- ResNet-18: ~600ms (slow)
- MobileNetV2: ~150ms (acceptable)
- MobileNetV3-Large: ~80ms (good)
- MobileNetV3-Small: ~40ms (fast)

With INT8 quantization:
- MobileNetV3-Large: ~30ms (production-ready)
- MobileNetV3-Small: ~15ms (real-time)

Code Example:

import torchvision.models as models

# For mobile deployment (balanced accuracy/speed)
model = models.mobilenet_v3_large(weights=models.MobileNet_V3_Large_Weights.DEFAULT)

# For ultra-low latency (sacrifices some accuracy)
model = models.mobilenet_v3_small(weights=models.MobileNet_V3_Small_Weights.DEFAULT)

# Quantization for mobile (route to ml-production skill for details)
# Achieves 2-4x speedup with minimal accuracy loss
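
As a quick INT8 starting point, torchvision (>= 0.13) also ships pre-quantized variants; a minimal sketch (full quantization workflows are covered in ml-production):

from torchvision.models import quantization as qmodels

# Pre-quantized INT8 MobileNetV3-Large for CPU inference
model = qmodels.mobilenet_v3_large(
    weights=qmodels.MobileNet_V3_Large_QuantizedWeights.DEFAULT,
    quantize=True,
)
model.eval()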

4. Inception Family (2014-2016) - Multi-Scale Features

Architecture: Multi-scale convolutions in parallel (inception modules)

Variants:

  • InceptionV3: 24M params, 5.7 GFLOPs, 77.5% ImageNet
  • InceptionV4: 42M params, 12.3 GFLOPs, 80.0% ImageNet
  • Inception-ResNet: Hybrid with residual connections

When to Use:

  • Multi-scale features: Objects at different sizes
  • Object detection: Good backbone for detection
  • Historical interest: Understanding multi-scale approaches

When NOT to Use:

  • Simplicity needed: Complex architecture, hard to modify
  • Efficiency priority: EfficientNet better
  • Modern projects: Largely superseded by ResNet/EfficientNet

Key Insight: Parallel multi-scale convolutions (1×1, 3×3, 5×5) capture different receptive fields

Status: Mostly historical - ResNet and EfficientNet have replaced Inception in practice

5. DenseNet Family (2017) - Dense Connections

Architecture: Every layer connects to every other layer (dense connections)

Variants:

  • DenseNet-121: 8M params, 2.9 GFLOPs, 74.4% ImageNet
  • DenseNet-169: 14M params, 3.4 GFLOPs, 75.6% ImageNet
  • DenseNet-201: 20M params, 4.3 GFLOPs, 76.9% ImageNet

When to Use:

  • Parameter efficiency: Good accuracy with few parameters
  • Feature reuse: Dense connections enable feature reuse
  • Small datasets: Better gradient flow helps with limited data

When NOT to Use:

  • Inference speed priority: Dense connections slow (high memory bandwidth)
  • Training speed: Slower to train than ResNet
  • Production deployment: Less mature ecosystem than ResNet

Key Insight: Dense connections improve gradient flow, enable feature reuse, but slow inference

Status: Theoretically elegant, but ResNet/EfficientNet more practical

6. VGG Family (2014) - Historical Baseline

Architecture: Very deep (16-19 layers), small 3×3 convolutions, many parameters

Variants:

  • VGG-16: 138M params, 15.5 GFLOPs, 71.5% ImageNet
  • VGG-19: 144M params, 19.6 GFLOPs, 71.1% ImageNet

When to Use:

  • DON'T use VGG for new projects
  • Historical understanding only

Why NOT to Use:

  • Massive parameter count (138M vs ResNet-50's 25M)
  • Poor accuracy for size
  • Superseded by ResNet (2015)

Key Insight: Proved that depth matters, but skip connections (ResNet) are better

Status: Obsolete - use ResNet or EfficientNet instead

Practical Selection Guide

Scenario 1: Cloud/Server Deployment

Goal: Best accuracy, no compute constraints

Recommendation:

Small dataset (< 10k images):
→ EfficientNet-B0 or ResNet-18
  (Avoid overfitting with smaller model)

Medium dataset (10k-100k images):
→ EfficientNet-B2 or ResNet-50
  (Balanced accuracy and efficiency)

Large dataset (> 100k images):
→ EfficientNet-B4 or ResNet-101
  (Can afford larger model)

Maximum accuracy (research):
→ EfficientNet-B7 or Vision Transformer
  (If dataset > 1M images and compute unlimited)

Scenario 2: Edge Deployment (Jetson, Coral TPU)

Goal: Optimize for edge hardware latency

Recommendation:

Real-time requirement (< 10ms):
→ MobileNetV3-Small or EfficientNet-Lite0
  + INT8 quantization

Medium latency (10-50ms):
→ MobileNetV3-Large or EfficientNet-Lite2

Relaxed latency (> 50ms):
→ EfficientNet-B0 or ResNet-18

Critical: Profile on actual edge hardware. Quantization is mandatory (route to ml-production).

Scenario 3: Mobile Deployment (iOS/Android)

Goal: On-device inference, minimal battery drain

Recommendation:

All mobile deployments:
→ MobileNetV3-Large (balanced)
→ MobileNetV3-Small (fastest, less accurate)

Always use:
- INT8 quantization (2-4x speedup)
- CoreML (iOS) or TFLite (Android) optimization
- Benchmark on target device before deploying

Expected latency (iPhone 12, INT8 quantized):

  • MobileNetV3-Small: 5-10ms
  • MobileNetV3-Large: 15-25ms
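
Export paths differ by platform, but a common first step is exporting to ONNX before converting with CoreML or TFLite tooling. A minimal sketch (the output filename is illustrative):

import torch
import torchvision.models as models

model = models.mobilenet_v3_small(weights=models.MobileNet_V3_Small_Weights.DEFAULT)
model.eval()

dummy = torch.randn(1, 3, 224, 224)  # NCHW dummy input used for tracing
torch.onnx.export(model, dummy, "mobilenet_v3_small.onnx", opset_version=13)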

Scenario 4: Object Detection

Goal: Select backbone for detection framework

Recommendation:

Faster R-CNN:
→ ResNet-50 + FPN (standard)
→ ResNet-101 + FPN (more accuracy)

YOLOv8:
→ CSPDarknet (built-in, optimized)

EfficientDet:
→ EfficientNet + BiFPN (best efficiency)

Custom detection:
→ ResNet or EfficientNet as backbone
→ Add Feature Pyramid Network (FPN) for multi-scale

Note: Detection adds significant compute on top of backbone. Choose efficient backbone.
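
As one concrete pairing, torchvision ships Faster R-CNN with a ResNet-50 + FPN backbone out of the box; a minimal inference sketch:

import torch
from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn, FasterRCNN_ResNet50_FPN_Weights,
)

model = fasterrcnn_resnet50_fpn(weights=FasterRCNN_ResNet50_FPN_Weights.DEFAULT)
model.eval()

images = [torch.randn(3, 480, 640)]  # list of CHW tensors (dummy input here)
predictions = model(images)          # per-image dicts of boxes, labels, scores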

Scenario 5: Semantic Segmentation

Goal: Dense pixel-wise prediction

Recommendation:

U-Net style:
→ ResNet-18/34 as encoder (fast)
→ EfficientNet-B0 as encoder (efficient)

DeepLabV3:
→ ResNet-50 (standard)
→ MobileNetV3 (mobile deployment)

Key: Segmentation requires multi-scale features
→ Ensure backbone has skip connections or FPN
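
Both DeepLabV3 pairings above are available directly in torchvision; a minimal sketch:

from torchvision.models.segmentation import (
    deeplabv3_resnet50, deeplabv3_mobilenet_v3_large,
)

# Standard DeepLabV3 with ResNet-50 backbone
model = deeplabv3_resnet50(weights="DEFAULT")

# Mobile-friendly variant for on-device segmentation
mobile_model = deeplabv3_mobilenet_v3_large(weights="DEFAULT")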

Trade-Off Analysis

Accuracy vs Efficiency (Pareto Frontier)

ImageNet Top-1 Accuracy vs FLOPs:

Efficiency Winners (best accuracy per FLOP):
1. EfficientNet-B0: 77.3% @ 0.4 GFLOPs (best efficiency)
2. EfficientNet-B2: 80.3% @ 1.0 GFLOPs
3. EfficientNet-B4: 82.9% @ 4.2 GFLOPs

Accuracy Winners (best absolute accuracy):
1. EfficientNet-B7: 84.4% @ 37 GFLOPs
2. ViT-Large: 85.2% @ 190 GFLOPs (requires huge dataset)
3. ResNet-152: 78.3% @ 11.6 GFLOPs (dominated by EfficientNet)

Speed Winners (lowest latency):
1. MobileNetV3-Small: 67.4% @ 0.06 GFLOPs
2. MobileNetV3-Large: 75.2% @ 0.2 GFLOPs
3. EfficientNet-Lite0: 75.0% @ 0.4 GFLOPs
   (Absolute latency varies widely with device and quantization; see the per-device numbers in the deployment scenarios above.)

Key Takeaway: EfficientNet dominates ResNet on Pareto frontier (better accuracy at same compute).

Parameters vs Accuracy

For same ~75% ImageNet accuracy:

VGG-16:           138M params (❌ terrible efficiency)
ResNet-50:         25M params
EfficientNet-B0:    5M params (✅ 5x fewer parameters!)
MobileNetV3-Large:  5M params (fast inference)

Conclusion: Modern architectures (EfficientNet, MobileNet) achieve same accuracy with far fewer parameters.
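
These counts are easy to verify directly (a quick check; assumes torchvision and timm are installed):

import timm
import torchvision.models as models

count = lambda m: sum(p.numel() for p in m.parameters())

print(count(models.vgg16()) / 1e6)                        # ~138M
print(count(models.resnet50()) / 1e6)                     # ~25.6M
print(count(timm.create_model("efficientnet_b0")) / 1e6)  # ~5.3M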

Common Pitfalls

Pitfall 1: Defaulting to ResNet-50

Symptom: Using ResNet-50 without considering alternatives

Why it's wrong: EfficientNet-B0 matches ResNet-50 accuracy with 10x less compute

Fix: Consider EfficientNet family first (better efficiency)

Pitfall 2: Choosing Large Model for Small Dataset

Symptom: Using ResNet-101 with < 10k images

Why it's wrong: Model will overfit (too many parameters for data)

Fix:

  • < 10k images → ResNet-18 or EfficientNet-B0
  • 10k-100k → ResNet-50 or EfficientNet-B2
  • > 100k → Can use larger models
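
For the small-dataset case, a common companion pattern is to freeze the pretrained backbone and train only a new head; a minimal sketch (num_classes is a placeholder for your task):

import torch.nn as nn
import torchvision.models as models

num_classes = 10  # placeholder: your task's class count
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

for p in model.parameters():  # freeze the pretrained backbone
    p.requires_grad = False

model.fc = nn.Linear(model.fc.in_features, num_classes)  # new trainable head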

Pitfall 3: Using Desktop Model on Mobile

Symptom: Trying to run ResNet-50 on mobile device

Why it's wrong: 2000ms inference time is unusable

Fix: Use MobileNetV3 + quantization for mobile (15-30ms)

Pitfall 4: Ignoring Task Type

Symptom: Using standard CNN for object detection without FPN

Why it's wrong: Detection needs multi-scale features

Fix: Use detection-specific frameworks (YOLOv8, Faster R-CNN) with appropriate backbone

Pitfall 5: Believing "Bigger = Better"

Symptom: Choosing ResNet-152 over ResNet-50 without justification

Why it's wrong: Diminishing returns (≈3x compute for ≈2% accuracy, per the ResNet numbers above), plus higher overfitting risk on smaller datasets

Fix: Match model capacity to dataset size, consider efficiency

Evolution and Historical Context

Why CNNs evolved the way they did:

2012: AlexNet
→ Proved deep learning works for vision
→ 8 layers, 60M params

2014: VGG
→ Deeper is better (16-19 layers)
→ But: 138M params (too many)

2014: Inception/GoogLeNet
→ Multi-scale convolutions
→ More efficient than VGG

2015: ResNet ★
→ Skip connections enable very deep networks (152 layers)
→ Solved vanishing gradient problem
→ Became standard baseline

2017: MobileNet
→ Mobile deployment needs
→ Depthwise separable convolutions (9x fewer ops)

2017: DenseNet
→ Dense connections for feature reuse
→ Parameter efficient but slow inference

2019: EfficientNet ★
→ Compound scaling (depth + width + resolution)
→ Neural architecture search
→ Dominates Pareto frontier (best accuracy per FLOP)
→ New standard for efficiency

2020: Vision Transformer
→ Attention-based (no convolutions)
→ Requires very large datasets (> 1M images)
→ For research/large-scale applications

Current Recommendations (2025):

  • Cloud: EfficientNet (best efficiency) or ResNet (simplicity)
  • Edge: EfficientNet-Lite or MobileNetV3
  • Mobile: MobileNetV3 + quantization
  • Detection: EfficientDet or YOLOv8
  • Baseline: ResNet (simple, well-tested)

Decision Checklist

Before choosing CNN, answer these:

☐ Deployment target? (cloud/edge/mobile)
☐ Latency requirement? (< 10ms / 10-100ms / > 100ms)
☐ Model size budget? (< 10M / 10-50M / unlimited params)
☐ Dataset size? (< 10k / 10k-100k / > 100k images)
☐ Accuracy priority? (maximum / production / fast iteration)
☐ Task type? (classification / detection / segmentation)
☐ Efficiency matters? (yes → EfficientNet, no → flexibility)

Based on answers:
→ Mobile → MobileNetV3
→ Edge → EfficientNet-Lite or MobileNetV3
→ Cloud + efficiency → EfficientNet
→ Cloud + simplicity → ResNet
→ Maximum accuracy → EfficientNet-B7 or ViT
→ Small dataset → Small models (ResNet-18, EfficientNet-B0)

Integration with Other Skills

After selecting CNN architecture:

Training the model: yzmir/training-optimization/using-training-optimization

  • Optimizer selection (Adam, SGD, AdamW)
  • Learning rate schedules
  • Data augmentation

Implementing in PyTorch: yzmir/pytorch-engineering/using-pytorch-engineering

  • Custom modifications to pre-trained models
  • Multi-GPU training
  • Performance optimization

Deploying to production: yzmir/ml-production/using-ml-production

  • Quantization (INT8, FP16)
  • Model serving (TorchServe, ONNX)
  • Optimization for edge/mobile (TFLite, CoreML)

If architecture is unstable (very deep): yzmir/neural-architectures/normalization-techniques

  • Normalization layers (BatchNorm, LayerNorm)
  • Skip connections
  • Initialization strategies

Summary

CNN Selection in One Table:

| Scenario | Recommendation | Why |
|---|---|---|
| Cloud, balanced | EfficientNet-B2 | Best efficiency, ~80% ImageNet accuracy |
| Cloud, max accuracy | EfficientNet-B4 | ~83% accuracy at reasonable compute |
| Cloud, simple baseline | ResNet-50 | Well-tested, widely used |
| Edge device | MobileNetV3-Large | Optimized for edge, ~75% accuracy |
| Mobile app | MobileNetV3-Small + quantization | < 20ms inference |
| Small dataset (< 10k) | ResNet-18 or EfficientNet-B0 | Avoids overfitting |
| Object detection | ResNet-50 + FPN, EfficientDet | Multi-scale features |
| Segmentation | ResNet + U-Net, DeepLabV3 | Dense prediction |

Key Principles:

  1. Match model capacity to dataset size (small data → small model)
  2. EfficientNet dominates ResNet on efficiency (same accuracy, less compute)
  3. Mobile needs mobile-specific architectures (MobileNet, quantization)
  4. Task type matters (detection/segmentation need multi-scale features)
  5. Bigger ≠ always better (diminishing returns, overfitting risk)

When in doubt: Start with EfficientNet-B2 (cloud) or MobileNetV3-Large (edge/mobile).

END OF SKILL