# CNN Families and Selection: Choosing the Right Convolutional Network

CNNs are the foundation of computer vision. Different families have vastly different trade-offs:

- Accuracy vs speed vs size
- Dataset size requirements
- Deployment target (cloud vs edge vs mobile)
- Task type (classification vs detection vs segmentation)

This skill helps you choose the RIGHT CNN for YOUR constraints.

## When to Use This Skill

Use this skill when:

- ✅ Selecting a CNN for a vision task (classification, detection, segmentation)
- ✅ Comparing CNN families (ResNet vs EfficientNet vs MobileNet)
- ✅ Optimizing for specific constraints (latency, size, accuracy)
- ✅ Understanding CNN evolution (why newer architectures exist)
- ✅ Deployment-specific selection (cloud, edge, mobile)

DO NOT use for:

- ❌ Non-vision tasks (use sequence-models-comparison or other skills)
- ❌ Training optimization (use training-optimization pack)
- ❌ Implementation details (use pytorch-engineering pack)

**When in doubt:** If choosing WHICH CNN → this skill. If implementing/training a CNN → other skills.

## Selection Framework

### Step 1: Identify Constraints

**Before recommending ANY architecture, ask:**

| Constraint | Question | Impact |
|------------|----------|--------|
| **Deployment** | Where will the model run? | Cloud → any, Edge → MobileNet/EfficientNet-Lite, Mobile → MobileNetV3 |
| **Latency** | Speed requirement? | Real-time (< 10ms) → MobileNet, Batch (> 100ms) → any |
| **Model Size** | Parameter/memory budget? | < 10M params → MobileNet, < 50M → ResNet/EfficientNet, Unlimited → large models OK |
| **Dataset Size** | Training images? | < 10k → small models, 10k-100k → medium, > 100k → large |
| **Accuracy** | Required accuracy? | Competitive → EfficientNet-B4+, Production → ResNet-50/EfficientNet-B2 |
| **Task Type** | Classification/detection/segmentation? | Detection → FPN-compatible, Segmentation → multi-scale |

**Critical:** Get answers to these BEFORE recommending an architecture.

### Step 2: Apply Decision Tree

```
START: What's your primary constraint?

┌─ DEPLOYMENT TARGET
│  ├─ Cloud / Server
│  │  └─ Dataset size?
│  │     ├─ Small (< 10k) → ResNet-18, EfficientNet-B0
│  │     ├─ Medium (10k-100k) → ResNet-50, EfficientNet-B2
│  │     └─ Large (> 100k) → ResNet-101, EfficientNet-B4, ViT
│  │
│  ├─ Edge Device (Jetson, Coral)
│  │  └─ Latency requirement?
│  │     ├─ Real-time (< 10ms) → MobileNetV3-Small, EfficientNet-Lite0
│  │     ├─ Medium (10-50ms) → MobileNetV3-Large, EfficientNet-Lite2
│  │     └─ Relaxed (> 50ms) → EfficientNet-B0, ResNet-18
│  │
│  └─ Mobile (iOS/Android)
│     └─ MobileNetV3-Small (fastest), MobileNetV3-Large (balanced)
│        + INT8 quantization (route to ml-production)
│
├─ ACCURACY PRIORITY (cloud deployment assumed)
│  ├─ Maximum accuracy → EfficientNet-B7, ResNet-152, ViT-Large
│  ├─ Balanced → EfficientNet-B2/B3, ResNet-50
│  └─ Fast training → ResNet-18, EfficientNet-B0
│
├─ EFFICIENCY PRIORITY
│  └─ Best accuracy per FLOP → EfficientNet family (B0-B7)
│     (EfficientNet dominates ResNet on the Pareto frontier)
│
└─ TASK TYPE
   ├─ Classification → Any CNN (use constraint-based selection above)
   ├─ Object Detection → ResNet + FPN, EfficientDet, YOLOv8 (CSPDarknet)
   └─ Segmentation → ResNet + U-Net, EfficientNet + DeepLabV3
```
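The tree above maps directly onto a small helper function. A minimal sketch — the function name and signature are illustrative, not part of any library, and the thresholds are copied straight from the tree:

```python
def recommend_cnn(deployment: str,
                  dataset_size: int = 0,
                  latency_ms: float = float("inf")) -> str:
    """Starting-point recommendation, following the decision tree above."""
    if deployment == "mobile":
        return "MobileNetV3 (Small=fastest, Large=balanced) + INT8 quantization"
    if deployment == "edge":
        if latency_ms < 10:
            return "MobileNetV3-Small or EfficientNet-Lite0"
        if latency_ms <= 50:
            return "MobileNetV3-Large or EfficientNet-Lite2"
        return "EfficientNet-B0 or ResNet-18"
    # Cloud/server: size the model by dataset size
    if dataset_size < 10_000:
        return "ResNet-18 or EfficientNet-B0"
    if dataset_size <= 100_000:
        return "ResNet-50 or EfficientNet-B2"
    return "ResNet-101, EfficientNet-B4, or ViT"


print(recommend_cnn("edge", latency_ms=30))
# -> MobileNetV3-Large or EfficientNet-Lite2
```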
## CNN Family Catalog

### 1. ResNet Family (2015) - The Standard Baseline

**Architecture:** Residual connections (skip connections) enable very deep networks

**Variants:**
- ResNet-18: 11M params, 1.8 GFLOPs, 69.8% ImageNet
- ResNet-34: 22M params, 3.7 GFLOPs, 73.3% ImageNet
- ResNet-50: 25M params, 4.1 GFLOPs, 76.1% ImageNet
- ResNet-101: 44M params, 7.8 GFLOPs, 77.4% ImageNet
- ResNet-152: 60M params, 11.6 GFLOPs, 78.3% ImageNet

**When to Use:**
- ✅ **Baseline choice**: Well-tested, widely supported
- ✅ **Transfer learning**: Excellent pre-trained weights available
- ✅ **Object detection**: Standard backbone for Faster R-CNN, Mask R-CNN
- ✅ **Interpretability**: Simple architecture, easy to understand

**When NOT to Use:**
- ❌ **Edge/mobile deployment**: Too large and slow
- ❌ **Efficiency priority**: EfficientNet beats ResNet on accuracy per FLOP
- ❌ **Small datasets (< 10k)**: Use ResNet-18, not ResNet-50+

**Key Insight:** Skip connections solve the vanishing gradient problem and make extreme depth trainable

**Code Example:**

```python
import torchvision.models as models

# For cloud/server (good dataset)
# (`weights=` replaces the deprecated `pretrained=True` in torchvision >= 0.13)
model = models.resnet50(weights="IMAGENET1K_V1")

# For small datasets or faster training
model = models.resnet18(weights="IMAGENET1K_V1")

# For maximum accuracy (cloud only)
model = models.resnet101(weights="IMAGENET1K_V1")
```

### 2. EfficientNet Family (2019) - Best Efficiency

**Architecture:** Compound scaling (depth + width + resolution) optimized via neural architecture search

**Variants:**
- EfficientNet-B0: 5M params, 0.4 GFLOPs, 77.3% ImageNet
- EfficientNet-B1: 8M params, 0.7 GFLOPs, 79.2% ImageNet
- EfficientNet-B2: 9M params, 1.0 GFLOPs, 80.3% ImageNet
- EfficientNet-B3: 12M params, 1.8 GFLOPs, 81.7% ImageNet
- EfficientNet-B4: 19M params, 4.2 GFLOPs, 82.9% ImageNet
- EfficientNet-B7: 66M params, 37 GFLOPs, 84.4% ImageNet

**When to Use:**
- ✅ **Efficiency matters**: Best accuracy per FLOP/parameter
- ✅ **Cloud deployment**: B2-B4 is the sweet spot for production
- ✅ **Limited compute**: B0 matches ResNet-50 accuracy at ~10x fewer FLOPs
- ✅ **Scaling needs**: Want to scale a model up/down systematically

**When NOT to Use:**
- ❌ **Real-time mobile**: Use MobileNet (EfficientNet has more layers)
- ❌ **Very small datasets**: Can overfit despite its efficiency
- ❌ **Simplicity needed**: More complex than ResNet

**Key Insight:** Compound scaling balances depth, width, and resolution together rather than scaling any one dimension alone (a short sketch follows the code example below)

**Efficiency Comparison:**

```
Same accuracy as ResNet-50 (~76%):
- ResNet-50:        25M params, 4.1 GFLOPs
- EfficientNet-B0:   5M params, 0.4 GFLOPs (≈10x more efficient!)

Better accuracy (82.9%):
- ResNet-152:       60M params, 11.6 GFLOPs → 78.3% ImageNet
- EfficientNet-B4:  19M params,  4.2 GFLOPs → 82.9% ImageNet
  (Better accuracy with ~3x fewer params and ~3x less compute)
```

**Code Example:**

```python
import timm  # PyTorch Image Models library

# Balanced choice (production)
model = timm.create_model('efficientnet_b2', pretrained=True)

# Efficiency priority (edge)
model = timm.create_model('efficientnet_b0', pretrained=True)

# Accuracy priority (research)
model = timm.create_model('efficientnet_b4', pretrained=True)
```
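To make compound scaling concrete, here is a minimal sketch of how a single coefficient φ scales all three dimensions at once. The constants are the values reported in the EfficientNet paper (α=1.2, β=1.1, γ=1.15, chosen so α·β²·γ² ≈ 2); actual per-variant input resolutions differ slightly from this idealized calculation:

```python
# Compound scaling: depth ~ alpha^phi, width ~ beta^phi, resolution ~ gamma^phi,
# with alpha * beta^2 * gamma^2 ≈ 2 so total FLOPs grow roughly as 2^phi.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # constants reported in the EfficientNet paper

def compound_scale(phi: int, base_resolution: int = 224):
    depth_mult = ALPHA ** phi                            # more layers
    width_mult = BETA ** phi                             # wider channels
    resolution = round(base_resolution * GAMMA ** phi)   # bigger input images
    return depth_mult, width_mult, resolution

for phi in range(5):  # B0..B4 correspond roughly to phi = 0..4
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, input ~{r}px")
```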
### 3. MobileNet Family (2017-2019) - Mobile Optimized

**Architecture:** Depthwise separable convolutions (drastically reduce compute)

**Variants:**
- MobileNetV1: 4.2M params, 0.6 GFLOPs, 70.6% ImageNet
- MobileNetV2: 3.5M params, 0.3 GFLOPs, 72.0% ImageNet
- MobileNetV3-Small: 2.5M params, 0.06 GFLOPs, 67.4% ImageNet
- MobileNetV3-Large: 5.4M params, 0.2 GFLOPs, 75.2% ImageNet

**When to Use:**
- ✅ **Mobile deployment**: iOS/Android apps
- ✅ **Edge devices**: Raspberry Pi, Jetson Nano
- ✅ **Real-time inference**: < 100ms latency budgets
- ✅ **Extreme efficiency**: < 10M parameter budget

**When NOT to Use:**
- ❌ **Unconstrained cloud deployment**: EfficientNet or ResNet give better accuracy
- ❌ **Accuracy priority**: Sacrifices accuracy for speed
- ❌ **Large datasets with ample compute**: You can afford stronger models

**Key Insight:** A depthwise separable convolution splits a standard convolution into a depthwise step plus a pointwise step, cutting a 3×3 convolution's operations by roughly 9x (see the sketch after the code example below)

**Deployment Performance:**

```
Raspberry Pi 4 inference (224×224 image):
- ResNet-50:          ~2000ms (unusable)
- ResNet-18:           ~600ms (slow)
- MobileNetV2:         ~150ms (acceptable)
- MobileNetV3-Large:    ~80ms (good)
- MobileNetV3-Small:    ~40ms (fast)

With INT8 quantization:
- MobileNetV3-Large:    ~30ms (production-ready)
- MobileNetV3-Small:    ~15ms (real-time)
```

**Code Example:**

```python
import torchvision.models as models

# For mobile deployment
# (`weights=` replaces the deprecated `pretrained=True` in torchvision >= 0.13)
model = models.mobilenet_v3_large(weights="IMAGENET1K_V1")

# For ultra-low latency (sacrifices accuracy)
model = models.mobilenet_v3_small(weights="IMAGENET1K_V1")

# Quantization for mobile (route to the ml-production skill for details)
# achieves a 2-4x speedup with minimal accuracy loss
```
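To ground the key insight above, here is a minimal PyTorch sketch of a depthwise separable convolution. The class name is illustrative; real MobileNet blocks also add batch norm, non-linearities, and (in V2/V3) inverted residuals:

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """A 3x3 depthwise conv followed by a 1x1 pointwise conv."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        # Depthwise: one 3x3 filter per input channel (groups=in_ch)
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride, 1,
                                   groups=in_ch, bias=False)
        # Pointwise: 1x1 conv mixes information across channels
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# Parameter comparison against a standard 3x3 convolution
standard = nn.Conv2d(128, 128, 3, padding=1, bias=False)
separable = DepthwiseSeparableConv(128, 128)
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard), count(separable))  # 147456 vs 17536 (~8.4x fewer)
```

The saving here is ~8.4x rather than the full 9x because the pointwise conv adds a small overhead; the ratio approaches 9x as the channel count grows.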
### 4. Inception Family (2014-2016) - Multi-Scale Features

**Architecture:** Multi-scale convolutions in parallel (inception modules)

**Variants:**
- InceptionV3: 24M params, 5.7 GFLOPs, 77.5% ImageNet
- InceptionV4: 42M params, 12.3 GFLOPs, 80.0% ImageNet
- Inception-ResNet: Hybrid with residual connections

**When to Use:**
- ✅ **Multi-scale features**: Objects at different sizes
- ✅ **Object detection**: Good backbone for detection
- ✅ **Historical interest**: Understanding multi-scale approaches

**When NOT to Use:**
- ❌ **Simplicity needed**: Complex architecture, hard to modify
- ❌ **Efficiency priority**: EfficientNet is better
- ❌ **Modern projects**: Largely superseded by ResNet/EfficientNet

**Key Insight:** Parallel multi-scale convolutions (1×1, 3×3, 5×5) capture different receptive fields

**Status:** Mostly historical - ResNet and EfficientNet have replaced Inception in practice

### 5. DenseNet Family (2017) - Dense Connections

**Architecture:** Every layer connects to every other layer (dense connections)

**Variants:**
- DenseNet-121: 8M params, 2.9 GFLOPs, 74.4% ImageNet
- DenseNet-169: 14M params, 3.4 GFLOPs, 75.6% ImageNet
- DenseNet-201: 20M params, 4.3 GFLOPs, 76.9% ImageNet

**When to Use:**
- ✅ **Parameter efficiency**: Good accuracy with few parameters
- ✅ **Feature reuse**: Dense connections enable feature reuse
- ✅ **Small datasets**: Better gradient flow helps with limited data

**When NOT to Use:**
- ❌ **Inference speed priority**: Dense connections are slow (high memory bandwidth)
- ❌ **Training speed**: Slower to train than ResNet
- ❌ **Production deployment**: Less mature ecosystem than ResNet

**Key Insight:** Dense connections improve gradient flow and enable feature reuse, but slow down inference

**Status:** Theoretically elegant, but ResNet/EfficientNet are more practical

### 6. VGG Family (2014) - Historical Baseline

**Architecture:** Very deep for its time (16-19 layers), small 3×3 convolutions, many parameters

**Variants:**
- VGG-16: 138M params, 15.5 GFLOPs, 71.5% ImageNet
- VGG-19: 144M params, 19.6 GFLOPs, 71.1% ImageNet

**When to Use:**
- ❌ **DON'T use VGG for new projects** - historical understanding only

**Why NOT to Use:**
- Massive parameter count (138M vs ResNet-50's 25M)
- Poor accuracy for its size
- Superseded by ResNet (2015)

**Key Insight:** Proved that depth matters, but skip connections (ResNet) are better

**Status:** **Obsolete** - use ResNet or EfficientNet instead

## Practical Selection Guide

### Scenario 1: Cloud/Server Deployment

**Goal:** Best accuracy, no compute constraints

**Recommendation:**

```
Small dataset (< 10k images):
→ EfficientNet-B0 or ResNet-18
  (Avoid overfitting with a smaller model)

Medium dataset (10k-100k images):
→ EfficientNet-B2 or ResNet-50
  (Balanced accuracy and efficiency)

Large dataset (> 100k images):
→ EfficientNet-B4 or ResNet-101
  (Can afford a larger model)

Maximum accuracy (research):
→ EfficientNet-B7 or Vision Transformer
  (If dataset > 1M images and compute is unlimited)
```

### Scenario 2: Edge Deployment (Jetson, Coral TPU)

**Goal:** Optimize for edge hardware latency

**Recommendation:**

```
Real-time requirement (< 10ms):
→ MobileNetV3-Small or EfficientNet-Lite0
  + INT8 quantization

Medium latency (10-50ms):
→ MobileNetV3-Large or EfficientNet-Lite2

Relaxed latency (> 50ms):
→ EfficientNet-B0 or ResNet-18
```

**Critical:** Profile on the actual edge hardware. Quantization is mandatory (route to ml-production).

### Scenario 3: Mobile Deployment (iOS/Android)

**Goal:** On-device inference, minimal battery drain

**Recommendation:**

```
All mobile deployments:
→ MobileNetV3-Large (balanced)
→ MobileNetV3-Small (fastest, less accurate)

Always use:
- INT8 quantization (2-4x speedup)
- CoreML (iOS) or TFLite (Android) optimization
- Benchmark on the target device before deploying
```

**Expected latency (iPhone 12, INT8 quantized):**
- MobileNetV3-Small: 5-10ms
- MobileNetV3-Large: 15-25ms

### Scenario 4: Object Detection

**Goal:** Select a backbone for a detection framework

**Recommendation:**

```
Faster R-CNN:
→ ResNet-50 + FPN (standard)
→ ResNet-101 + FPN (more accuracy)

YOLOv8:
→ CSPDarknet (built-in, optimized)

EfficientDet:
→ EfficientNet + BiFPN (best efficiency)

Custom detection:
→ ResNet or EfficientNet as backbone
→ Add a Feature Pyramid Network (FPN) for multi-scale features
```

**Note:** Detection adds significant compute on top of the backbone. Choose an efficient backbone.

### Scenario 5: Semantic Segmentation

**Goal:** Dense pixel-wise prediction

**Recommendation:**

```
U-Net style:
→ ResNet-18/34 as encoder (fast)
→ EfficientNet-B0 as encoder (efficient)

DeepLabV3:
→ ResNet-50 (standard)
→ MobileNetV3 (mobile deployment)

Key: Segmentation requires multi-scale features
→ Ensure the backbone has skip connections or an FPN
```
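The standard pairings from Scenarios 4 and 5 are available off the shelf in torchvision. A minimal sketch (assumes torchvision ≥ 0.13 for the `weights="DEFAULT"` shorthand; pre-trained weights download on first use):

```python
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.segmentation import (
    deeplabv3_resnet50,
    deeplabv3_mobilenet_v3_large,
)

# Scenario 4: Faster R-CNN with the standard ResNet-50 + FPN backbone
detector = fasterrcnn_resnet50_fpn(weights="DEFAULT")

# Scenario 5: DeepLabV3 with a ResNet-50 backbone (server)...
segmenter = deeplabv3_resnet50(weights="DEFAULT")
# ...or a MobileNetV3 backbone (mobile/edge deployment)
segmenter_mobile = deeplabv3_mobilenet_v3_large(weights="DEFAULT")
```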
## Trade-Off Analysis

### Accuracy vs Efficiency (Pareto Frontier)

**ImageNet Top-1 Accuracy vs FLOPs:**

```
Efficiency Winners (best accuracy per FLOP):
1. EfficientNet-B0: 77.3% @ 0.4 GFLOPs (best efficiency)
2. EfficientNet-B2: 80.3% @ 1.0 GFLOPs
3. EfficientNet-B4: 82.9% @ 4.2 GFLOPs

Accuracy Winners (best absolute accuracy):
1. ViT-Large: 85.2% @ 190 GFLOPs (requires a huge dataset)
2. EfficientNet-B7: 84.4% @ 37 GFLOPs
3. ResNet-152: 78.3% @ 11.6 GFLOPs (dominated by EfficientNet)

Speed Winners (lowest latency):
1. MobileNetV3-Small: 67.4% @ 0.06 GFLOPs (~40ms on Raspberry Pi 4)
2. MobileNetV3-Large: 75.2% @ 0.2 GFLOPs (~80ms on Raspberry Pi 4)
3. EfficientNet-Lite0: 75.0% @ 0.4 GFLOPs
```

**Key Takeaway:** EfficientNet dominates ResNet on the Pareto frontier (better accuracy at the same compute).

### Parameters vs Accuracy

**For the same ~75% ImageNet accuracy:**

```
VGG-16:            138M params (❌ terrible efficiency)
ResNet-50:          25M params
EfficientNet-B0:     5M params (✅ 5x fewer than ResNet-50!)
MobileNetV3-Large:   5M params (fast inference)
```

**Conclusion:** Modern architectures (EfficientNet, MobileNet) achieve the same accuracy with far fewer parameters.

## Common Pitfalls

### Pitfall 1: Defaulting to ResNet-50

**Symptom:** Using ResNet-50 without considering alternatives
**Why it's wrong:** EfficientNet-B0 matches ResNet-50 accuracy with ~10x less compute
**Fix:** Consider the EfficientNet family first (better efficiency)

### Pitfall 2: Choosing a Large Model for a Small Dataset

**Symptom:** Using ResNet-101 with < 10k images
**Why it's wrong:** The model will overfit (too many parameters for the data)
**Fix:**
- < 10k images → ResNet-18 or EfficientNet-B0
- 10k-100k → ResNet-50 or EfficientNet-B2
- > 100k → Can use larger models

### Pitfall 3: Using a Desktop Model on Mobile

**Symptom:** Trying to run ResNet-50 on a mobile device
**Why it's wrong:** ~2000ms inference time is unusable
**Fix:** Use MobileNetV3 + quantization for mobile (15-30ms)

### Pitfall 4: Ignoring Task Type

**Symptom:** Using a plain classification CNN for object detection without an FPN
**Why it's wrong:** Detection needs multi-scale features
**Fix:** Use detection-specific frameworks (YOLOv8, Faster R-CNN) with an appropriate backbone

### Pitfall 5: Believing "Bigger = Better"

**Symptom:** Choosing ResNet-152 over ResNet-50 without justification
**Why it's wrong:** Diminishing returns - nearly 3x the compute for ~2% accuracy (76.1% → 78.3% per the variants table), plus overfitting risk on small data
**Fix:** Match model capacity to dataset size, and weigh efficiency

## Evolution and Historical Context

**Why CNNs evolved the way they did:**

```
2012: AlexNet
→ Proved deep learning works for vision
→ 8 layers, 60M params

2014: VGG
→ Deeper is better (16-19 layers)
→ But: 138M params (too many)

2014: Inception/GoogLeNet
→ Multi-scale convolutions
→ More efficient than VGG

2015: ResNet ★
→ Skip connections enable very deep networks (152 layers)
→ Solved the vanishing gradient problem
→ Became the standard baseline

2017: MobileNet
→ Mobile deployment needs
→ Depthwise separable convolutions (9x fewer ops)

2017: DenseNet
→ Dense connections for feature reuse
→ Parameter efficient but slow inference

2019: EfficientNet ★
→ Compound scaling (depth + width + resolution)
→ Neural architecture search
→ Dominates the Pareto frontier (best accuracy per FLOP)
→ The new standard for efficiency

2020: Vision Transformer
→ Attention-based (no convolutions)
→ Requires very large datasets (> 1M images)
→ For research/large-scale applications
```

**Current Recommendations (2025):**
- Cloud: **EfficientNet** (best efficiency) or ResNet (simplicity)
- Edge: **EfficientNet-Lite** or MobileNetV3
- Mobile: **MobileNetV3** + quantization
- Detection: **EfficientDet** or YOLOv8
- Baseline: **ResNet** (simple, well-tested)
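Most of the pitfalls above come down to not measuring before choosing. A minimal CPU latency harness as a sketch — absolute numbers depend entirely on your hardware, so treat only the relative gaps as meaningful (`weights=None` skips the download, since randomly initialized weights time the same):

```python
import time
import torch
import torchvision.models as models

def latency_ms(model: torch.nn.Module, size: int = 224,
               warmup: int = 10, iters: int = 50) -> float:
    """Average single-image CPU inference latency in milliseconds."""
    model.eval()
    x = torch.randn(1, 3, size, size)
    with torch.no_grad():
        for _ in range(warmup):       # warmup runs: let threads/caches settle
            model(x)
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
    return (time.perf_counter() - start) / iters * 1000

for name, ctor in [("resnet50", models.resnet50),
                   ("mobilenet_v3_large", models.mobilenet_v3_large)]:
    print(f"{name}: {latency_ms(ctor(weights=None)):.1f} ms")
```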
## Decision Checklist

Before choosing a CNN, answer these:

```
☐ Deployment target? (cloud / edge / mobile)
☐ Latency requirement? (< 10ms / 10-100ms / > 100ms)
☐ Model size budget? (< 10M / 10-50M / unlimited params)
☐ Dataset size? (< 10k / 10k-100k / > 100k images)
☐ Accuracy priority? (maximum / production / fast iteration)
☐ Task type? (classification / detection / segmentation)
☐ Efficiency matters? (yes → EfficientNet, no → more flexibility)

Based on answers:
→ Mobile → MobileNetV3
→ Edge → EfficientNet-Lite or MobileNetV3
→ Cloud + efficiency → EfficientNet
→ Cloud + simplicity → ResNet
→ Maximum accuracy → EfficientNet-B7 or ViT
→ Small dataset → small models (ResNet-18, EfficientNet-B0)
```

## Integration with Other Skills

**After selecting a CNN architecture:**

**Training the model:**
→ `yzmir/training-optimization/using-training-optimization`
- Optimizer selection (Adam, SGD, AdamW)
- Learning rate schedules
- Data augmentation

**Implementing in PyTorch:**
→ `yzmir/pytorch-engineering/using-pytorch-engineering`
- Custom modifications to pre-trained models
- Multi-GPU training
- Performance optimization

**Deploying to production:**
→ `yzmir/ml-production/using-ml-production`
- Quantization (INT8, FP16)
- Model serving (TorchServe, ONNX)
- Optimization for edge/mobile (TFLite, CoreML)

**If the architecture is unstable (very deep):**
→ `yzmir/neural-architectures/normalization-techniques`
- Normalization layers (BatchNorm, LayerNorm)
- Skip connections
- Initialization strategies

## Summary

**CNN Selection in One Table:**

| Scenario | Recommendation | Why |
|----------|----------------|-----|
| Cloud, balanced | EfficientNet-B2 | Best efficiency, ~80% accuracy |
| Cloud, max accuracy | EfficientNet-B4 | ~83% accuracy, reasonable compute |
| Cloud, simple baseline | ResNet-50 | Well-tested, widely used |
| Edge device | MobileNetV3-Large | Optimized for edge, ~75% accuracy |
| Mobile app | MobileNetV3-Small + quantization | < 20ms inference |
| Small dataset (< 10k) | ResNet-18 or EfficientNet-B0 | Avoid overfitting |
| Object detection | ResNet-50 + FPN, EfficientDet | Multi-scale features |
| Segmentation | ResNet + U-Net, DeepLabV3 | Dense prediction |

**Key Principles:**

1. **Match model capacity to dataset size** (small data → small model)
2. **EfficientNet dominates ResNet on efficiency** (same accuracy, less compute)
3. **Mobile needs mobile-specific architectures** (MobileNet, quantization)
4. **Task type matters** (detection/segmentation need multi-scale features)
5. **Bigger ≠ always better** (diminishing returns, overfitting risk)

**When in doubt:** Start with **EfficientNet-B2** (cloud) or **MobileNetV3-Large** (edge/mobile).

**END OF SKILL**