Initial commit
12
.claude-plugin/plugin.json
Normal file
@@ -0,0 +1,12 @@
{
  "name": "yzmir-ml-production",
  "description": "Production ML - quantization, serving, MLOps, monitoring, debugging - 11 skills",
  "version": "1.0.1",
  "author": {
    "name": "tachyon-beep",
    "url": "https://github.com/tachyon-beep"
  },
  "skills": [
    "./skills"
  ]
}
3
README.md
Normal file
@@ -0,0 +1,3 @@
# yzmir-ml-production

Production ML - quantization, serving, MLOps, monitoring, debugging - 11 skills
85
plugin.lock.json
Normal file
@@ -0,0 +1,85 @@
|
||||
{
|
||||
"$schema": "internal://schemas/plugin.lock.v1.json",
|
||||
"pluginId": "gh:tachyon-beep/skillpacks:plugins/yzmir-ml-production",
|
||||
"normalized": {
|
||||
"repo": null,
|
||||
"ref": "refs/tags/v20251128.0",
|
||||
"commit": "2b410bec7eaeb0c8a9bb0944e59216593615f9d4",
|
||||
"treeHash": "eff4907ab2e6683eaed3c80098f70e214f76e7d0b01c6ab1173d990104a2483f",
|
||||
"generatedAt": "2025-11-28T10:28:34.035004Z",
|
||||
"toolVersion": "publish_plugins.py@0.2.0"
|
||||
},
|
||||
"origin": {
|
||||
"remote": "git@github.com:zhongweili/42plugin-data.git",
|
||||
"branch": "master",
|
||||
"commit": "aa1497ed0949fd50e99e70d6324a29c5b34f9390",
|
||||
"repoRoot": "/Users/zhongweili/projects/openmind/42plugin-data"
|
||||
},
|
||||
"manifest": {
|
||||
"name": "yzmir-ml-production",
|
||||
"description": "Production ML - quantization, serving, MLOps, monitoring, debugging - 11 skills",
|
||||
"version": "1.0.1"
|
||||
},
|
||||
"content": {
|
||||
"files": [
|
||||
{
|
||||
"path": "README.md",
|
||||
"sha256": "e662853b40fdc3d0674eaf0d5ab055bbda78021646fa507cacac3a638eaf6726"
|
||||
},
|
||||
{
|
||||
"path": ".claude-plugin/plugin.json",
|
||||
"sha256": "f508684440605ce9616c6203557e05fde2093ff6dcfb786ed6ee5f1886a92470"
|
||||
},
|
||||
{
|
||||
"path": "skills/using-ml-production/hardware-optimization-strategies.md",
|
||||
"sha256": "b7c04b2be4799b096b9f4990ecc0ef9fe0c4fb9d82acb2420a9c66b894336164"
|
||||
},
|
||||
{
|
||||
"path": "skills/using-ml-production/production-debugging-techniques.md",
|
||||
"sha256": "0008cbd074283fb34620636b12ad9f1374eb213b82ceba5ad6f20c3902d2827b"
|
||||
},
|
||||
{
|
||||
"path": "skills/using-ml-production/model-compression-techniques.md",
|
||||
"sha256": "5ee236c2a752571e7f3e700981440059815c89833b621f510d529c6d07952664"
|
||||
},
|
||||
{
|
||||
"path": "skills/using-ml-production/model-serving-patterns.md",
|
||||
"sha256": "b5490e4022c1f818d19217dc35694e49a32d3af235a909e5b17b75fc412ec5e3"
|
||||
},
|
||||
{
|
||||
"path": "skills/using-ml-production/deployment-strategies.md",
|
||||
"sha256": "0c8b29536a152f61998f48e07bd53b628fa0bc5267f65179018645b6e269df4f"
|
||||
},
|
||||
{
|
||||
"path": "skills/using-ml-production/scaling-and-load-balancing.md",
|
||||
"sha256": "dd3a9db87eab63298086799b35e316f6026eb2434873007bd30bd96dc156674b"
|
||||
},
|
||||
{
|
||||
"path": "skills/using-ml-production/production-monitoring-and-alerting.md",
|
||||
"sha256": "1c3771cb0de01570a4cd2fd72259f05018756975e9f6d83740a9604c343d1576"
|
||||
},
|
||||
{
|
||||
"path": "skills/using-ml-production/quantization-for-inference.md",
|
||||
"sha256": "bfa1fadf2a1686a7c6157cc38fe5b71abd5969b7c26fadaef10789feba166ba9"
|
||||
},
|
||||
{
|
||||
"path": "skills/using-ml-production/SKILL.md",
|
||||
"sha256": "8f9352b2c826bcfbf78b21d1e398c519c77b950b3403faeb6e365646f9b5b518"
|
||||
},
|
||||
{
|
||||
"path": "skills/using-ml-production/mlops-pipeline-automation.md",
|
||||
"sha256": "e8474c4ff586cc6710b853c8bec89cbcac44957a320764367fac1227c09efc41"
|
||||
},
|
||||
{
|
||||
"path": "skills/using-ml-production/experiment-tracking-and-versioning.md",
|
||||
"sha256": "410c9c2d9c93783496d8df021d7c6a503661fe03d75d7b76432329589a317c0d"
|
||||
}
|
||||
],
|
||||
"dirSha256": "eff4907ab2e6683eaed3c80098f70e214f76e7d0b01c6ab1173d990104a2483f"
|
||||
},
|
||||
"security": {
|
||||
"scannedAt": null,
|
||||
"scannerVersion": null,
|
||||
"flags": []
|
||||
}
|
||||
}
|
||||
372
skills/using-ml-production/SKILL.md
Normal file
@@ -0,0 +1,372 @@
---
name: using-ml-production
description: Router skill directing to deployment, optimization, MLOps, and monitoring guides.
mode: true
---

# Using ML Production
|
||||
|
||||
## Overview
|
||||
|
||||
This meta-skill routes you to the right production deployment skill based on your concern. Load this when you need to move ML models to production but aren't sure which specific aspect to address.
|
||||
|
||||
**Core Principle**: Production concerns fall into four categories. Identify the concern first, then route to the appropriate skill. Tools and infrastructure choices are implementation details, not routing criteria.
|
||||
|
||||
## When to Use
|
||||
|
||||
Load this skill when:
|
||||
- Deploying ML models to production
|
||||
- Optimizing model inference (speed, size, cost)
|
||||
- Setting up MLOps workflows (tracking, automation, CI/CD)
|
||||
- Monitoring or debugging production models
|
||||
- User mentions: "production", "deploy", "serve model", "MLOps", "monitoring", "optimize inference"
|
||||
|
||||
**Don't use for**: Training optimization (use `training-optimization`), model architecture selection (use `neural-architectures`), PyTorch infrastructure (use `pytorch-engineering`)
|
||||
|
||||
## Routing by Concern
|
||||
|
||||
### Category 1: Model Optimization
|
||||
|
||||
**Symptoms**: "Model too slow", "inference latency high", "model too large", "need to optimize for edge", "reduce model size", "speed up inference"
|
||||
|
||||
**When to route here**:
|
||||
- Model itself is the bottleneck (not infrastructure)
|
||||
- Need to reduce model size or increase inference speed
|
||||
- Deploying to resource-constrained hardware (edge, mobile)
|
||||
- Cost optimization through model efficiency
|
||||
|
||||
**Routes to**:
|
||||
- [quantization-for-inference.md](quantization-for-inference.md) - Reduce precision (INT8/INT4), speed up inference
|
||||
- [model-compression-techniques.md](model-compression-techniques.md) - Pruning, distillation, architecture optimization
|
||||
- [hardware-optimization-strategies.md](hardware-optimization-strategies.md) - GPU/CPU/edge tuning, batch sizing
|
||||
|
||||
**Key question to ask**: "Is the MODEL the bottleneck, or is it infrastructure/serving?"
|
||||
|
||||
---
|
||||
|
||||
### Category 2: Serving Infrastructure
|
||||
|
||||
**Symptoms**: "How to serve model", "need API endpoint", "deploy to production", "containerize model", "scale serving", "load balancing", "traffic management"
|
||||
|
||||
**When to route here**:
|
||||
- Need to expose model as API or service
|
||||
- Questions about serving patterns (REST, gRPC, batch)
|
||||
- Deployment strategies (gradual rollout, A/B testing)
|
||||
- Scaling concerns (traffic, replicas, autoscaling)
|
||||
|
||||
**Routes to**:
|
||||
- [model-serving-patterns.md](model-serving-patterns.md) - FastAPI, TorchServe, gRPC, ONNX, batching, containerization
|
||||
- [deployment-strategies.md](deployment-strategies.md) - A/B testing, canary, shadow mode, rollback procedures
|
||||
- [scaling-and-load-balancing.md](scaling-and-load-balancing.md) - Horizontal scaling, autoscaling, load balancing, cost optimization
|
||||
|
||||
**Key distinction**:
|
||||
- Serving patterns = HOW to expose model (API, container, batching)
|
||||
- Deployment strategies = HOW to roll out safely (gradual, testing, rollback)
|
||||
- Scaling = HOW to handle traffic (replicas, autoscaling, balancing)
|
||||
|
||||
---
|
||||
|
||||
### Category 3: MLOps Tooling
|
||||
|
||||
**Symptoms**: "Track experiments", "version models", "automate deployment", "reproducibility", "CI/CD for ML", "feature store", "model registry", "experiment management"
|
||||
|
||||
**When to route here**:
|
||||
- Need workflow/process improvements
|
||||
- Want to track experiments or version models
|
||||
- Need to automate training-to-deployment pipeline
|
||||
- Team collaboration and reproducibility concerns
|
||||
|
||||
**Routes to**:
|
||||
- [experiment-tracking-and-versioning.md](experiment-tracking-and-versioning.md) - MLflow, Weights & Biases, model registries, reproducibility, lineage
|
||||
- [mlops-pipeline-automation.md](mlops-pipeline-automation.md) - CI/CD for ML, feature stores, data validation, automated retraining, orchestration
|
||||
|
||||
**Key distinction**:
|
||||
- Experiment tracking = Research/development phase (track runs, version models)
|
||||
- Pipeline automation = Production phase (automate workflows, CI/CD)
|
||||
|
||||
**Multi-concern**: Queries like "track experiments AND automate deployment" → route to BOTH skills
|
||||
|
||||
---
|
||||
|
||||
### Category 4: Observability
|
||||
|
||||
**Symptoms**: "Monitor production", "model degrading", "detect drift", "production debugging", "alert on failures", "model not working in prod", "performance issues in production"
|
||||
|
||||
**When to route here**:
|
||||
- Model already deployed, need to monitor or debug
|
||||
- Detecting production issues (drift, errors, degradation)
|
||||
- Setting up alerts and dashboards
|
||||
- Root cause analysis for production failures
|
||||
|
||||
**Routes to**:
|
||||
- [production-monitoring-and-alerting.md](production-monitoring-and-alerting.md) - Metrics, drift detection, dashboards, alerts, SLAs
|
||||
- [production-debugging-techniques.md](production-debugging-techniques.md) - Error analysis, profiling, rollback procedures, post-mortems
|
||||
|
||||
**Key distinction**:
|
||||
- Monitoring = Proactive (set up metrics, alerts, detect issues early)
|
||||
- Debugging = Reactive (diagnose and fix existing issues)
|
||||
|
||||
**"Performance" ambiguity**:
|
||||
- If "performance" = speed/latency → might be Category 1 (optimization) or Category 2 (serving/scaling)
|
||||
- If "performance" = accuracy degradation → Category 4 (observability - drift detection)
|
||||
- **Ask clarifying question**: "By performance, do you mean inference speed or model accuracy?"
|
||||
|
||||
---
|
||||
|
||||
## Routing Decision Tree
|
||||
|
||||
```
|
||||
User query → Identify primary concern
|
||||
|
||||
Is model THE problem (size/speed)?
|
||||
YES → Category 1: Model Optimization
|
||||
NO → Continue
|
||||
|
||||
Is it about HOW to expose/deploy model?
|
||||
YES → Category 2: Serving Infrastructure
|
||||
NO → Continue
|
||||
|
||||
Is it about workflow/process/automation?
|
||||
YES → Category 3: MLOps Tooling
|
||||
NO → Continue
|
||||
|
||||
Is it about monitoring/debugging in production?
|
||||
YES → Category 4: Observability
|
||||
NO → Ask clarifying question
|
||||
|
||||
Ambiguous? → Ask ONE question to clarify concern category
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Clarification Questions for Ambiguous Queries
|
||||
|
||||
### Query: "My model is too slow"
|
||||
|
||||
**Ask**: "Is this inference latency (how fast predictions are), or training time?"
|
||||
- Training → Route to `training-optimization` (wrong pack)
|
||||
- Inference → Follow-up: "Have you profiled to find bottlenecks?"
|
||||
- Model is bottleneck → Category 1 (optimization)
|
||||
- Infrastructure/batching issue → Category 2 (serving)
|
||||
|
||||
### Query: "I need to deploy my model"
|
||||
|
||||
**Ask**: "What's your deployment target - cloud server, edge device, or batch processing?"
|
||||
- Cloud/server → Category 2 (serving-patterns, then maybe deployment-strategies if gradual rollout needed)
|
||||
- Edge/mobile → Category 1 (optimization first for size/speed) + Category 2 (serving)
|
||||
- Batch → Category 2 (serving-patterns - batch processing)
|
||||
|
||||
### Query: "My model isn't performing well in production"
|
||||
|
||||
**Ask**: "By performance, do you mean inference speed or prediction accuracy?"
|
||||
- Speed → Category 1 (optimization) or Category 2 (serving/scaling)
|
||||
- Accuracy → Category 4 (observability - drift detection, monitoring)
|
||||
|
||||
### Query: "Set up MLOps for my team"
|
||||
|
||||
**Ask**: "What's the current pain point - experiment tracking, automated deployment, or both?"
|
||||
- Tracking/versioning → Category 3 (experiment-tracking-and-versioning)
|
||||
- Automation/CI/CD → Category 3 (mlops-pipeline-automation)
|
||||
- Both → Route to BOTH skills
|
||||
|
||||
---
|
||||
|
||||
## Multi-Concern Scenarios
|
||||
|
||||
Some queries span multiple categories. Route to ALL relevant skills in logical order:
|
||||
|
||||
| Scenario | Route Order | Why |
|
||||
|----------|-------------|-----|
|
||||
| "Optimize and deploy model" | 1. Optimization → 2. Serving | Optimize BEFORE deploying |
|
||||
| "Deploy and monitor model" | 1. Serving → 2. Observability | Deploy BEFORE monitoring |
|
||||
| "Track experiments and automate deployment" | 1. Experiment tracking → 2. Pipeline automation | Track BEFORE automating |
|
||||
| "Quantize model and serve with TorchServe" | 1. Quantization → 2. Serving patterns | Optimize BEFORE serving |
|
||||
| "Deploy with A/B testing and monitor" | 1. Deployment strategies → 2. Monitoring | Deploy strategy BEFORE monitoring |
|
||||
|
||||
**Principle**: Route in execution order (what needs to happen first).
|
||||
|
||||
---
|
||||
|
||||
## Relationship with Other Packs
|
||||
|
||||
### With llm-specialist
|
||||
|
||||
**ml-production covers**: General serving, quantization, deployment, monitoring (universal patterns)
|
||||
|
||||
**llm-specialist covers**: LLM-specific optimization (KV cache, prompt caching, speculative decoding, token streaming)
|
||||
|
||||
**When to use both**:
|
||||
- "Deploy LLM to production" → llm-specialist (for inference-optimization) + ml-production (for serving, monitoring)
|
||||
- "Quantize LLM" → llm-specialist (LLM-specific quantization patterns) OR ml-production (general quantization)
|
||||
|
||||
**Rule of thumb**: LLM-specific optimization stays in llm-specialist. General production patterns use ml-production.
|
||||
|
||||
### With training-optimization
|
||||
|
||||
**Clear boundary**:
|
||||
- training-optimization = Training phase (convergence, hyperparameters, training speed)
|
||||
- ml-production = Inference phase (deployment, serving, monitoring)
|
||||
|
||||
**"Too slow" disambiguation**:
|
||||
- Training slow → training-optimization
|
||||
- Inference slow → ml-production
|
||||
|
||||
### With pytorch-engineering
|
||||
|
||||
**pytorch-engineering covers**: Foundation (distributed training, profiling, memory management)
|
||||
|
||||
**ml-production covers**: Production-specific (serving APIs, deployment patterns, MLOps)
|
||||
|
||||
**When to use both**:
|
||||
- "Profile production inference" → pytorch-engineering (profiling techniques) + ml-production (production context)
|
||||
- "Optimize serving performance" → ml-production (serving patterns) + pytorch-engineering (if need low-level profiling)
|
||||
|
||||
---
|
||||
|
||||
## Common Routing Mistakes
|
||||
|
||||
| Query | Wrong Route | Correct Route | Why |
|
||||
|-------|-------------|---------------|-----|
|
||||
| "Model too slow in production" | Immediately to quantization | Ask: inference or training? Then model vs infrastructure? | Could be serving/batching issue, not model |
|
||||
| "Deploy with Kubernetes" | Defer to Kubernetes docs | Category 2: serving-patterns or deployment-strategies | Kubernetes is tool choice, not routing concern |
|
||||
| "Set up MLOps" | Route to one skill | Ask about specific pain point, might be both tracking AND automation | MLOps spans multiple skills |
|
||||
| "Performance issues" | Assume accuracy | Ask: speed or accuracy? | Performance is ambiguous |
|
||||
| "We use TorchServe" | Skip routing | Still route to serving-patterns | Tool choice doesn't change routing |
|
||||
|
||||
---
|
||||
|
||||
## Common Rationalizations (Don't Do These)
|
||||
|
||||
| Excuse | Reality |
|
||||
|--------|---------|
|
||||
| "User mentioned Kubernetes, route to deployment" | Tools are implementation details. Route by concern first. |
|
||||
| "Slow = optimization, route to quantization" | Slow could be infrastructure. Clarify model vs serving bottleneck. |
|
||||
| "They said deploy, must be serving-patterns" | Could need serving + deployment-strategies + monitoring. Don't assume single concern. |
|
||||
| "MLOps = experiment tracking" | MLOps spans tracking AND automation. Ask which pain point. |
|
||||
| "Performance obviously means speed" | Could mean accuracy. Clarify inference speed vs prediction quality. |
|
||||
| "They're technical, skip clarification" | Technical users still benefit from clarifying questions. |
|
||||
|
||||
---
|
||||
|
||||
## Red Flags Checklist
|
||||
|
||||
If you catch yourself thinking ANY of these, STOP and clarify:
|
||||
|
||||
- "I'll guess optimization vs serving" → ASK which is the bottleneck
|
||||
- "Performance probably means speed" → ASK speed or accuracy
|
||||
- "Deploy = serving-patterns only" → Consider deployment-strategies and monitoring too
|
||||
- "They mentioned [tool], route based on tool" → Route by CONCERN, not tool
|
||||
- "MLOps = one skill" → Could span experiment tracking AND automation
|
||||
- "Skip question to save time" → Clarifying prevents wrong routing
|
||||
|
||||
**When in doubt**: Ask ONE clarifying question. 10 seconds of clarification prevents minutes of wrong-skill loading.
|
||||
|
||||
---
|
||||
|
||||
## Routing Summary Table
|
||||
|
||||
| User Concern | Ask Clarifying | Route To | Also Consider |
|
||||
|--------------|----------------|----------|---------------|
|
||||
| Model slow/large | Inference or training? | Optimization skills | If inference, check serving too |
|
||||
| Deploy model | Target (cloud/edge/batch)? | Serving patterns | Deployment strategies for gradual rollout |
|
||||
| Production monitoring | Proactive or reactive? | Monitoring OR debugging | Both if setting up + fixing issues |
|
||||
| MLOps setup | Tracking or automation? | Experiment tracking AND/OR automation | Often both needed |
|
||||
| Performance issues | Speed or accuracy? | Optimization OR observability | Depends on clarification |
|
||||
| Scale serving | Traffic pattern? | Scaling-and-load-balancing | Serving patterns if not set up yet |
|
||||
|
||||
---
|
||||
|
||||
## Integration Examples
|
||||
|
||||
### Example 1: Full Production Pipeline
|
||||
|
||||
**Query**: "I trained a model, now I need to put it in production"
|
||||
|
||||
**Routing**:
|
||||
1. Ask: "What's your deployment target and are there performance concerns?"
|
||||
2. If "cloud deployment, model is fast enough":
|
||||
- [model-serving-patterns.md](model-serving-patterns.md) (expose as API)
|
||||
- [deployment-strategies.md](deployment-strategies.md) (if gradual rollout needed)
|
||||
- [production-monitoring-and-alerting.md](production-monitoring-and-alerting.md) (set up observability)
|
||||
3. If "edge device, model too large":
|
||||
- [quantization-for-inference.md](quantization-for-inference.md) (reduce size first)
|
||||
- [model-serving-patterns.md](model-serving-patterns.md) (edge deployment pattern)
|
||||
- [production-monitoring-and-alerting.md](production-monitoring-and-alerting.md) (if possible on edge)
|
||||
|
||||
### Example 2: Optimization Decision
|
||||
|
||||
**Query**: "My inference is slow"
|
||||
|
||||
**Routing**:
|
||||
1. Ask: "Have you profiled to find the bottleneck - is it the model or serving infrastructure?"
|
||||
2. If "not profiled yet":
|
||||
- [production-debugging-techniques.md](production-debugging-techniques.md) (profile first to diagnose)
|
||||
- Then route based on findings
|
||||
3. If "model is bottleneck":
|
||||
- [hardware-optimization-strategies.md](hardware-optimization-strategies.md) (check if hardware tuning helps)
|
||||
- If not enough → [quantization-for-inference.md](quantization-for-inference.md) or [model-compression-techniques.md](model-compression-techniques.md)
|
||||
4. If "infrastructure/batching is bottleneck":
|
||||
- [model-serving-patterns.md](model-serving-patterns.md) (batching strategies)
|
||||
- [scaling-and-load-balancing.md](scaling-and-load-balancing.md) (if traffic-related)
|
||||
|
||||
### Example 3: MLOps Maturity
|
||||
|
||||
**Query**: "We need better ML workflows"
|
||||
|
||||
**Routing**:
|
||||
1. Ask: "What's the current pain point - can't reproduce experiments, manual deployment, or both?"
|
||||
2. If "can't reproduce, need to track experiments":
|
||||
- [experiment-tracking-and-versioning.md](experiment-tracking-and-versioning.md)
|
||||
3. If "manual deployment is slow":
|
||||
- [mlops-pipeline-automation.md](mlops-pipeline-automation.md)
|
||||
4. If "both reproducibility and automation":
|
||||
- [experiment-tracking-and-versioning.md](experiment-tracking-and-versioning.md) (establish tracking first)
|
||||
- [mlops-pipeline-automation.md](mlops-pipeline-automation.md) (then automate workflow)
|
||||
|
||||
---
|
||||
|
||||
## When NOT to Use ml-production Skills
|
||||
|
||||
**Skip ml-production when:**
|
||||
- Still designing/training model → Use neural-architectures, training-optimization
|
||||
- PyTorch infrastructure issues → Use pytorch-engineering
|
||||
- LLM-specific optimization only → Use llm-specialist (unless also need serving)
|
||||
- Classical ML deployment (gradient boosting, sklearn) → ml-production patterns still apply, though serving and optimization needs are usually simpler

**Red flag**: If the model isn't trained yet, you probably don't need ml-production. Finish training first.
|
||||
|
||||
---
|
||||
|
||||
## Success Criteria
|
||||
|
||||
You've routed correctly when:
|
||||
- ✅ Identified concern category (optimization, serving, MLOps, observability)
|
||||
- ✅ Asked clarifying question for ambiguous queries
|
||||
- ✅ Routed to appropriate skill(s) in logical order
|
||||
- ✅ Didn't let tool choices (Kubernetes, TorchServe) dictate routing
|
||||
- ✅ Recognized multi-concern scenarios and routed to multiple skills
|
||||
|
||||
---
|
||||
|
||||
## ML Production Specialist Skills Catalog
|
||||
|
||||
After routing, load the appropriate specialist skill for detailed guidance:
|
||||
|
||||
1. [quantization-for-inference.md](quantization-for-inference.md) - INT8/INT4 quantization, post-training quantization, quantization-aware training, precision reduction for inference speed
|
||||
2. [model-compression-techniques.md](model-compression-techniques.md) - Pruning (structured/unstructured), knowledge distillation, architecture optimization, model size reduction
|
||||
3. [hardware-optimization-strategies.md](hardware-optimization-strategies.md) - GPU/CPU/edge tuning, batch sizing, memory optimization, hardware-specific acceleration (TensorRT, ONNX Runtime)
|
||||
4. [model-serving-patterns.md](model-serving-patterns.md) - FastAPI, TorchServe, gRPC, ONNX, batching strategies, containerization (Docker), REST/gRPC APIs
|
||||
5. [deployment-strategies.md](deployment-strategies.md) - A/B testing, canary deployment, shadow mode, gradual rollout, rollback procedures, blue-green deployment
|
||||
6. [scaling-and-load-balancing.md](scaling-and-load-balancing.md) - Horizontal scaling, autoscaling, load balancing, traffic management, cost optimization, replica management
|
||||
7. [experiment-tracking-and-versioning.md](experiment-tracking-and-versioning.md) - MLflow, Weights & Biases, model registries, experiment reproducibility, model lineage, versioning
|
||||
8. [mlops-pipeline-automation.md](mlops-pipeline-automation.md) - CI/CD for ML, feature stores, data validation, automated retraining, orchestration (Airflow, Kubeflow)
|
||||
9. [production-monitoring-and-alerting.md](production-monitoring-and-alerting.md) - Metrics tracking, drift detection, dashboards, alerting, SLAs, proactive monitoring
|
||||
10. [production-debugging-techniques.md](production-debugging-techniques.md) - Error analysis, production profiling, rollback procedures, post-mortems, root cause analysis
|
||||
|
||||
---
|
||||
|
||||
## References
|
||||
|
||||
- See design doc: `docs/plans/2025-10-30-ml-production-pack-design.md`
|
||||
- Primary router: `yzmir/ai-engineering-expert/using-ai-engineering`
|
||||
- Related packs: `llm-specialist/using-llm-specialist`, `training-optimization/using-training-optimization`
|
||||
3482
skills/using-ml-production/deployment-strategies.md
Normal file
File diff suppressed because it is too large
2565
skills/using-ml-production/experiment-tracking-and-versioning.md
Normal file
File diff suppressed because it is too large
1323
skills/using-ml-production/hardware-optimization-strategies.md
Normal file
File diff suppressed because it is too large
2615
skills/using-ml-production/mlops-pipeline-automation.md
Normal file
File diff suppressed because it is too large
1194
skills/using-ml-production/model-compression-techniques.md
Normal file
File diff suppressed because it is too large
1667
skills/using-ml-production/model-serving-patterns.md
Normal file
File diff suppressed because it is too large
3466
skills/using-ml-production/production-debugging-techniques.md
Normal file
File diff suppressed because it is too large
1412
skills/using-ml-production/production-monitoring-and-alerting.md
Normal file
File diff suppressed because it is too large
991
skills/using-ml-production/quantization-for-inference.md
Normal file
@@ -0,0 +1,991 @@
|
||||
|
||||
# Quantization for Inference Skill
|
||||
|
||||
## When to Use This Skill
|
||||
|
||||
Use this skill when you observe these symptoms:
|
||||
|
||||
**Performance Symptoms:**
|
||||
- Model inference too slow on CPU (e.g., >10ms when you need <5ms)
|
||||
- Batch processing taking too long (low throughput)
|
||||
- Need to serve more requests per second with same hardware
|
||||
|
||||
**Size Symptoms:**
|
||||
- Model too large for edge devices (e.g., 100MB+ for mobile)
|
||||
- Want to fit more models in GPU memory
|
||||
- Memory-constrained deployment environment
|
||||
|
||||
**Deployment Symptoms:**
|
||||
- Deploying to CPU servers (quantization gives 2-4× CPU speedup)
|
||||
- Deploying to edge devices (mobile, IoT, embedded systems)
|
||||
- Cost-sensitive deployment (smaller models = lower hosting costs)
|
||||
|
||||
**When NOT to use this skill:**
|
||||
- Model already fast enough and small enough (no problem to solve)
|
||||
- Deploying exclusively on GPU with no memory constraints (modest benefit)
|
||||
- Prototyping phase where optimization is premature
|
||||
- Model so small that quantization overhead not worth it (e.g., <5MB)
|
||||
|
||||
## Core Principle
|
||||
|
||||
**Quantization trades precision for performance.**
|
||||
|
||||
Quantization converts high-precision numbers (FP32: 32 bits) to low-precision integers (INT8: 8 bits or INT4: 4 bits). This provides:
|
||||
- **4-8× smaller model size** (fewer bits per parameter)
|
||||
- **2-4× faster inference on CPU** (INT8 operations faster than FP32)
|
||||
- **Small accuracy loss** (typically 0.5-1% for INT8)
|
||||
|
||||
**Formula:** Lower precision (FP32 → INT8 → INT4) = Smaller size + Faster inference + More accuracy loss
|
||||
|
||||
The skill is choosing the **right precision for your accuracy tolerance**.
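
To make the size side of that trade-off concrete, here is a back-of-envelope sketch (the parameter count is illustrative, roughly matching the ~98MB FP32 ResNet50 checkpoint used in the benchmarks later in this skill):

```python
def estimate_size_mb(num_params: int, bits: int) -> float:
    """Rough model-size estimate: parameters × bytes per parameter."""
    return num_params * (bits / 8) / 1e6

resnet50_params = 25_600_000  # ~25.6M parameters
for precision, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{precision}: ~{estimate_size_mb(resnet50_params, bits):.0f} MB")
# FP32 ≈ 102 MB, FP16 ≈ 51 MB, INT8 ≈ 26 MB, INT4 ≈ 13 MB
```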
|
||||
|
||||
## Quantization Framework
|
||||
|
||||
```
|
||||
┌────────────────────────────────────────────┐
|
||||
│ 1. Recognize Quantization Need │
|
||||
│ CPU/Edge + (Slow OR Large) │
|
||||
└──────────────┬─────────────────────────────┘
|
||||
│
|
||||
▼
|
||||
┌────────────────────────────────────────────┐
|
||||
│ 2. Choose Quantization Type │
|
||||
│ Dynamic → Static → QAT (increasing cost) │
|
||||
└──────────────┬─────────────────────────────┘
|
||||
│
|
||||
▼
|
||||
┌────────────────────────────────────────────┐
|
||||
│ 3. Calibrate (if Static/QAT) │
|
||||
│ 100-1000 representative samples │
|
||||
└──────────────┬─────────────────────────────┘
|
||||
│
|
||||
▼
|
||||
┌────────────────────────────────────────────┐
|
||||
│ 4. Validate Accuracy Trade-offs │
|
||||
│ Baseline vs Quantized accuracy │
|
||||
└──────────────┬─────────────────────────────┘
|
||||
│
|
||||
▼
|
||||
┌────────────────────────────────────────────┐
|
||||
│ 5. Decide: Accept or Iterate │
|
||||
│ <2% loss → Deploy │
|
||||
│ >2% loss → Try QAT or different precision│
|
||||
└────────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
## Part 1: Quantization Types
|
||||
|
||||
### Type 1: Dynamic Quantization
|
||||
|
||||
**What it does:** Quantizes weights to INT8, keeps activations in FP32.
|
||||
|
||||
**When to use:**
|
||||
- Simplest quantization (no calibration needed)
|
||||
- Primary goal is size reduction
|
||||
- Batch processing where latency less critical
|
||||
- Quick experiment to see if quantization helps
|
||||
|
||||
**Benefits:**
|
||||
- ✅ 4× size reduction (weights are 75% of model size)
|
||||
- ✅ 1.2-1.5× CPU speedup (modest, because activations still FP32)
|
||||
- ✅ Minimal accuracy loss (~0.2-0.5%)
|
||||
- ✅ No calibration data needed
|
||||
|
||||
**Limitations:**
|
||||
- ⚠️ Limited CPU speedup (activations still FP32)
|
||||
- ⚠️ Not optimal for edge devices needing maximum performance
|
||||
|
||||
**PyTorch implementation:**
|
||||
|
||||
```python
|
||||
import os
import torch
import torch.quantization
|
||||
|
||||
# WHY: Dynamic quantization is simplest - just one function call
|
||||
# No calibration data needed because activations stay FP32
|
||||
model = torch.load('model.pth')
|
||||
model.eval() # WHY: Must be in eval mode (no batchnorm updates)
|
||||
|
||||
# WHY: Specify which layers to quantize (Linear, LSTM, etc.)
|
||||
# These layers benefit most from quantization
|
||||
quantized_model = torch.quantization.quantize_dynamic(
|
||||
model,
|
||||
qconfig_spec={torch.nn.Linear}, # WHY: Quantize Linear layers only
|
||||
dtype=torch.qint8 # WHY: INT8 is standard precision
|
||||
)
|
||||
|
||||
# Save quantized model
|
||||
torch.save(quantized_model.state_dict(), 'model_quantized_dynamic.pth')
|
||||
|
||||
# Verify size reduction
|
||||
original_size = os.path.getsize('model.pth') / (1024 ** 2) # MB
|
||||
quantized_size = os.path.getsize('model_quantized_dynamic.pth') / (1024 ** 2)
|
||||
print(f"Original: {original_size:.1f}MB → Quantized: {quantized_size:.1f}MB")
|
||||
print(f"Size reduction: {original_size / quantized_size:.1f}×")
|
||||
```
|
||||
|
||||
**Example use case:** BERT classification model where primary goal is reducing size from 440MB to 110MB for easier deployment.
|
||||
|
||||
|
||||
### Type 2: Static Quantization (Post-Training Quantization)
|
||||
|
||||
**What it does:** Quantizes both weights and activations to INT8.
|
||||
|
||||
**When to use:**
|
||||
- Need maximum CPU speedup (2-4×)
|
||||
- Deploying to CPU servers or edge devices
|
||||
- Can afford calibration step (5-10 minutes)
|
||||
- Primary goal is inference speed
|
||||
|
||||
**Benefits:**
|
||||
- ✅ 4× size reduction (same as dynamic)
|
||||
- ✅ 2-4× CPU speedup (both weights and activations INT8)
|
||||
- ✅ No retraining required (post-training)
|
||||
- ✅ Acceptable accuracy loss (~0.5-1%)
|
||||
|
||||
**Requirements:**
|
||||
- ⚠️ Needs calibration data (100-1000 samples from validation set)
|
||||
- ⚠️ Slightly more complex setup than dynamic
|
||||
|
||||
**PyTorch implementation:**
|
||||
|
||||
```python
|
||||
import torch
|
||||
import torch.quantization
|
||||
|
||||
def calibrate_model(model, calibration_loader):
|
||||
"""
|
||||
Calibrate model by running representative data through it.
|
||||
|
||||
WHY: Static quantization needs to know activation ranges.
|
||||
Calibration finds min/max values for each activation layer.
|
||||
|
||||
Args:
|
||||
model: Model in eval mode with quantization stubs
|
||||
calibration_loader: DataLoader with 100-1000 samples
|
||||
"""
|
||||
model.eval()
|
||||
with torch.no_grad():
|
||||
for batch_idx, (data, _) in enumerate(calibration_loader):
|
||||
model(data)
|
||||
if batch_idx >= 100: # WHY: 100 batches usually sufficient
|
||||
break
|
||||
return model
|
||||
|
||||
# Step 1: Prepare model for quantization
|
||||
model = torch.load('model.pth')
|
||||
model.eval()
|
||||
|
||||
# WHY: Attach observers that will record activation ranges during calibration
# (for eager-mode static quantization the model must already contain QuantStub/DeQuantStub
#  at its input/output boundaries to mark where FP32 <-> INT8 conversion happens)
|
||||
model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
|
||||
torch.quantization.prepare(model, inplace=True)
|
||||
|
||||
# Step 2: Calibrate with representative data
|
||||
# WHY: Must use data from training/validation set, not random data
|
||||
# Calibration finds activation ranges - needs real distribution
|
||||
calibration_dataset = torch.utils.data.Subset(
|
||||
val_dataset,
|
||||
indices=range(1000) # WHY: 1000 samples sufficient for most models
|
||||
)
|
||||
calibration_loader = torch.utils.data.DataLoader(
|
||||
calibration_dataset,
|
||||
batch_size=32,
|
||||
shuffle=False # WHY: Order doesn't matter for calibration
|
||||
)
|
||||
|
||||
model = calibrate_model(model, calibration_loader)
|
||||
|
||||
# Step 3: Convert to quantized model
|
||||
torch.quantization.convert(model, inplace=True)
|
||||
|
||||
# Save quantized model
|
||||
torch.save(model.state_dict(), 'model_quantized_static.pth')
|
||||
|
||||
# Benchmark speed improvement
|
||||
import time
|
||||
|
||||
def benchmark(model, data, num_iterations=100):
|
||||
"""WHY: Warm up model first, then measure average latency."""
|
||||
model.eval()
|
||||
# Warm up (first few iterations slower)
|
||||
for _ in range(10):
|
||||
model(data)
|
||||
|
||||
start = time.time()
|
||||
with torch.no_grad():
|
||||
for _ in range(num_iterations):
|
||||
model(data)
|
||||
end = time.time()
|
||||
return (end - start) / num_iterations * 1000 # ms per inference
|
||||
|
||||
test_data = torch.randn(1, 3, 224, 224) # Example input
|
||||
|
||||
original_model = torch.load('model.pth')  # WHY: reload the FP32 model so we have a baseline to compare against
original_model.eval()
baseline_latency = benchmark(original_model, test_data)
|
||||
quantized_latency = benchmark(model, test_data)
|
||||
|
||||
print(f"Baseline: {baseline_latency:.2f}ms")
|
||||
print(f"Quantized: {quantized_latency:.2f}ms")
|
||||
print(f"Speedup: {baseline_latency / quantized_latency:.2f}×")
|
||||
```
|
||||
|
||||
**Example use case:** ResNet50 image classifier for CPU inference - need <5ms latency, achieve 4ms with static quantization (vs 15ms baseline).
|
||||
|
||||
|
||||
### Type 3: Quantization-Aware Training (QAT)
|
||||
|
||||
**What it does:** Simulates quantization during training to minimize accuracy loss.
|
||||
|
||||
**When to use:**
|
||||
- Static quantization accuracy loss too large (>2%)
|
||||
- Need best possible accuracy with INT8
|
||||
- Can afford retraining (hours to days)
|
||||
- Critical production system with strict accuracy requirements
|
||||
|
||||
**Benefits:**
|
||||
- ✅ Best accuracy (~0.1-0.3% loss vs 0.5-1% for static)
|
||||
- ✅ 4× size reduction (same as dynamic/static)
|
||||
- ✅ 2-4× CPU speedup (same as static)
|
||||
|
||||
**Limitations:**
|
||||
- ⚠️ Requires retraining (most expensive option)
|
||||
- ⚠️ Takes hours to days depending on model size
|
||||
- ⚠️ More complex implementation
|
||||
|
||||
**PyTorch implementation:**
|
||||
|
||||
```python
|
||||
import torch
|
||||
import torch.quantization
|
||||
|
||||
def train_one_epoch_qat(model, train_loader, optimizer, criterion):
|
||||
"""
|
||||
Train one epoch with quantization-aware training.
|
||||
|
||||
WHY: QAT inserts fake quantization ops during training.
|
||||
Model learns to be robust to quantization errors.
|
||||
"""
|
||||
model.train()
|
||||
for data, target in train_loader:
|
||||
optimizer.zero_grad()
|
||||
output = model(data)
|
||||
loss = criterion(output, target)
|
||||
loss.backward()
|
||||
optimizer.step()
|
||||
return model
|
||||
|
||||
# Step 1: Prepare model for QAT
|
||||
model = torch.load('model.pth')
|
||||
model.train()
|
||||
|
||||
# WHY: QAT config includes fake quantization ops
|
||||
# These simulate quantization during forward pass
|
||||
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
|
||||
torch.quantization.prepare_qat(model, inplace=True)
|
||||
|
||||
# Step 2: Train with quantization-aware training
|
||||
# WHY: Model learns to compensate for quantization errors
|
||||
optimizer = torch.optim.SGD(model.parameters(), lr=0.0001) # WHY: Low LR for fine-tuning
|
||||
criterion = torch.nn.CrossEntropyLoss()
|
||||
|
||||
num_epochs = 5 # WHY: Usually 5-10 epochs sufficient for QAT fine-tuning
|
||||
for epoch in range(num_epochs):
|
||||
model = train_one_epoch_qat(model, train_loader, optimizer, criterion)
|
||||
print(f"Epoch {epoch+1}/{num_epochs} complete")
|
||||
|
||||
# Step 3: Convert to quantized model
|
||||
model.eval()
|
||||
torch.quantization.convert(model, inplace=True)
|
||||
|
||||
# Save QAT quantized model
|
||||
torch.save(model.state_dict(), 'model_quantized_qat.pth')
|
||||
```
|
||||
|
||||
**Example use case:** Medical imaging model where accuracy is critical - static quantization gives 2% accuracy loss, QAT reduces to 0.3%.
|
||||
|
||||
|
||||
## Part 2: Quantization Type Decision Matrix
|
||||
|
||||
| Type | Complexity | Calibration | Retraining | Size Reduction | CPU Speedup | Accuracy Loss |
|
||||
|------|-----------|-------------|------------|----------------|-------------|---------------|
|
||||
| **Dynamic** | Low | No | No | 4× | 1.2-1.5× | ~0.2-0.5% |
|
||||
| **Static** | Medium | Yes | No | 4× | 2-4× | ~0.5-1% |
|
||||
| **QAT** | High | Yes | Yes | 4× | 2-4× | ~0.1-0.3% |
|
||||
|
||||
**Decision flow:**
|
||||
1. Start with **dynamic quantization**: Simplest, verify quantization helps
|
||||
2. Upgrade to **static quantization**: If need more speedup, can afford calibration
|
||||
3. Use **QAT**: Only if accuracy loss from static too large (rare)
|
||||
|
||||
**Why this order?** Incremental cost. Dynamic is nearly free (minutes of work), static is cheap (adds a short calibration step), and QAT is expensive (hours to days of retraining). Don't pay for QAT unless you need it.
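
A small helper sketch that encodes this decision flow (names and signature are hypothetical, not part of any library):

```python
def choose_quantization_type(need_cpu_speedup, can_retrain, static_loss_pct=None):
    """Mirror of the decision flow above: pay for complexity only when it buys something."""
    if not need_cpu_speedup:
        return "dynamic"  # size reduction alone → simplest option, no calibration
    if static_loss_pct is None or static_loss_pct <= 2.0:
        return "static"   # calibration buys the 2-4× CPU speedup
    return "qat" if can_retrain else "static"  # QAT only when static loses too much accuracy
```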
|
||||
|
||||
|
||||
## Part 3: Calibration Best Practices
|
||||
|
||||
### What is Calibration?
|
||||
|
||||
**Purpose:** Find min/max ranges for each activation layer.
|
||||
|
||||
**Why needed:** Static quantization needs to know activation ranges to map FP32 → INT8. Without calibration, ranges are wrong → accuracy collapses.
|
||||
|
||||
**How it works** (see the sketch below):
1. Run representative data through the model
2. Record the min/max activation values for each layer
3. Use these ranges to quantize activations at inference time
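
The sketch below is a toy illustration of what an observer records for a single layer; PyTorch's built-in observers do this automatically when you call `torch.quantization.prepare()`:

```python
import torch

class MinMaxRecorder:
    """Toy stand-in for a calibration observer on one activation tensor."""
    def __init__(self):
        self.min_val, self.max_val = float("inf"), float("-inf")

    def update(self, activations: torch.Tensor):
        # WHY: track the running min/max over all calibration batches
        self.min_val = min(self.min_val, activations.min().item())
        self.max_val = max(self.max_val, activations.max().item())

    def int8_scale(self) -> float:
        # WHY: the recorded range determines how FP32 values map onto the 256 INT8 levels
        return (self.max_val - self.min_val) / 255.0
```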
|
||||
|
||||
### Calibration Data Requirements
|
||||
|
||||
**Data source:**
|
||||
- ✅ **Use validation set samples** (matches training distribution)
|
||||
- ❌ Don't use random images from internet (different distribution)
|
||||
- ❌ Don't use single image repeated (insufficient coverage)
|
||||
- ❌ Don't use training set that doesn't match deployment (distribution shift)
|
||||
|
||||
**Data size:**
|
||||
- **Minimum:** 100 samples (sufficient for simple models)
|
||||
- **Recommended:** 500-1000 samples (better coverage)
|
||||
- **Maximum:** Full validation set is overkill (slow, no benefit)
|
||||
|
||||
**Data characteristics:**
|
||||
- Must cover range of inputs model sees in production
|
||||
- Include edge cases (bright/dark images, long/short text)
|
||||
- Distribution should match deployment, not just training
|
||||
- Class balance less important than input diversity
|
||||
|
||||
**Example calibration data selection:**
|
||||
|
||||
```python
|
||||
import torch
|
||||
import numpy as np
|
||||
|
||||
def select_calibration_data(val_dataset, num_samples=1000):
|
||||
"""
|
||||
Select diverse calibration samples from validation set.
|
||||
|
||||
WHY: Want samples that cover range of activation values.
|
||||
Random selection from validation set usually sufficient.
|
||||
|
||||
Args:
|
||||
val_dataset: Full validation dataset
|
||||
num_samples: Number of calibration samples (default 1000)
|
||||
|
||||
Returns:
|
||||
Calibration dataset subset
|
||||
"""
|
||||
# WHY: Random selection ensures diversity
|
||||
# Stratified sampling can help ensure class coverage
|
||||
indices = np.random.choice(len(val_dataset), num_samples, replace=False)
|
||||
calibration_dataset = torch.utils.data.Subset(val_dataset, indices)
|
||||
|
||||
return calibration_dataset
|
||||
|
||||
# Example: Select 1000 random samples from validation set
|
||||
calibration_dataset = select_calibration_data(val_dataset, num_samples=1000)
|
||||
calibration_loader = torch.utils.data.DataLoader(
|
||||
calibration_dataset,
|
||||
batch_size=32,
|
||||
shuffle=False # WHY: Order doesn't matter for calibration
|
||||
)
|
||||
```
|
||||
|
||||
### Common Calibration Pitfalls
|
||||
|
||||
**Pitfall 1: Using wrong data distribution**
|
||||
- ❌ "Random images from internet" for ImageNet-trained model
|
||||
- ✅ Use ImageNet validation set samples
|
||||
|
||||
**Pitfall 2: Too few samples**
|
||||
- ❌ 10 samples (insufficient coverage of activation ranges)
|
||||
- ✅ 100-1000 samples (good coverage)
|
||||
|
||||
**Pitfall 3: Using training data that doesn't match deployment**
|
||||
- ❌ Calibrate on sunny outdoor images, deploy on indoor images
|
||||
- ✅ Calibrate on data matching deployment distribution
|
||||
|
||||
**Pitfall 4: Skipping calibration validation**
|
||||
- ❌ Calibrate once, assume it works
|
||||
- ✅ Validate accuracy after calibration to verify ranges are good
|
||||
|
||||
|
||||
## Part 4: Precision Selection (INT8 vs INT4 vs FP16)
|
||||
|
||||
### Precision Spectrum
|
||||
|
||||
| Precision | Bits | Size vs FP32 | Speedup (CPU) | Typical Accuracy Loss |
|
||||
|-----------|------|--------------|---------------|----------------------|
|
||||
| **FP32** | 32 | 1× | 1× | 0% (baseline) |
|
||||
| **FP16** | 16 | 2× | 1.5× | <0.1% |
|
||||
| **INT8** | 8 | 4× | 2-4× | 0.5-1% |
|
||||
| **INT4** | 4 | 8× | 4-8× | 1-3% |
|
||||
|
||||
**Trade-off:** Lower precision = Smaller size + Faster inference + More accuracy loss
|
||||
|
||||
### When to Use Each Precision
|
||||
|
||||
**FP16 (Half Precision):**
|
||||
- GPU inference (Tensor Cores optimized for FP16)
|
||||
- Need minimal accuracy loss (<0.1%)
|
||||
- Size reduction secondary concern
|
||||
- **Example:** Large language models on GPU
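
A minimal FP16 sketch for GPU inference (assumes a CUDA device and that `model.pth` contains a full `nn.Module`, as in the earlier examples):

```python
import torch

model = torch.load('model.pth').eval().cuda()
model_fp16 = model.half()  # WHY: cast weights to FP16; Tensor Cores accelerate FP16 matmuls

x = torch.randn(1, 3, 224, 224, device="cuda").half()  # inputs must match the model dtype
with torch.no_grad():
    output = model_fp16(x)
```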
|
||||
|
||||
**INT8 (Standard Quantization):**
|
||||
- CPU inference (INT8 operations fast on CPU)
|
||||
- Edge device deployment
|
||||
- Good balance of size/speed/accuracy
|
||||
- **Most common choice** for production deployment
|
||||
- **Example:** Image classification on mobile devices
|
||||
|
||||
**INT4 (Aggressive Quantization):**
|
||||
- Extremely memory-constrained (e.g., 1GB mobile devices)
|
||||
- Can tolerate larger accuracy loss (1-3%)
|
||||
- Need maximum size reduction (8×)
|
||||
- **Use sparingly** - accuracy risk high
|
||||
- **Example:** Large language models (LLaMA-7B: 13GB → 3.5GB)
|
||||
|
||||
### Decision Flow
|
||||
|
||||
```python
|
||||
def choose_precision(accuracy_tolerance, deployment_target):
|
||||
"""
|
||||
Choose quantization precision based on requirements.
|
||||
|
||||
WHY: Different precisions for different constraints.
|
||||
INT8 is default, FP16 for GPU, INT4 for extreme memory constraints.
|
||||
"""
|
||||
if accuracy_tolerance < 0.1:
|
||||
return "FP16" # Minimal accuracy loss required
|
||||
elif deployment_target == "GPU":
|
||||
return "FP16" # GPU optimized for FP16
|
||||
elif deployment_target in ["CPU", "edge"]:
|
||||
return "INT8" # CPU optimized for INT8
|
||||
elif deployment_target == "extreme_edge" and accuracy_tolerance > 1:
|
||||
return "INT4" # Only if can tolerate 1-3% loss
|
||||
else:
|
||||
return "INT8" # Default safe choice
|
||||
```
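
Example usage (hypothetical inputs):

```python
print(choose_precision(accuracy_tolerance=0.5, deployment_target="CPU"))   # -> "INT8"
print(choose_precision(accuracy_tolerance=0.05, deployment_target="GPU"))  # -> "FP16"
```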
|
||||
|
||||
|
||||
## Part 5: ONNX Quantization (Cross-Framework)
|
||||
|
||||
**When to use:** Deploying to ONNX Runtime (CPU/edge devices) or need cross-framework compatibility.
|
||||
|
||||
### ONNX Static Quantization
|
||||
|
||||
```python
|
||||
import onnxruntime
|
||||
from onnxruntime.quantization import quantize_static, CalibrationDataReader, QuantFormat
|
||||
import numpy as np
|
||||
|
||||
class CalibrationDataReaderWrapper(CalibrationDataReader):
|
||||
"""
|
||||
WHY: ONNX requires custom calibration data reader.
|
||||
This class feeds calibration data to ONNX quantization engine.
|
||||
"""
|
||||
def __init__(self, calibration_data):
|
||||
self.calibration_data = calibration_data
|
||||
self.iterator = iter(calibration_data)
|
||||
|
||||
def get_next(self):
|
||||
"""WHY: Called by ONNX to get next calibration batch."""
|
||||
try:
|
||||
data, _ = next(self.iterator)
|
||||
return {"input": data.numpy()} # WHY: Return dict of input name → data
|
||||
except StopIteration:
|
||||
return None
|
||||
|
||||
# Step 1: Export PyTorch model to ONNX
|
||||
model = torch.load('model.pth')
|
||||
model.eval()
|
||||
dummy_input = torch.randn(1, 3, 224, 224)
|
||||
|
||||
torch.onnx.export(
|
||||
model,
|
||||
dummy_input,
|
||||
'model.onnx',
|
||||
input_names=['input'],
|
||||
output_names=['output'],
|
||||
opset_version=13 # WHY: ONNX opset 13+ supports quantization ops
|
||||
)
|
||||
|
||||
# Step 2: Prepare calibration data
|
||||
calibration_loader = torch.utils.data.DataLoader(
|
||||
calibration_dataset,
|
||||
batch_size=1, # WHY: ONNX calibration uses batch size 1
|
||||
shuffle=False
|
||||
)
|
||||
calibration_reader = CalibrationDataReaderWrapper(calibration_loader)
|
||||
|
||||
# Step 3: Quantize ONNX model
|
||||
quantize_static(
|
||||
'model.onnx',
|
||||
'model_quantized.onnx',
|
||||
calibration_data_reader=calibration_reader,
|
||||
    quant_format=QuantFormat.QDQ  # WHY: QDQ format is compatible with most backends
|
||||
)
|
||||
|
||||
# Step 4: Benchmark ONNX quantized model
|
||||
import time
|
||||
|
||||
session = onnxruntime.InferenceSession('model_quantized.onnx')
|
||||
input_name = session.get_inputs()[0].name
|
||||
|
||||
test_data = np.random.randn(1, 3, 224, 224).astype(np.float32)
|
||||
|
||||
# Warm up
|
||||
for _ in range(10):
|
||||
session.run(None, {input_name: test_data})
|
||||
|
||||
# Benchmark
|
||||
start = time.time()
|
||||
for _ in range(100):
|
||||
session.run(None, {input_name: test_data})
|
||||
end = time.time()
|
||||
|
||||
latency = (end - start) / 100 * 1000 # ms per inference
|
||||
print(f"ONNX Quantized latency: {latency:.2f}ms")
|
||||
```
|
||||
|
||||
**ONNX advantages:**
|
||||
- Cross-framework (works with PyTorch, TensorFlow, etc.)
|
||||
- Optimized ONNX Runtime for CPU inference
|
||||
- Good hardware backend support (x86, ARM)
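
If you only need the size reduction and want to skip calibration, ONNX Runtime also has a dynamic-quantization path; a minimal sketch, reusing the exported `model.onnx` from above:

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# WHY: weights-only INT8, no calibration reader needed (analogous to PyTorch dynamic quantization)
quantize_dynamic('model.onnx', 'model_quantized_dynamic.onnx', weight_type=QuantType.QInt8)
```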
|
||||
|
||||
|
||||
## Part 6: Accuracy Validation (Critical Step)
|
||||
|
||||
### Why Accuracy Validation Matters
|
||||
|
||||
Quantization is **lossy compression**. Must measure accuracy impact:
|
||||
- Some models tolerate quantization well (<0.5% loss)
|
||||
- Some models sensitive to quantization (>2% loss)
|
||||
- Some layers more sensitive than others
|
||||
- **Can't assume quantization is safe without measuring**
|
||||
|
||||
### Validation Methodology
|
||||
|
||||
```python
|
||||
def validate_quantization(original_model, quantized_model, val_loader):
|
||||
"""
|
||||
Validate quantization by comparing accuracy.
|
||||
|
||||
WHY: Quantization is lossy - must measure impact.
|
||||
Compare baseline vs quantized on same validation set.
|
||||
|
||||
Returns:
|
||||
dict with baseline_acc, quantized_acc, accuracy_loss
|
||||
"""
|
||||
def evaluate(model, data_loader):
|
||||
model.eval()
|
||||
correct = 0
|
||||
total = 0
|
||||
with torch.no_grad():
|
||||
for data, target in data_loader:
|
||||
output = model(data)
|
||||
pred = output.argmax(dim=1)
|
||||
correct += (pred == target).sum().item()
|
||||
total += target.size(0)
|
||||
return 100.0 * correct / total
|
||||
|
||||
baseline_acc = evaluate(original_model, val_loader)
|
||||
quantized_acc = evaluate(quantized_model, val_loader)
|
||||
accuracy_loss = baseline_acc - quantized_acc
|
||||
|
||||
return {
|
||||
'baseline_acc': baseline_acc,
|
||||
'quantized_acc': quantized_acc,
|
||||
'accuracy_loss': accuracy_loss,
|
||||
'acceptable': accuracy_loss < 2.0 # WHY: <2% loss usually acceptable
|
||||
}
|
||||
|
||||
# Example validation
|
||||
results = validate_quantization(original_model, quantized_model, val_loader)
|
||||
print(f"Baseline accuracy: {results['baseline_acc']:.2f}%")
|
||||
print(f"Quantized accuracy: {results['quantized_acc']:.2f}%")
|
||||
print(f"Accuracy loss: {results['accuracy_loss']:.2f}%")
|
||||
print(f"Acceptable: {results['acceptable']}")
|
||||
|
||||
# Decision logic
|
||||
if results['acceptable']:
|
||||
print("✅ Quantization acceptable - deploy quantized model")
|
||||
else:
|
||||
print("❌ Accuracy loss too large - try QAT or reconsider quantization")
|
||||
```
|
||||
|
||||
### Acceptable Accuracy Thresholds
|
||||
|
||||
**General guidelines:**
|
||||
- **<1% loss:** Excellent quantization result
|
||||
- **1-2% loss:** Acceptable for most applications
|
||||
- **2-3% loss:** Consider QAT to reduce loss
|
||||
- **>3% loss:** Quantization may not be suitable for this model
|
||||
|
||||
**Task-specific thresholds:**
|
||||
- Image classification: 1-2% top-1 accuracy loss acceptable
|
||||
- Object detection: 1-2% mAP loss acceptable
|
||||
- NLP classification: 0.5-1% accuracy loss acceptable
|
||||
- Medical/safety-critical: <0.5% loss required (use QAT)
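
A hypothetical helper that applies these thresholds to the `validate_quantization()` output from above (threshold values taken from the lists in this section):

```python
# Maximum acceptable loss in the task metric, in percentage points
TASK_LOSS_THRESHOLDS = {
    "image_classification": 2.0,  # top-1 accuracy
    "object_detection": 2.0,      # mAP
    "nlp_classification": 1.0,
    "safety_critical": 0.5,       # prefer QAT here
}

def acceptable_for_task(results: dict, task: str) -> bool:
    return results["accuracy_loss"] < TASK_LOSS_THRESHOLDS[task]
```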
|
||||
|
||||
|
||||
## Part 7: LLM Quantization (GPTQ, AWQ)
|
||||
|
||||
**Note:** This skill covers general quantization. For LLM-specific optimization (GPTQ, AWQ, KV cache, etc.), see the `llm-inference-optimization` skill in the llm-specialist pack.
|
||||
|
||||
### LLM Quantization Overview
|
||||
|
||||
**Why LLMs need quantization:**
|
||||
- Very large (7B parameters = 13GB in FP16)
|
||||
- Memory-bound inference (limited by VRAM)
|
||||
- INT4 quantization: 13GB → 3.5GB (fits in consumer GPUs)
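
The arithmetic behind those numbers (weights only; the KV cache and activations add to this at runtime):

```python
params = 7e9  # LLaMA-7B
print(f"FP16: {params * 2 / 1e9:.1f} GB of weights")    # ~14 GB (≈13 GiB)
print(f"INT4: {params * 0.5 / 1e9:.1f} GB of weights")  # ~3.5 GB
```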
|
||||
|
||||
**LLM-specific quantization methods:**
|
||||
- **GPTQ:** Post-training quantization optimized for LLMs
|
||||
- **AWQ:** Activation-aware weight quantization (better quality than GPTQ)
|
||||
- **Both:** Achieve INT4 with <0.5 perplexity increase
|
||||
|
||||
### When to Use LLM Quantization
|
||||
|
||||
✅ **Use when:**
|
||||
- Deploying LLMs locally (consumer GPUs)
|
||||
- Memory-constrained (need to fit in 12GB/24GB VRAM)
|
||||
- Cost-sensitive (smaller models cheaper to host)
|
||||
- Latency-sensitive (smaller models faster to load)
|
||||
|
||||
❌ **Don't use when:**
|
||||
- Have sufficient GPU memory for FP16
|
||||
- Accuracy critical (medical, legal applications)
|
||||
- Already using API (OpenAI, Anthropic) - they handle optimization
|
||||
|
||||
### LLM Quantization References
|
||||
|
||||
For detailed LLM quantization:
|
||||
- **See skill:** `llm-inference-optimization` (llm-specialist pack)
|
||||
- **Covers:** GPTQ, AWQ, KV cache optimization, token streaming
|
||||
- **Tools:** llama.cpp, vLLM, text-generation-inference
|
||||
|
||||
**Quick reference (defer to llm-specialist for details):**
|
||||
|
||||
```python
|
||||
# GPTQ quantization (example - see llm-specialist for full details)
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# WHY: GPTQ optimizes layer-wise for minimal perplexity increase
quantization_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)
|
||||
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
"meta-llama/Llama-2-7b-hf",
|
||||
quantization_config=quantization_config,
|
||||
device_map="auto"
|
||||
)
|
||||
|
||||
# Result: 13GB → 3.5GB, <0.5 perplexity increase
|
||||
```
|
||||
|
||||
|
||||
## Part 8: When NOT to Quantize
|
||||
|
||||
### Scenario 1: Already Fast Enough
|
||||
|
||||
**Example:** MobileNetV2 (14MB, 3ms CPU latency)
|
||||
- Quantization: 14MB → 4MB, 3ms → 2ms
|
||||
- **Benefit:** 10MB saved, 1ms faster
|
||||
- **Cost:** Calibration, validation, testing, debugging
|
||||
- **Decision:** Not worth effort unless specific requirement
|
||||
|
||||
**Rule:** If current performance meets requirements, don't optimize.
|
||||
|
||||
### Scenario 2: GPU-Only Deployment with No Memory Constraints
|
||||
|
||||
**Example:** ResNet50 on Tesla V100 with 32GB VRAM
|
||||
- Quantization: 1.5-2× GPU speedup (modest)
|
||||
- FP32 already fast on GPU (Tensor Cores optimized)
|
||||
- No memory pressure (plenty of VRAM)
|
||||
- **Decision:** Focus on other bottlenecks (data loading, I/O)
|
||||
|
||||
**Rule:** Quantization is most beneficial for CPU inference and memory-constrained GPU.
|
||||
|
||||
### Scenario 3: Accuracy-Critical Applications
|
||||
|
||||
**Example:** Medical diagnosis model where misdiagnosis has severe consequences
|
||||
- Quantization introduces accuracy loss (even if small)
|
||||
- Risk not worth benefit
|
||||
- **Decision:** Keep FP32, optimize other parts (batching, caching)
|
||||
|
||||
**Rule:** Safety-critical systems should avoid lossy compression unless thoroughly validated.
|
||||
|
||||
### Scenario 4: Prototyping Phase
|
||||
|
||||
**Example:** Early development, trying different architectures
|
||||
- Quantization is optimization - premature at prototype stage
|
||||
- Focus on getting model working first
|
||||
- **Decision:** Defer quantization until production deployment
|
||||
|
||||
**Rule:** Don't optimize until you need to (Knuth: "Premature optimization is the root of all evil").
|
||||
|
||||
|
||||
## Part 9: Quantization Benchmarks (Expected Results)
|
||||
|
||||
### Image Classification (ResNet50, ImageNet)
|
||||
|
||||
| Metric | FP32 Baseline | Dynamic INT8 | Static INT8 | QAT INT8 |
|
||||
|--------|---------------|--------------|-------------|----------|
|
||||
| Size | 98MB | 25MB (4×) | 25MB (4×) | 25MB (4×) |
|
||||
| CPU Latency | 15ms | 12ms (1.25×) | 4ms (3.75×) | 4ms (3.75×) |
|
||||
| Top-1 Accuracy | 76.1% | 75.9% (0.2% loss) | 75.3% (0.8% loss) | 75.9% (0.2% loss) |
|
||||
|
||||
**Insight:** Static quantization gives 3.75× speedup with acceptable 0.8% accuracy loss.
|
||||
|
||||
### Object Detection (YOLOv5s, COCO)
|
||||
|
||||
| Metric | FP32 Baseline | Static INT8 | QAT INT8 |
|
||||
|--------|---------------|-------------|----------|
|
||||
| Size | 14MB | 4MB (3.5×) | 4MB (3.5×) |
|
||||
| CPU Latency | 45ms | 15ms (3×) | 15ms (3×) |
|
||||
| mAP@0.5 | 37.4% | 36.8% (0.6% loss) | 37.2% (0.2% loss) |
|
||||
|
||||
**Insight:** QAT gives better accuracy (0.2% vs 0.6% loss) with same speedup.
|
||||
|
||||
### NLP Classification (BERT-base, GLUE)
|
||||
|
||||
| Metric | FP32 Baseline | Dynamic INT8 | Static INT8 |
|
||||
|--------|---------------|--------------|-------------|
|
||||
| Size | 440MB | 110MB (4×) | 110MB (4×) |
|
||||
| CPU Latency | 35ms | 28ms (1.25×) | 12ms (2.9×) |
|
||||
| Accuracy | 93.5% | 93.2% (0.3% loss) | 92.8% (0.7% loss) |
|
||||
|
||||
**Insight:** Static quantization gives 2.9× speedup but dynamic sufficient if speedup not critical.
|
||||
|
||||
### LLM Inference (LLaMA-7B)
|
||||
|
||||
| Metric | FP16 Baseline | GPTQ INT4 | AWQ INT4 |
|
||||
|--------|---------------|-----------|----------|
|
||||
| Size | 13GB | 3.5GB (3.7×) | 3.5GB (3.7×) |
|
||||
| First Token Latency | 800ms | 250ms (3.2×) | 230ms (3.5×) |
|
||||
| Perplexity | 5.68 | 5.82 (0.14 increase) | 5.77 (0.09 increase) |
|
||||
|
||||
**Insight:** AWQ gives better quality than GPTQ with similar speedup.
|
||||
|
||||
|
||||
## Part 10: Common Pitfalls and Solutions

### Pitfall 1: Skipping Accuracy Validation

**Issue:** Deploy quantized model without measuring accuracy impact.
**Risk:** Discover accuracy degradation in production (too late).
**Solution:** Always validate accuracy on representative data before deployment.

```python
# ❌ WRONG: Deploy without validation
quantized_model = quantize(model)
deploy(quantized_model)  # Hope it works!

# ✅ RIGHT: Validate before deployment
quantized_model = quantize(model)
results = validate_accuracy(original_model, quantized_model, val_loader)
if results['acceptable']:
    deploy(quantized_model)
else:
    print("Accuracy loss too large - try QAT")
```
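`validate_accuracy` above is not a library function. A minimal sketch of what such a helper might look like for a classification model, assuming a 2% loss threshold:

```python
import torch

def validate_accuracy(original_model, quantized_model, val_loader,
                      max_loss_pct: float = 2.0) -> dict:
    """Compare top-1 accuracy of the FP32 and quantized models on the same data."""
    def top1_accuracy(model) -> float:
        model.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for inputs, labels in val_loader:
                preds = model(inputs).argmax(dim=1)
                correct += (preds == labels).sum().item()
                total += labels.numel()
        return 100.0 * correct / total

    baseline = top1_accuracy(original_model)
    quantized = top1_accuracy(quantized_model)
    loss = baseline - quantized
    return {
        "baseline": baseline,
        "quantized": quantized,
        "loss": loss,
        "acceptable": loss <= max_loss_pct,
    }
```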
### Pitfall 2: Using Wrong Calibration Data

**Issue:** Calibrate with random/unrepresentative data.
**Risk:** Activation ranges are wrong → accuracy collapses.
**Solution:** Use 100-1000 samples from the validation set, matching the deployment distribution.

```python
# ❌ WRONG: Random images from the internet
calibration_data = download_random_images()

# ✅ RIGHT: Samples from the validation set
calibration_data = torch.utils.data.Subset(val_dataset, range(1000))
```
### Pitfall 3: Choosing Wrong Quantization Type

**Issue:** Use dynamic quantization when you need static quantization's speedup.
**Risk:** Get a 1.2× speedup instead of a 3× speedup.
**Solution:** Match the quantization type to your requirements (dynamic for size, static for speed).

```python
# ❌ WRONG: Use dynamic when you need speed
if need_fast_cpu_inference:
    quantized_model = torch.quantization.quantize_dynamic(model)  # Only ~1.2× speedup

# ✅ RIGHT: Use static for speed
if need_fast_cpu_inference:
    model = prepare_and_calibrate(model, calibration_data)  # insert observers + calibrate
    quantized_model = torch.quantization.convert(model)     # 2-4× speedup
```
### Pitfall 4: Quantizing GPU-Only Deployments

**Issue:** Quantize a model for GPU inference without memory pressure.
**Risk:** The effort is not worth the modest 1.5-2× GPU speedup.
**Solution:** Only quantize for GPU if memory-constrained (multiple models in VRAM).

```python
# ❌ WRONG: Quantize for GPU with no memory issue
if deployment_target == "GPU" and have_plenty_of_memory:
    quantized_model = quantize(model)  # Wasted effort

# ✅ RIGHT: Skip quantization if not needed
if deployment_target == "GPU" and have_plenty_of_memory:
    deploy(model)  # Keep FP32, focus on other optimizations
```
### Pitfall 5: Over-Quantizing (INT4 When INT8 Sufficient)

**Issue:** Use aggressive INT4 quantization when INT8 would suffice.
**Risk:** Larger accuracy loss than necessary.
**Solution:** Start with INT8 (the standard), and only use INT4 under extreme memory constraints.

```python
# ❌ WRONG: Jump to INT4 without trying INT8
quantized_model = quantize(model, precision="INT4")  # 2-3% accuracy loss

# ✅ RIGHT: Start with INT8, only use INT4 if needed
quantized_model_int8 = quantize(model, precision="INT8")  # 0.5-1% accuracy loss
if model_still_too_large:
    quantized_model_int4 = quantize(model, precision="INT4")
```
### Pitfall 6: Assuming All Layers Quantize Equally

**Issue:** Quantize all layers uniformly, even though some layers are more sensitive.
**Risk:** Accuracy loss is dominated by a few sensitive layers.
**Solution:** Use mixed precision - keep sensitive layers in FP32/INT8, quantize the others more aggressively.

```python
# ✅ ADVANCED: Mixed-precision quantization (FX graph mode)
# Keep the first/last layers in higher precision, quantize the middle layers aggressively
from torch.ao.quantization import QConfigMapping, get_default_qconfig

qconfig_mapping = QConfigMapping()
qconfig_mapping.set_global(get_default_qconfig('fbgemm'))  # INT8 default
qconfig_mapping.set_module_name('model.layer1', None)   # Keep first layer FP32
qconfig_mapping.set_module_name('model.layer10', None)  # Keep last layer FP32

# quantize_with_qconfig stands in for the FX prepare_fx / calibrate / convert_fx workflow
model = quantize_with_qconfig(model, qconfig_mapping)
```
## Part 11: Decision Framework Summary

### Step 1: Recognize Quantization Need

**Symptoms:**
- Model too slow on CPU (>10ms when you need <5ms)
- Model too large for edge devices (>50MB)
- Deploying to CPU/edge (not GPU)
- Need to reduce hosting costs

**If YES → Proceed to Step 2**
**If NO → Don't quantize, focus on other optimizations**
### Step 2: Choose Quantization Type

```
Start with Dynamic:
├─ Sufficient? (meets latency/size requirements)
│  ├─ YES → Deploy dynamic quantized model
│  └─ NO → Proceed to Static
│
Static Quantization:
├─ Sufficient? (meets latency/size + accuracy acceptable)
│  ├─ YES → Deploy static quantized model
│  └─ NO → Accuracy loss >2%
│     │
│     └─ Proceed to QAT
│
QAT:
├─ Train with quantization awareness
└─ Achieves <1% accuracy loss → Deploy
```
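The same flow, expressed as a small helper for clarity (the flag names and 2% threshold are illustrative):

```python
def choose_quantization_type(dynamic_meets_requirements: bool,
                             static_meets_requirements: bool,
                             static_accuracy_loss_pct: float) -> str:
    """Walk the decision tree above and return the approach to deploy."""
    if dynamic_meets_requirements:
        return "dynamic"   # simplest path: post-training dynamic quantization
    if static_meets_requirements and static_accuracy_loss_pct <= 2.0:
        return "static"    # calibrated post-training static quantization
    return "qat"           # accuracy loss >2% → quantization-aware training
```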
### Step 3: Calibrate (if Static/QAT)

**Calibration data:**
- Source: Validation set (representative samples)
- Size: 100-1000 samples
- Characteristics: Match deployment distribution

**Calibration process:**
1. Select samples from the validation set
2. Run them through the model to collect activation ranges
3. Validate accuracy after calibration
4. If accuracy loss >2%, try different calibration data or QAT
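A minimal eager-mode PyTorch sketch of steps 1-2, assuming a CPU (x86/fbgemm) target, the `model` and `val_dataset` from earlier snippets, and a model that already has QuantStub/DeQuantStub wrappers and fused modules where required:

```python
import torch
from torch.ao.quantization import get_default_qconfig, prepare, convert

model.eval()
model.qconfig = get_default_qconfig("fbgemm")   # x86 backend ("qnnpack" for ARM/edge)

prepared = prepare(model)                        # insert observers

# Step 1: representative samples from the validation set
calibration_set = torch.utils.data.Subset(val_dataset, range(1000))
loader = torch.utils.data.DataLoader(calibration_set, batch_size=32)

# Step 2: run them through the model to collect activation ranges
with torch.no_grad():
    for inputs, _ in loader:
        prepared(inputs)

quantized_model = convert(prepared)              # then validate accuracy (steps 3-4)
```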
### Step 4: Validate Accuracy

**Required measurements:**
- Baseline accuracy (FP32)
- Quantized accuracy (INT8/INT4)
- Accuracy loss (baseline - quantized)
- Acceptable threshold (typically <2%)

**Decision:**
- If accuracy loss <2% → Deploy
- If accuracy loss >2% → Try QAT or reconsider quantization
### Step 5: Benchmark Performance

**Required measurements:**
- Model size (MB): baseline vs quantized
- Inference latency (ms): baseline vs quantized
- Throughput (requests/sec): baseline vs quantized

**Verify expected results:**
- Size: 4× reduction (FP32 → INT8)
- CPU speedup: 2-4× (static quantization)
- GPU speedup: 1.5-2× (if applicable)
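Size and latency can be measured with the helpers sketched in Part 9; a complementary throughput sketch for batched CPU inference (the helper name, batch size, and window length are arbitrary choices):

```python
import time

import torch

def measure_throughput(model: torch.nn.Module, example_batch: torch.Tensor,
                       duration_s: float = 10.0) -> float:
    """Samples processed per second over a fixed wall-clock window."""
    model.eval()
    processed = 0
    start = time.perf_counter()
    with torch.no_grad():
        while time.perf_counter() - start < duration_s:
            model(example_batch)
            processed += example_batch.shape[0]
    return processed / (time.perf_counter() - start)

# Usage (illustrative): compare FP32 vs quantized on the same batch
# fp32_tps = measure_throughput(model, torch.randn(32, 3, 224, 224))
# int8_tps = measure_throughput(quantized_model, torch.randn(32, 3, 224, 224))
```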
## Part 12: Production Deployment Checklist

Before deploying a quantized model to production:

**✅ Accuracy Validated**
- [ ] Baseline accuracy measured on validation set
- [ ] Quantized accuracy measured on same validation set
- [ ] Accuracy loss within acceptable threshold (<2%)
- [ ] Validated on representative production data

**✅ Performance Benchmarked**
- [ ] Size reduction measured (expect 4× for INT8)
- [ ] Latency improvement measured (expect 2-4× on CPU)
- [ ] Throughput improvement measured
- [ ] Performance meets requirements

**✅ Calibration Verified** (if static/QAT)
- [ ] Used representative samples from validation set (not random data)
- [ ] Used sufficient calibration data (100-1000 samples)
- [ ] Calibration data matches deployment distribution

**✅ Edge Cases Tested**
- [ ] Tested on diverse inputs (bright/dark images, long/short text)
- [ ] Validated numerical stability (no NaN/Inf outputs - see the sketch after this list)
- [ ] Tested inference on target hardware (CPU/GPU/edge device)
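A minimal sketch of the numerical-stability check referenced above (the helper name is illustrative):

```python
import torch

def check_numerical_stability(model: torch.nn.Module, loader) -> bool:
    """Return False if any batch produces NaN or Inf outputs."""
    model.eval()
    with torch.no_grad():
        for i, (inputs, _) in enumerate(loader):
            outputs = model(inputs)
            if not torch.isfinite(outputs).all():
                print(f"Non-finite outputs detected in batch {i}")
                return False
    return True
```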
**✅ Rollback Plan**
- [ ] Can easily revert to FP32 model if issues found
- [ ] Monitoring in place to detect accuracy degradation
- [ ] A/B testing plan to compare FP32 vs quantized
## Skill Mastery Checklist

You have mastered quantization for inference when you can:

- [ ] Recognize when quantization is appropriate (CPU/edge deployment, size/speed issues)
- [ ] Choose the correct quantization type (dynamic vs static vs QAT) based on requirements
- [ ] Implement dynamic quantization in PyTorch (5 lines of code)
- [ ] Implement static quantization with proper calibration (20 lines of code)
- [ ] Select appropriate calibration data (validation set, 100-1000 samples)
- [ ] Validate accuracy trade-offs systematically (baseline vs quantized)
- [ ] Benchmark performance improvements (size, latency, throughput)
- [ ] Decide when NOT to quantize (GPU-only, already fast, accuracy-critical)
- [ ] Debug quantization issues (accuracy collapse, wrong speedup, numerical instability)
- [ ] Deploy quantized models to production with confidence

**Key insight:** Quantization is not magic - it's a systematic trade-off of precision for performance. The skill is matching the right quantization approach to your specific requirements.
2823
skills/using-ml-production/scaling-and-load-balancing.md
Normal file
File diff suppressed because it is too large