Initial commit

This commit is contained in:
Zhongwei Li
2025-11-30 08:24:38 +08:00
commit b41966ed51
12 changed files with 3508 additions and 0 deletions

View File

@@ -0,0 +1,524 @@
# Cloudflare Workers AI - Best Practices
Production-tested patterns for building reliable, cost-effective AI applications with Workers AI.
---
## Table of Contents
1. [Streaming Best Practices](#streaming-best-practices)
2. [Error Handling](#error-handling)
3. [Cost Optimization](#cost-optimization)
4. [Performance Optimization](#performance-optimization)
5. [Security](#security)
6. [Monitoring & Observability](#monitoring--observability)
7. [Production Checklist](#production-checklist)
---
## Streaming Best Practices
### Why Streaming is Essential
**❌ Without streaming:**
- Buffers entire response in memory
- Higher latency (wait for complete response)
- Risk of Worker timeout (30s default)
- Poor user experience for long content
**✅ With streaming:**
- Immediate first token
- Lower memory usage
- Better UX (progressive rendering)
- No timeout issues
### Implementation
```typescript
// Always use stream: true for text generation
const stream = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
  messages: [{ role: 'user', content: prompt }],
  stream: true, // CRITICAL
});

return new Response(stream, {
  headers: { 'content-type': 'text/event-stream' },
});
```
### Client-Side Handling
```typescript
const response = await fetch('/api/chat', {
  method: 'POST',
  body: JSON.stringify({ prompt }),
});

const reader = response.body!.getReader(); // body is non-null on a successful streamed response
const decoder = new TextDecoder();

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  const chunk = decoder.decode(value, { stream: true }); // stream: true handles multi-byte chars split across reads
  // Update UI with chunk
}
```
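Workers AI streams arrive as server-sent events (`data: {...}` lines, terminated by `data: [DONE]`), so the raw chunks above still need parsing before rendering. A minimal sketch of that step, assuming the Worker forwards the event stream unchanged; `extractTokens` is an illustrative helper:
```typescript
// Parse SSE lines out of each decoded chunk; a chunk may contain several
// events or end mid-line, so buffer across reads.
let buffer = '';

function extractTokens(chunk: string): string[] {
  buffer += chunk;
  const lines = buffer.split('\n');
  buffer = lines.pop() ?? ''; // keep any incomplete trailing line for the next chunk
  const tokens: string[] = [];
  for (const line of lines) {
    if (!line.startsWith('data: ')) continue;
    const payload = line.slice('data: '.length);
    if (payload === '[DONE]') continue; // end-of-stream sentinel
    try {
      tokens.push(JSON.parse(payload).response ?? '');
    } catch {
      // ignore malformed events
    }
  }
  return tokens;
}
```
Call `extractTokens(decoder.decode(value, { stream: true }))` inside the read loop above and append the returned tokens to the UI.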
---
## Error Handling
### 1. Rate Limit Errors (429)
**Pattern: Exponential Backoff**
```typescript
async function runWithRetry(
  ai: Ai,
  model: string,
  inputs: any,
  maxRetries = 3
): Promise<any> {
  let delay = 1000;
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await ai.run(model, inputs);
    } catch (error) {
      const message = (error as Error).message.toLowerCase();
      if (message.includes('429') || message.includes('rate limit')) {
        if (i < maxRetries - 1) {
          await new Promise((resolve) => setTimeout(resolve, delay));
          delay *= 2; // Exponential backoff: 1s, 2s, 4s
          continue;
        }
      }
      throw error;
    }
  }
  throw new Error('Retry limit exceeded'); // unreachable, but keeps all code paths explicit
}
```
### 2. Model Unavailable
**Pattern: Fallback Models**
```typescript
const models = [
  '@cf/meta/llama-3.1-8b-instruct', // Primary
  '@cf/meta/llama-3.2-1b-instruct', // Fallback (faster)
  '@cf/qwen/qwen1.5-7b-chat-awq',   // Fallback (alternative)
];

async function runWithFallback(ai: Ai, inputs: any): Promise<any> {
  for (const model of models) {
    try {
      return await ai.run(model, inputs);
    } catch (error) {
      const message = (error as Error).message.toLowerCase();
      if (!message.includes('unavailable')) throw error;
      // Try next model
    }
  }
  throw new Error('All models unavailable');
}
```
### 3. Token Limit Exceeded
**Pattern: Input Validation**
```typescript
function estimateTokens(text: string): number {
  // Rough heuristic: ~4 characters per token for English text
  return Math.ceil(text.length / 4);
}

function validateInput(text: string, maxTokens = 2048): void {
  const tokens = estimateTokens(text);
  if (tokens > maxTokens) {
    throw new Error(`Input too long: ${tokens} tokens (max: ${maxTokens})`);
  }
}

// Usage
try {
  validateInput(userInput);
  const response = await env.AI.run(model, { prompt: userInput });
} catch (error) {
  return c.json({ error: (error as Error).message }, 400);
}
```
---
## Cost Optimization
### 1. Use AI Gateway for Caching
**Without AI Gateway:**
- Same prompt = new inference = cost
**With AI Gateway:**
- Same prompt = cached response = free
```typescript
const response = await env.AI.run(
  '@cf/meta/llama-3.1-8b-instruct',
  { messages },
  { gateway: { id: 'my-gateway' } } // Enable caching
);
```
**Savings**: 50-90% for repeated queries
### 2. Choose the Right Model
**Cost Comparison** (per 1M output tokens):
| Model | Cost | Use Case |
|-------|------|----------|
| Llama 3.2 1B | $0.201 | Simple tasks, high volume |
| Llama 3.1 8B | $0.606 | General purpose |
| Qwen 1.5 14B | $1.20+ | Complex reasoning |
**Strategy**: Use the smallest model that meets your quality requirements (a routing sketch follows).
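One way to apply this is a small router keyed by task complexity. A minimal sketch, where `TaskComplexity`, its labels, and the model mapping are illustrative assumptions rather than an official API:
```typescript
type TaskComplexity = 'simple' | 'general' | 'complex';

// Hypothetical mapping: route each task to the cheapest adequate model.
const MODEL_BY_COMPLEXITY: Record<TaskComplexity, string> = {
  simple: '@cf/meta/llama-3.2-1b-instruct',  // $0.201 / 1M output tokens
  general: '@cf/meta/llama-3.1-8b-instruct', // $0.606 / 1M output tokens
  complex: '@cf/qwen/qwen1.5-14b-chat-awq',  // $1.20+ / 1M output tokens
};

async function runForTask(
  ai: Ai,
  complexity: TaskComplexity,
  messages: { role: string; content: string }[]
): Promise<any> {
  return ai.run(MODEL_BY_COMPLEXITY[complexity], { messages });
}
```
How you classify a task (prompt length, user tier, a cheap classifier model) is application-specific; the point is that the escalation decision lives in one place.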
### 3. Limit Output Length
```typescript
const response = await env.AI.run(model, {
  messages,
  max_tokens: 256, // Limit output (default: varies)
});
```
### 4. Batch Embeddings
```typescript
// ❌ Bad: 100 separate requests
for (const text of texts) {
  await env.AI.run('@cf/baai/bge-base-en-v1.5', { text: [text] });
}

// ✅ Good: 1 batch request
await env.AI.run('@cf/baai/bge-base-en-v1.5', {
  text: texts, // Up to 100 texts per request
});
```
### 5. Monitor Neurons Usage
```typescript
app.use('*', async (c, next) => {
  const start = Date.now();
  await next();
  console.log({
    path: c.req.path,
    duration: Date.now() - start,
    logId: c.env.AI.aiGatewayLogId, // Check dashboard for neurons
  });
});
```
---
## Performance Optimization
### 1. Use Faster Models When Appropriate
**Speed Ranking** (fastest to slowest):
1. `@cf/qwen/qwen1.5-0.5b-chat` (1500/min)
2. `@cf/meta/llama-3.2-1b-instruct` (300/min)
3. `@cf/tinyllama/tinyllama-1.1b-chat-v1.0` (720/min)
4. `@hf/thebloke/mistral-7b-instruct-v0.1-awq` (400/min)
5. `@cf/meta/llama-3.1-8b-instruct` (300/min)
6. `@cf/qwen/qwen1.5-14b-chat-awq` (150/min)
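Published rate limits say nothing about actual latency, which varies by region and load, so it can be worth timing candidate models against a representative prompt before choosing. A rough benchmarking sketch (the model list and prompt are placeholders):
```typescript
// Time each candidate model on the same prompt and log wall-clock latency.
async function benchmark(ai: Ai, models: string[], prompt: string): Promise<void> {
  for (const model of models) {
    const start = Date.now();
    await ai.run(model, { prompt, max_tokens: 64 }); // cap output so runs are comparable
    console.log(`${model}: ${Date.now() - start}ms`);
  }
}

await benchmark(
  env.AI,
  ['@cf/qwen/qwen1.5-0.5b-chat', '@cf/meta/llama-3.2-1b-instruct'],
  'Summarize: Workers AI runs models on Cloudflare\'s network.'
);
```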
### 2. Parallel Requests
```typescript
// Process multiple tasks in parallel
const [summary, keywords, sentiment] = await Promise.all([
  env.AI.run(model, { prompt: `Summarize: ${text}` }),
  env.AI.run(model, { prompt: `Extract keywords: ${text}` }),
  env.AI.run(model, { prompt: `Sentiment: ${text}` }),
]);
```
### 3. Edge Caching for Static Prompts
```typescript
// Cache AI responses in KV, keyed by a hash of the prompt
async function hashPrompt(text: string): Promise<string> {
  const digest = await crypto.subtle.digest('SHA-256', new TextEncoder().encode(text));
  return [...new Uint8Array(digest)].map((b) => b.toString(16).padStart(2, '0')).join('');
}

const cacheKey = `ai:${await hashPrompt(prompt)}`;
let response = await env.CACHE.get(cacheKey);

if (!response) {
  const result = await env.AI.run(model, { prompt });
  response = result.response;
  await env.CACHE.put(cacheKey, response, { expirationTtl: 3600 }); // 1 hour
}
```
---
## Security
### 1. Never Expose API Keys
**❌ Bad:**
```typescript
const openai = new OpenAI({
  apiKey: 'sk-1234...', // Hardcoded!
});
```
**✅ Good:**
```typescript
const openai = new OpenAI({
  apiKey: env.OPENAI_API_KEY, // Environment variable
});
```
### 2. Input Sanitization
```typescript
function sanitizeInput(text: string): string {
  // Minimal first-pass filter for prompt-injection markers;
  // combine with system-prompt hardening and output review
  return text
    .replace(/\{system\}/gi, '')
    .replace(/\{assistant\}/gi, '')
    .trim();
}

const prompt = sanitizeInput(userInput);
```
### 3. Rate Limiting Per User
```typescript
import { RateLimiter } from '@/lib/rate-limiter';

const limiter = new RateLimiter({
  requests: 10,
  window: 60, // 10 requests per minute
});

app.post('/chat', async (c) => {
  const userId = c.req.header('x-user-id') ?? 'anonymous'; // fall back when the header is missing
  if (!(await limiter.check(userId))) {
    return c.json({ error: 'Rate limit exceeded' }, 429);
  }
  // Process request...
});
```
### 4. Content Filtering
```typescript
// NOTE: these patterns are deliberately crude (e.g. /hack/i also matches
// "hackathon"); tune the list for your application
const BLOCKED_PATTERNS = [
  /generate.*exploit/i,
  /create.*malware/i,
  /hack/i,
];

function isSafePrompt(prompt: string): boolean {
  return !BLOCKED_PATTERNS.some((pattern) => pattern.test(prompt));
}
```
---
## Monitoring & Observability
### 1. Structured Logging
```typescript
interface AILog {
  timestamp: string;
  model: string;
  duration: number;
  success: boolean;
  error?: string;
  logId?: string;
}

async function logAIRequest(log: AILog): Promise<void> {
  console.log(JSON.stringify(log));
  // Or send to logging service (Datadog, Sentry, etc.)
}
```
### 2. Error Tracking
```typescript
app.onError((err, c) => {
  console.error({
    error: err.message,
    stack: err.stack,
    path: c.req.path,
    timestamp: new Date().toISOString(),
  });
  return c.json({ error: 'Internal server error' }, 500);
});
```
### 3. Performance Metrics
```typescript
// Note: module-scope counters are per-isolate and reset when the isolate is recycled
const metrics = {
  requests: 0,
  errors: 0,
  totalDuration: 0,
};

app.use('*', async (c, next) => {
  metrics.requests++;
  const start = Date.now();
  try {
    await next();
  } catch (error) {
    metrics.errors++;
    throw error;
  } finally {
    metrics.totalDuration += Date.now() - start;
  }
});

app.get('/metrics', (c) => {
  return c.json({
    ...metrics,
    avgDuration: metrics.totalDuration / metrics.requests,
    errorRate: (metrics.errors / metrics.requests) * 100,
  });
});
```
---
## Production Checklist
### Before Deploying
- [ ] **Streaming enabled** for all text generation endpoints
- [ ] **AI Gateway configured** for cost tracking and caching
- [ ] **Error handling** with retry logic for rate limits
- [ ] **Input validation** to prevent token limit errors
- [ ] **Rate limiting** implemented per user
- [ ] **Monitoring** and logging configured
- [ ] **Model selection** optimized for cost/quality balance
- [ ] **Fallback models** configured for high availability
- [ ] **Security** review completed (input sanitization, content filtering)
- [ ] **Load testing** completed with expected traffic
- [ ] **Cost estimation** based on expected usage
- [ ] **Documentation** for API endpoints and error codes
### Cost Planning
**Estimate your costs:**
1. Expected requests/day: _____
2. Avg tokens per request (input + output): _____
3. Model neurons cost: _____
4. Daily neurons = requests × (avg tokens ÷ 1,000) × neurons per 1K tokens
5. Daily cost = (daily neurons − 10,000 free) ÷ 1,000 × $0.011
**Example:**
- 10,000 requests/day
- 500 tokens/request (avg)
- Llama 3.1 8B: ~50 neurons/1K tokens
- Daily neurons: 10,000 × (500 ÷ 1,000) × 50 = 250,000
- Daily cost: (250,000 - 10,000) / 1,000 × $0.011 = **$2.64**
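The same arithmetic as a small helper keeps estimates consistent as the inputs change. A sketch mirroring the worksheet above; neurons-per-1K-tokens is model-specific, so the value passed in the usage line is the same ~50 assumed for Llama 3.1 8B:
```typescript
const FREE_NEURONS_PER_DAY = 10_000;
const USD_PER_1K_NEURONS = 0.011;

// Estimate daily cost from the worksheet inputs above.
function estimateDailyCost(
  requestsPerDay: number,
  avgTokensPerRequest: number,
  neuronsPer1kTokens: number // model-specific; ~50 assumed for Llama 3.1 8B
): number {
  const dailyNeurons =
    requestsPerDay * (avgTokensPerRequest / 1_000) * neuronsPer1kTokens;
  const billable = Math.max(0, dailyNeurons - FREE_NEURONS_PER_DAY);
  return (billable / 1_000) * USD_PER_1K_NEURONS;
}

// Reproduces the example: 10,000 req/day × 500 tokens × 50 neurons/1K → 2.64
console.log(estimateDailyCost(10_000, 500, 50).toFixed(2));
```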
### Performance Targets
- **Time to first token**: <500ms
- **Avg response time**: <2s (streaming)
- **Error rate**: <1%
- **Cache hit rate**: >50% (with AI Gateway)
---
## Common Patterns
### 1. RAG Pattern
```typescript
// 1. Generate query embedding
const embeddings = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
  text: [query],
});

// 2. Search Vectorize for the closest stored chunks
const results = await env.VECTORIZE.query(embeddings.data[0], {
  topK: 3,
  returnMetadata: true, // include stored metadata (e.g. the source text) in matches
});

// 3. Build context from the matched chunks
const context = results.matches.map((m) => m.metadata.text).join('\n\n');

// 4. Generate response with context
const stream = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
  messages: [
    { role: 'system', content: `Context:\n${context}` },
    { role: 'user', content: query },
  ],
  stream: true,
});
```
### 2. Multi-Model Consensus
```typescript
const models = [
  '@cf/meta/llama-3.1-8b-instruct',
  '@cf/qwen/qwen1.5-7b-chat-awq',
  '@hf/thebloke/mistral-7b-instruct-v0.1-awq',
];

const responses = await Promise.all(
  models.map((model) => env.AI.run(model, { prompt }))
);
// Combine or compare responses
```
### 3. Progressive Enhancement
```typescript
// Start with a fast model, upgrade if needed
let response = await env.AI.run('@cf/meta/llama-3.2-1b-instruct', {
  prompt,
});

// Check quality (a simple length heuristic here; coherence checks also work)
if (response.response.length < 50) {
  // Retry with a stronger model
  response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
    prompt,
  });
}
```
---
## References
- [Workers AI Documentation](https://developers.cloudflare.com/workers-ai/)
- [AI Gateway](https://developers.cloudflare.com/ai-gateway/)
- [Pricing](https://developers.cloudflare.com/workers-ai/platform/pricing/)
- [Limits](https://developers.cloudflare.com/workers-ai/platform/limits/)
- [Models Catalog](https://developers.cloudflare.com/workers-ai/models/)

View File

@@ -0,0 +1,245 @@
# Cloudflare Workers AI - Models Catalog
Complete catalog of Workers AI models organized by task type.
**Last Updated**: 2025-10-21
**Official Catalog**: https://developers.cloudflare.com/workers-ai/models/
---
## Text Generation (LLMs)
### Meta Llama Models
| Model ID | Size | Best For | Rate Limit |
|----------|------|----------|------------|
| `@cf/meta/llama-3.1-8b-instruct` | 8B | General purpose, balanced | 300/min |
| `@cf/meta/llama-3.1-8b-instruct-fast` | 8B | Faster inference | 300/min |
| `@cf/meta/llama-3.2-1b-instruct` | 1B | Ultra-fast, simple tasks | 300/min |
| `@cf/meta/llama-3.2-3b-instruct` | 3B | Fast, good quality | 300/min |
| `@cf/meta/llama-2-7b-chat-int8` | 7B | Legacy, reliable | 300/min |
| `@cf/meta/llama-2-13b-chat-awq` | 13B | Higher quality (slower) | 300/min |
### Qwen Models
| Model ID | Size | Best For | Rate Limit |
|----------|------|----------|------------|
| `@cf/qwen/qwen1.5-14b-chat-awq` | 14B | High quality, complex reasoning | 150/min |
| `@cf/qwen/qwen1.5-7b-chat-awq` | 7B | Balanced quality/speed | 300/min |
| `@cf/qwen/qwen1.5-1.8b-chat` | 1.8B | Fast, lightweight | 720/min |
| `@cf/qwen/qwen1.5-0.5b-chat` | 0.5B | Ultra-fast, ultra-lightweight | 1500/min |
### Mistral Models
| Model ID | Size | Best For | Rate Limit |
|----------|------|----------|------------|
| `@hf/thebloke/mistral-7b-instruct-v0.1-awq` | 7B | Fast, efficient | 400/min |
| `@hf/thebloke/openhermes-2.5-mistral-7b-awq` | 7B | Instruction following | 300/min |
### DeepSeek Models
| Model ID | Size | Best For | Rate Limit |
|----------|------|----------|------------|
| `@cf/deepseek-ai/deepseek-r1-distill-qwen-32b` | 32B | Coding, technical content | 300/min |
| `@cf/deepseek-ai/deepseek-coder-6.7b-instruct-awq` | 6.7B | Code generation | 300/min |
### Other Models
| Model ID | Size | Best For | Rate Limit |
|----------|------|----------|------------|
| `@cf/tinyllama/tinyllama-1.1b-chat-v1.0` | 1.1B | Extremely fast, limited capability | 720/min |
| `@cf/microsoft/phi-2` | 2.7B | Fast, efficient | 720/min |
| `@cf/google/gemma-2b-it-lora` | 2B | Instruction tuned | 300/min |
| `@cf/google/gemma-7b-it-lora` | 7B | Higher quality | 300/min |
---
## Text Embeddings
| Model ID | Dimensions | Best For | Rate Limit |
|----------|-----------|----------|------------|
| `@cf/baai/bge-base-en-v1.5` | 768 | General purpose RAG | 3000/min |
| `@cf/baai/bge-large-en-v1.5` | 1024 | High accuracy search | 1500/min |
| `@cf/baai/bge-small-en-v1.5` | 384 | Fast, low storage | 3000/min |
| `@cf/baai/bge-m3` | 1024 | Multilingual | 3000/min |
**Use Case**: RAG, semantic search, similarity detection, clustering
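A minimal embedding call for the general-purpose model; the vectors come back under `data`:
```typescript
// Embed a batch of texts with the 768-dimension general-purpose model.
const { data } = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
  text: ['How do I deploy a Worker?', 'What is Vectorize?'],
});
console.log(data.length, data[0].length); // 2 vectors, 768 dimensions each
```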
---
## Image Generation
| Model ID | Type | Best For | Rate Limit |
|----------|------|----------|------------|
| `@cf/black-forest-labs/flux-1-schnell` | Text-to-Image | Photorealistic, high quality | 720/min |
| `@cf/stabilityai/stable-diffusion-xl-base-1.0` | Text-to-Image | General purpose | 720/min |
| `@cf/lykon/dreamshaper-8-lcm` | Text-to-Image | Artistic, stylized | 720/min |
| `@cf/runwayml/stable-diffusion-v1-5-img2img` | Image-to-Image | Transform images | 1500/min |
| `@cf/runwayml/stable-diffusion-v1-5-inpainting` | Inpainting | Edit specific areas | 1500/min |
| `@cf/bytedance/stable-diffusion-xl-lightning` | Text-to-Image | Fast generation | 720/min |
**Output**: PNG images (~5 MB max)
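Response handling differs by model: the Stable Diffusion models return raw image bytes, while `flux-1-schnell` returns JSON with a base64-encoded image. A minimal sketch for the binary case (the prompt is a placeholder):
```typescript
// SDXL returns raw PNG bytes; pass them straight through to the client.
const image = await env.AI.run('@cf/stabilityai/stable-diffusion-xl-base-1.0', {
  prompt: 'A lighthouse at dusk, photorealistic',
});
return new Response(image, {
  headers: { 'content-type': 'image/png' },
});
```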
---
## Vision Models
| Model ID | Task | Best For | Rate Limit |
|----------|------|----------|------------|
| `@cf/meta/llama-3.2-11b-vision-instruct` | Image Understanding | Q&A, captioning, analysis | 720/min |
| `@cf/unum/uform-gen2-qwen-500m` | Image Captioning | Fast captions | 720/min |
**Input**: Base64-encoded images
---
## Translation
| Model ID | Languages | Rate Limit |
|----------|-----------|------------|
| `@cf/meta/m2m100-1.2b` | 100+ languages | 720/min |
**Supported Language Pairs**: https://developers.cloudflare.com/workers-ai/models/m2m100-1.2b/
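A minimal translation call; the language names follow the model's input schema:
```typescript
// Translate English to French with m2m100.
const result = await env.AI.run('@cf/meta/m2m100-1.2b', {
  text: 'Hello, how are you?',
  source_lang: 'english',
  target_lang: 'french',
});
console.log(result.translated_text);
```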
---
## Text Classification
| Model ID | Task | Rate Limit |
|----------|------|------------|
| `@cf/huggingface/distilbert-sst-2-int8` | Sentiment analysis | 2000/min |
| `@hf/thebloke/openhermes-2.5-mistral-7b-awq` | General classification | 300/min |
**Output**: Label + confidence score
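A minimal sentiment call; the response is a list of labels with confidence scores:
```typescript
// DistilBERT returns one entry per label with a confidence score.
const results = await env.AI.run('@cf/huggingface/distilbert-sst-2-int8', {
  text: 'This pizza is amazing!',
});
// e.g. [{ label: 'NEGATIVE', score: 0.0001 }, { label: 'POSITIVE', score: 0.9998 }]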
---
## Automatic Speech Recognition
| Model ID | Best For | Rate Limit |
|----------|----------|------------|
| `@cf/openai/whisper` | General transcription | 720/min |
| `@cf/openai/whisper-tiny-en` | English only, fast | 720/min |
**Input**: Audio files (MP3, WAV, etc.)
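A minimal transcription call; Whisper takes the audio bytes as a plain number array (the URL is a placeholder):
```typescript
// Fetch an audio file and pass its bytes to Whisper.
const res = await fetch('https://example.com/audio.mp3');
const blob = await res.arrayBuffer();
const transcript = await env.AI.run('@cf/openai/whisper', {
  audio: [...new Uint8Array(blob)],
});
console.log(transcript.text);
```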
---
## Object Detection
| Model ID | Task | Rate Limit |
|----------|------|------------|
| `@cf/facebook/detr-resnet-50` | Object detection | 3000/min |
**Output**: Bounding boxes + labels
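A minimal detection call; the response shape in the comment follows the model docs (the URL is a placeholder):
```typescript
// Pass image bytes; the model returns labeled bounding boxes with scores.
const imageRes = await fetch('https://example.com/photo.jpg');
const bytes = [...new Uint8Array(await imageRes.arrayBuffer())];
const detections = await env.AI.run('@cf/facebook/detr-resnet-50', {
  image: bytes,
});
// e.g. [{ label: 'cat', score: 0.99, box: { xmin, ymin, xmax, ymax } }]
```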
---
## Image Classification
| Model ID | Classes | Rate Limit |
|----------|---------|------------|
| `@cf/microsoft/resnet-50` | 1000 ImageNet classes | 3000/min |
**Output**: Top-5 predictions with probabilities
---
## Summarization
| Model ID | Best For | Rate Limit |
|----------|----------|------------|
| `@cf/facebook/bart-large-cnn` | News articles, documents | 1500/min |
---
## Image-to-Image (Legacy)
| Model ID | Type | Rate Limit |
|----------|------|------------|
| `@cf/stabilityai/stable-diffusion-v1-5-img2img` | Image-to-Image | 1500/min |
---
## Model Selection Guide
### For Text Generation
**Speed Priority:**
1. `@cf/qwen/qwen1.5-0.5b-chat` (1500/min)
2. `@cf/meta/llama-3.2-1b-instruct` (300/min)
3. `@cf/tinyllama/tinyllama-1.1b-chat-v1.0` (720/min)
**Quality Priority:**
1. `@cf/qwen/qwen1.5-14b-chat-awq` (150/min)
2. `@cf/deepseek-ai/deepseek-r1-distill-qwen-32b` (300/min)
3. `@cf/meta/llama-3.1-8b-instruct` (300/min)
**Balanced:**
1. `@cf/meta/llama-3.1-8b-instruct` (300/min)
2. `@hf/thebloke/mistral-7b-instruct-v0.1-awq` (400/min)
3. `@cf/qwen/qwen1.5-7b-chat-awq` (300/min)
### For Embeddings
**General Purpose RAG:**
- `@cf/baai/bge-base-en-v1.5` (768 dims, 3000/min)
**High Accuracy:**
- `@cf/baai/bge-large-en-v1.5` (1024 dims, 1500/min)
**Fast/Low Storage:**
- `@cf/baai/bge-small-en-v1.5` (384 dims, 3000/min)
### For Image Generation
**Best Quality:**
- `@cf/black-forest-labs/flux-1-schnell`
**General Purpose:**
- `@cf/stabilityai/stable-diffusion-xl-base-1.0`
**Artistic/Stylized:**
- `@cf/lykon/dreamshaper-8-lcm`
**Fast:**
- `@cf/bytedance/stable-diffusion-xl-lightning`
---
## Rate Limits Summary
| Task Type | Default Limit | Exceptions |
|-----------|---------------|------------|
| Text Generation | 300/min | 150-1500/min depending on model |
| Text Embeddings | 3000/min | 1500/min for bge-large |
| Image Generation | 720/min | 1500/min for img2img / inpainting |
| Vision Models | 720/min | — |
| Translation | 720/min | — |
| Classification | 2000/min | 300/min for openhermes |
| Speech Recognition | 720/min | — |
| Object Detection | 3000/min | — |
---
## Pricing (Neurons)
Pricing varies by model. Common examples:
| Model | Input (1M tokens) | Output (1M tokens) |
|-------|-------------------|-------------------|
| Llama 3.2 1B | $0.027 | $0.201 |
| Llama 3.1 8B | $0.088 | $0.606 |
| BGE-base embeddings | $0.005 | N/A |
| Flux image gen | ~$0.011/image | N/A |
**Free Tier**: 10,000 neurons/day
**Paid Tier**: $0.011 per 1,000 neurons
---
## References
- [Official Models Catalog](https://developers.cloudflare.com/workers-ai/models/)
- [Rate Limits](https://developers.cloudflare.com/workers-ai/platform/limits/)
- [Pricing](https://developers.cloudflare.com/workers-ai/platform/pricing/)