Initial commit

This commit is contained in:
Zhongwei Li
2025-11-30 08:24:38 +08:00
commit b41966ed51
12 changed files with 3508 additions and 0 deletions

View File

@@ -0,0 +1,524 @@
# Cloudflare Workers AI - Best Practices
Production-tested patterns for building reliable, cost-effective AI applications with Workers AI.
---
## Table of Contents
1. [Streaming Best Practices](#streaming-best-practices)
2. [Error Handling](#error-handling)
3. [Cost Optimization](#cost-optimization)
4. [Performance Optimization](#performance-optimization)
5. [Security](#security)
6. [Monitoring & Observability](#monitoring--observability)
7. [Production Checklist](#production-checklist)
---
## Streaming Best Practices
### Why Streaming is Essential
**❌ Without streaming:**
- Buffers entire response in memory
- Higher latency (wait for complete response)
- Risk of Worker timeout (30s default)
- Poor user experience for long content
**✅ With streaming:**
- Immediate first token
- Lower memory usage
- Better UX (progressive rendering)
- No timeout issues
### Implementation
```typescript
// Always use stream: true for text generation
const stream = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
  messages: [{ role: 'user', content: prompt }],
  stream: true, // CRITICAL
});

return new Response(stream, {
  headers: { 'content-type': 'text/event-stream' },
});
```
### Client-Side Handling
```typescript
const response = await fetch('/api/chat', {
  method: 'POST',
  body: JSON.stringify({ prompt }),
});

const reader = response.body!.getReader(); // body is non-null on a successful streamed response
const decoder = new TextDecoder();

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  const chunk = decoder.decode(value, { stream: true }); // stream: true handles multi-byte chars split across reads
  // Update UI with chunk
}
```
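Workers AI streams arrive as server-sent events (`data: {...}` lines, terminated by `data: [DONE]`), so the raw chunks above still need parsing before rendering. A minimal sketch of that step, assuming the Worker forwards the event stream unchanged; `extractTokens` is an illustrative helper:
```typescript
// Parse SSE lines out of each decoded chunk; a chunk may contain several
// events or end mid-line, so buffer across reads.
let buffer = '';

function extractTokens(chunk: string): string[] {
  buffer += chunk;
  const lines = buffer.split('\n');
  buffer = lines.pop() ?? ''; // keep any incomplete trailing line for the next chunk
  const tokens: string[] = [];
  for (const line of lines) {
    if (!line.startsWith('data: ')) continue;
    const payload = line.slice('data: '.length);
    if (payload === '[DONE]') continue; // end-of-stream sentinel
    try {
      tokens.push(JSON.parse(payload).response ?? '');
    } catch {
      // ignore malformed events
    }
  }
  return tokens;
}
```
Call `extractTokens(decoder.decode(value, { stream: true }))` inside the read loop above and append the returned tokens to the UI.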
---
## Error Handling
### 1. Rate Limit Errors (429)
**Pattern: Exponential Backoff**
```typescript
async function runWithRetry(
  ai: Ai,
  model: string,
  inputs: any,
  maxRetries = 3
): Promise<any> {
  let delay = 1000;
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await ai.run(model, inputs);
    } catch (error) {
      const message = (error as Error).message.toLowerCase();
      if (message.includes('429') || message.includes('rate limit')) {
        if (i < maxRetries - 1) {
          await new Promise((resolve) => setTimeout(resolve, delay));
          delay *= 2; // Exponential backoff: 1s, 2s, 4s
          continue;
        }
      }
      throw error;
    }
  }
  throw new Error('Retry limit exceeded'); // unreachable, but keeps all code paths explicit
}
```
### 2. Model Unavailable
**Pattern: Fallback Models**
```typescript
const models = [
  '@cf/meta/llama-3.1-8b-instruct', // Primary
  '@cf/meta/llama-3.2-1b-instruct', // Fallback (faster)
  '@cf/qwen/qwen1.5-7b-chat-awq',   // Fallback (alternative)
];

async function runWithFallback(ai: Ai, inputs: any): Promise<any> {
  for (const model of models) {
    try {
      return await ai.run(model, inputs);
    } catch (error) {
      const message = (error as Error).message.toLowerCase();
      if (!message.includes('unavailable')) throw error;
      // Try next model
    }
  }
  throw new Error('All models unavailable');
}
```
### 3. Token Limit Exceeded
**Pattern: Input Validation**
```typescript
function estimateTokens(text: string): number {
  // Rough heuristic: ~4 characters per token for English text
  return Math.ceil(text.length / 4);
}

function validateInput(text: string, maxTokens = 2048): void {
  const tokens = estimateTokens(text);
  if (tokens > maxTokens) {
    throw new Error(`Input too long: ${tokens} tokens (max: ${maxTokens})`);
  }
}

// Usage
try {
  validateInput(userInput);
  const response = await env.AI.run(model, { prompt: userInput });
} catch (error) {
  return c.json({ error: (error as Error).message }, 400);
}
```
---
## Cost Optimization
### 1. Use AI Gateway for Caching
**Without AI Gateway:**
- Same prompt = new inference = cost
**With AI Gateway:**
- Same prompt = cached response = free
```typescript
const response = await env.AI.run(
  '@cf/meta/llama-3.1-8b-instruct',
  { messages },
  { gateway: { id: 'my-gateway' } } // Enable caching
);
```
**Savings**: 50-90% for repeated queries
### 2. Choose the Right Model
**Cost Comparison** (per 1M output tokens):
| Model | Cost | Use Case |
|-------|------|----------|
| Llama 3.2 1B | $0.201 | Simple tasks, high volume |
| Llama 3.1 8B | $0.606 | General purpose |
| Qwen 1.5 14B | $1.20+ | Complex reasoning |
**Strategy**: Use the smallest model that meets your quality requirements (a routing sketch follows).
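One way to apply this is a small router keyed by task complexity. A minimal sketch, where `TaskComplexity`, its labels, and the model mapping are illustrative assumptions rather than an official API:
```typescript
type TaskComplexity = 'simple' | 'general' | 'complex';

// Hypothetical mapping: route each task to the cheapest adequate model.
const MODEL_BY_COMPLEXITY: Record<TaskComplexity, string> = {
  simple: '@cf/meta/llama-3.2-1b-instruct',  // $0.201 / 1M output tokens
  general: '@cf/meta/llama-3.1-8b-instruct', // $0.606 / 1M output tokens
  complex: '@cf/qwen/qwen1.5-14b-chat-awq',  // $1.20+ / 1M output tokens
};

async function runForTask(
  ai: Ai,
  complexity: TaskComplexity,
  messages: { role: string; content: string }[]
): Promise<any> {
  return ai.run(MODEL_BY_COMPLEXITY[complexity], { messages });
}
```
How you classify a task (prompt length, user tier, a cheap classifier model) is application-specific; the point is that the escalation decision lives in one place.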
### 3. Limit Output Length
```typescript
const response = await env.AI.run(model, {
  messages,
  max_tokens: 256, // Limit output (default: varies)
});
```
### 4. Batch Embeddings
```typescript
// ❌ Bad: 100 separate requests
for (const text of texts) {
  await env.AI.run('@cf/baai/bge-base-en-v1.5', { text: [text] });
}

// ✅ Good: 1 batch request
await env.AI.run('@cf/baai/bge-base-en-v1.5', {
  text: texts, // Up to 100 texts per request
});
```
### 5. Monitor Neurons Usage
```typescript
app.use('*', async (c, next) => {
  const start = Date.now();
  await next();
  console.log({
    path: c.req.path,
    duration: Date.now() - start,
    logId: c.env.AI.aiGatewayLogId, // Check dashboard for neurons
  });
});
```
---
## Performance Optimization
### 1. Use Faster Models When Appropriate
**Speed Ranking** (fastest to slowest):
1. `@cf/qwen/qwen1.5-0.5b-chat` (1500/min)
2. `@cf/meta/llama-3.2-1b-instruct` (300/min)
3. `@cf/tinyllama/tinyllama-1.1b-chat-v1.0` (720/min)
4. `@hf/thebloke/mistral-7b-instruct-v0.1-awq` (400/min)
5. `@cf/meta/llama-3.1-8b-instruct` (300/min)
6. `@cf/qwen/qwen1.5-14b-chat-awq` (150/min)
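Published rate limits say nothing about actual latency, which varies by region and load, so it can be worth timing candidate models against a representative prompt before choosing. A rough benchmarking sketch (the model list and prompt are placeholders):
```typescript
// Time each candidate model on the same prompt and log wall-clock latency.
async function benchmark(ai: Ai, models: string[], prompt: string): Promise<void> {
  for (const model of models) {
    const start = Date.now();
    await ai.run(model, { prompt, max_tokens: 64 }); // cap output so runs are comparable
    console.log(`${model}: ${Date.now() - start}ms`);
  }
}

await benchmark(
  env.AI,
  ['@cf/qwen/qwen1.5-0.5b-chat', '@cf/meta/llama-3.2-1b-instruct'],
  'Summarize: Workers AI runs models on Cloudflare\'s network.'
);
```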
### 2. Parallel Requests
```typescript
// Process multiple tasks in parallel
const [summary, keywords, sentiment] = await Promise.all([
  env.AI.run(model, { prompt: `Summarize: ${text}` }),
  env.AI.run(model, { prompt: `Extract keywords: ${text}` }),
  env.AI.run(model, { prompt: `Sentiment: ${text}` }),
]);
```
### 3. Edge Caching for Static Prompts
```typescript
// Cache AI responses in KV, keyed by a hash of the prompt
async function hashPrompt(text: string): Promise<string> {
  const digest = await crypto.subtle.digest('SHA-256', new TextEncoder().encode(text));
  return [...new Uint8Array(digest)].map((b) => b.toString(16).padStart(2, '0')).join('');
}

const cacheKey = `ai:${await hashPrompt(prompt)}`;
let response = await env.CACHE.get(cacheKey);

if (!response) {
  const result = await env.AI.run(model, { prompt });
  response = result.response;
  await env.CACHE.put(cacheKey, response, { expirationTtl: 3600 }); // 1 hour
}
```
---
## Security
### 1. Never Expose API Keys
**❌ Bad:**
```typescript
const openai = new OpenAI({
  apiKey: 'sk-1234...', // Hardcoded!
});
```
**✅ Good:**
```typescript
const openai = new OpenAI({
  apiKey: env.OPENAI_API_KEY, // Environment variable
});
```
### 2. Input Sanitization
```typescript
function sanitizeInput(text: string): string {
  // Minimal first-pass filter for prompt-injection markers;
  // combine with system-prompt hardening and output review
  return text
    .replace(/\{system\}/gi, '')
    .replace(/\{assistant\}/gi, '')
    .trim();
}

const prompt = sanitizeInput(userInput);
```
### 3. Rate Limiting Per User
```typescript
import { RateLimiter } from '@/lib/rate-limiter';

const limiter = new RateLimiter({
  requests: 10,
  window: 60, // 10 requests per minute
});

app.post('/chat', async (c) => {
  const userId = c.req.header('x-user-id') ?? 'anonymous'; // fall back when the header is missing
  if (!(await limiter.check(userId))) {
    return c.json({ error: 'Rate limit exceeded' }, 429);
  }
  // Process request...
});
```
### 4. Content Filtering
```typescript
// NOTE: these patterns are deliberately crude (e.g. /hack/i also matches
// "hackathon"); tune the list for your application
const BLOCKED_PATTERNS = [
  /generate.*exploit/i,
  /create.*malware/i,
  /hack/i,
];

function isSafePrompt(prompt: string): boolean {
  return !BLOCKED_PATTERNS.some((pattern) => pattern.test(prompt));
}
```
---
## Monitoring & Observability
### 1. Structured Logging
```typescript
interface AILog {
  timestamp: string;
  model: string;
  duration: number;
  success: boolean;
  error?: string;
  logId?: string;
}

async function logAIRequest(log: AILog): Promise<void> {
  console.log(JSON.stringify(log));
  // Or send to logging service (Datadog, Sentry, etc.)
}
```
### 2. Error Tracking
```typescript
app.onError((err, c) => {
  console.error({
    error: err.message,
    stack: err.stack,
    path: c.req.path,
    timestamp: new Date().toISOString(),
  });
  return c.json({ error: 'Internal server error' }, 500);
});
```
### 3. Performance Metrics
```typescript
// Note: module-scope counters are per-isolate and reset when the isolate is recycled
const metrics = {
  requests: 0,
  errors: 0,
  totalDuration: 0,
};

app.use('*', async (c, next) => {
  metrics.requests++;
  const start = Date.now();
  try {
    await next();
  } catch (error) {
    metrics.errors++;
    throw error;
  } finally {
    metrics.totalDuration += Date.now() - start;
  }
});

app.get('/metrics', (c) => {
  return c.json({
    ...metrics,
    avgDuration: metrics.totalDuration / metrics.requests,
    errorRate: (metrics.errors / metrics.requests) * 100,
  });
});
```
---
## Production Checklist
### Before Deploying
- [ ] **Streaming enabled** for all text generation endpoints
- [ ] **AI Gateway configured** for cost tracking and caching
- [ ] **Error handling** with retry logic for rate limits
- [ ] **Input validation** to prevent token limit errors
- [ ] **Rate limiting** implemented per user
- [ ] **Monitoring** and logging configured
- [ ] **Model selection** optimized for cost/quality balance
- [ ] **Fallback models** configured for high availability
- [ ] **Security** review completed (input sanitization, content filtering)
- [ ] **Load testing** completed with expected traffic
- [ ] **Cost estimation** based on expected usage
- [ ] **Documentation** for API endpoints and error codes
### Cost Planning
**Estimate your costs:**
1. Expected requests/day: _____
2. Avg tokens per request (input + output): _____
3. Model neurons cost: _____
4. Daily neurons = requests × (avg tokens ÷ 1,000) × neurons per 1K tokens
5. Daily cost = (daily neurons − 10,000 free) ÷ 1,000 × $0.011
**Example:**
- 10,000 requests/day
- 500 tokens/request (avg)
- Llama 3.1 8B: ~50 neurons/1K tokens
- Daily neurons: 10,000 × (500 ÷ 1,000) × 50 = 250,000
- Daily cost: (250,000 - 10,000) / 1,000 × $0.011 = **$2.64**
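The same arithmetic as a small helper keeps estimates consistent as the inputs change. A sketch mirroring the worksheet above; neurons-per-1K-tokens is model-specific, so the value passed in the usage line is the same ~50 assumed for Llama 3.1 8B:
```typescript
const FREE_NEURONS_PER_DAY = 10_000;
const USD_PER_1K_NEURONS = 0.011;

// Estimate daily cost from the worksheet inputs above.
function estimateDailyCost(
  requestsPerDay: number,
  avgTokensPerRequest: number,
  neuronsPer1kTokens: number // model-specific; ~50 assumed for Llama 3.1 8B
): number {
  const dailyNeurons =
    requestsPerDay * (avgTokensPerRequest / 1_000) * neuronsPer1kTokens;
  const billable = Math.max(0, dailyNeurons - FREE_NEURONS_PER_DAY);
  return (billable / 1_000) * USD_PER_1K_NEURONS;
}

// Reproduces the example: 10,000 req/day × 500 tokens × 50 neurons/1K → 2.64
console.log(estimateDailyCost(10_000, 500, 50).toFixed(2));
```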
### Performance Targets
- **Time to first token**: <500ms
- **Avg response time**: <2s (streaming)
- **Error rate**: <1%
- **Cache hit rate**: >50% (with AI Gateway)
---
## Common Patterns
### 1. RAG Pattern
```typescript
// 1. Generate query embedding
const embeddings = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
  text: [query],
});

// 2. Search Vectorize for the closest stored chunks
const results = await env.VECTORIZE.query(embeddings.data[0], {
  topK: 3,
  returnMetadata: true, // include stored metadata (e.g. the source text) in matches
});

// 3. Build context from the matched chunks
const context = results.matches.map((m) => m.metadata.text).join('\n\n');

// 4. Generate response with context
const stream = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
  messages: [
    { role: 'system', content: `Context:\n${context}` },
    { role: 'user', content: query },
  ],
  stream: true,
});
```
### 2. Multi-Model Consensus
```typescript
const models = [
  '@cf/meta/llama-3.1-8b-instruct',
  '@cf/qwen/qwen1.5-7b-chat-awq',
  '@hf/thebloke/mistral-7b-instruct-v0.1-awq',
];

const responses = await Promise.all(
  models.map((model) => env.AI.run(model, { prompt }))
);
// Combine or compare responses
```
### 3. Progressive Enhancement
```typescript
// Start with a fast model, upgrade if needed
let response = await env.AI.run('@cf/meta/llama-3.2-1b-instruct', {
  prompt,
});

// Check quality (a simple length heuristic here; coherence checks also work)
if (response.response.length < 50) {
  // Retry with a stronger model
  response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
    prompt,
  });
}
```
---
## References
- [Workers AI Documentation](https://developers.cloudflare.com/workers-ai/)
- [AI Gateway](https://developers.cloudflare.com/ai-gateway/)
- [Pricing](https://developers.cloudflare.com/workers-ai/platform/pricing/)
- [Limits](https://developers.cloudflare.com/workers-ai/platform/limits/)
- [Models Catalog](https://developers.cloudflare.com/workers-ai/models/)

View File

@@ -0,0 +1,245 @@
# Cloudflare Workers AI - Models Catalog
Complete catalog of Workers AI models organized by task type.
**Last Updated**: 2025-10-21
**Official Catalog**: https://developers.cloudflare.com/workers-ai/models/
---
## Text Generation (LLMs)
### Meta Llama Models
| Model ID | Size | Best For | Rate Limit |
|----------|------|----------|------------|
| `@cf/meta/llama-3.1-8b-instruct` | 8B | General purpose, balanced | 300/min |
| `@cf/meta/llama-3.1-8b-instruct-fast` | 8B | Faster inference | 300/min |
| `@cf/meta/llama-3.2-1b-instruct` | 1B | Ultra-fast, simple tasks | 300/min |
| `@cf/meta/llama-3.2-3b-instruct` | 3B | Fast, good quality | 300/min |
| `@cf/meta/llama-2-7b-chat-int8` | 7B | Legacy, reliable | 300/min |
| `@cf/meta/llama-2-13b-chat-awq` | 13B | Higher quality (slower) | 300/min |
### Qwen Models
| Model ID | Size | Best For | Rate Limit |
|----------|------|----------|------------|
| `@cf/qwen/qwen1.5-14b-chat-awq` | 14B | High quality, complex reasoning | 150/min |
| `@cf/qwen/qwen1.5-7b-chat-awq` | 7B | Balanced quality/speed | 300/min |
| `@cf/qwen/qwen1.5-1.8b-chat` | 1.8B | Fast, lightweight | 720/min |
| `@cf/qwen/qwen1.5-0.5b-chat` | 0.5B | Ultra-fast, ultra-lightweight | 1500/min |
### Mistral Models
| Model ID | Size | Best For | Rate Limit |
|----------|------|----------|------------|
| `@hf/thebloke/mistral-7b-instruct-v0.1-awq` | 7B | Fast, efficient | 400/min |
| `@hf/thebloke/openhermes-2.5-mistral-7b-awq` | 7B | Instruction following | 300/min |
### DeepSeek Models
| Model ID | Size | Best For | Rate Limit |
|----------|------|----------|------------|
| `@cf/deepseek-ai/deepseek-r1-distill-qwen-32b` | 32B | Coding, technical content | 300/min |
| `@cf/deepseek-ai/deepseek-coder-6.7b-instruct-awq` | 6.7B | Code generation | 300/min |
### Other Models
| Model ID | Size | Best For | Rate Limit |
|----------|------|----------|------------|
| `@cf/tinyllama/tinyllama-1.1b-chat-v1.0` | 1.1B | Extremely fast, limited capability | 720/min |
| `@cf/microsoft/phi-2` | 2.7B | Fast, efficient | 720/min |
| `@cf/google/gemma-2b-it-lora` | 2B | Instruction tuned | 300/min |
| `@cf/google/gemma-7b-it-lora` | 7B | Higher quality | 300/min |
---
## Text Embeddings
| Model ID | Dimensions | Best For | Rate Limit |
|----------|-----------|----------|------------|
| `@cf/baai/bge-base-en-v1.5` | 768 | General purpose RAG | 3000/min |
| `@cf/baai/bge-large-en-v1.5` | 1024 | High accuracy search | 1500/min |
| `@cf/baai/bge-small-en-v1.5` | 384 | Fast, low storage | 3000/min |
| `@cf/baai/bge-m3` | 1024 | Multilingual | 3000/min |
**Use Case**: RAG, semantic search, similarity detection, clustering
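A minimal embedding call for the general-purpose model; the vectors come back under `data`:
```typescript
// Embed a batch of texts with the 768-dimension general-purpose model.
const { data } = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
  text: ['How do I deploy a Worker?', 'What is Vectorize?'],
});
console.log(data.length, data[0].length); // 2 vectors, 768 dimensions each
```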
---
## Image Generation
| Model ID | Type | Best For | Rate Limit |
|----------|------|----------|------------|
| `@cf/black-forest-labs/flux-1-schnell` | Text-to-Image | Photorealistic, high quality | 720/min |
| `@cf/stabilityai/stable-diffusion-xl-base-1.0` | Text-to-Image | General purpose | 720/min |
| `@cf/lykon/dreamshaper-8-lcm` | Text-to-Image | Artistic, stylized | 720/min |
| `@cf/runwayml/stable-diffusion-v1-5-img2img` | Image-to-Image | Transform images | 1500/min |
| `@cf/runwayml/stable-diffusion-v1-5-inpainting` | Inpainting | Edit specific areas | 1500/min |
| `@cf/bytedance/stable-diffusion-xl-lightning` | Text-to-Image | Fast generation | 720/min |
**Output**: PNG images (~5 MB max)
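Response handling differs by model: the Stable Diffusion models return raw image bytes, while `flux-1-schnell` returns JSON with a base64-encoded image. A minimal sketch for the binary case (the prompt is a placeholder):
```typescript
// SDXL returns raw PNG bytes; pass them straight through to the client.
const image = await env.AI.run('@cf/stabilityai/stable-diffusion-xl-base-1.0', {
  prompt: 'A lighthouse at dusk, photorealistic',
});
return new Response(image, {
  headers: { 'content-type': 'image/png' },
});
```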
---
## Vision Models
| Model ID | Task | Best For | Rate Limit |
|----------|------|----------|------------|
| `@cf/meta/llama-3.2-11b-vision-instruct` | Image Understanding | Q&A, captioning, analysis | 720/min |
| `@cf/unum/uform-gen2-qwen-500m` | Image Captioning | Fast captions | 720/min |
**Input**: Base64-encoded images
---
## Translation
| Model ID | Languages | Rate Limit |
|----------|-----------|------------|
| `@cf/meta/m2m100-1.2b` | 100+ languages | 720/min |
**Supported Language Pairs**: https://developers.cloudflare.com/workers-ai/models/m2m100-1.2b/
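A minimal translation call; the language names follow the model's input schema:
```typescript
// Translate English to French with m2m100.
const result = await env.AI.run('@cf/meta/m2m100-1.2b', {
  text: 'Hello, how are you?',
  source_lang: 'english',
  target_lang: 'french',
});
console.log(result.translated_text);
```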
---
## Text Classification
| Model ID | Task | Rate Limit |
|----------|------|------------|
| `@cf/huggingface/distilbert-sst-2-int8` | Sentiment analysis | 2000/min |
| `@hf/thebloke/openhermes-2.5-mistral-7b-awq` | General classification | 300/min |
**Output**: Label + confidence score
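A minimal sentiment call; the response is a list of labels with confidence scores:
```typescript
// DistilBERT returns one entry per label with a confidence score.
const results = await env.AI.run('@cf/huggingface/distilbert-sst-2-int8', {
  text: 'This pizza is amazing!',
});
// e.g. [{ label: 'NEGATIVE', score: 0.0001 }, { label: 'POSITIVE', score: 0.9998 }]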
---
## Automatic Speech Recognition
| Model ID | Best For | Rate Limit |
|----------|----------|------------|
| `@cf/openai/whisper` | General transcription | 720/min |
| `@cf/openai/whisper-tiny-en` | English only, fast | 720/min |
**Input**: Audio files (MP3, WAV, etc.)
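A minimal transcription call; Whisper takes the audio bytes as a plain number array (the URL is a placeholder):
```typescript
// Fetch an audio file and pass its bytes to Whisper.
const res = await fetch('https://example.com/audio.mp3');
const blob = await res.arrayBuffer();
const transcript = await env.AI.run('@cf/openai/whisper', {
  audio: [...new Uint8Array(blob)],
});
console.log(transcript.text);
```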
---
## Object Detection
| Model ID | Task | Rate Limit |
|----------|------|------------|
| `@cf/facebook/detr-resnet-50` | Object detection | 3000/min |
**Output**: Bounding boxes + labels
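A minimal detection call; the response shape in the comment follows the model docs (the URL is a placeholder):
```typescript
// Pass image bytes; the model returns labeled bounding boxes with scores.
const imageRes = await fetch('https://example.com/photo.jpg');
const bytes = [...new Uint8Array(await imageRes.arrayBuffer())];
const detections = await env.AI.run('@cf/facebook/detr-resnet-50', {
  image: bytes,
});
// e.g. [{ label: 'cat', score: 0.99, box: { xmin, ymin, xmax, ymax } }]
```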
---
## Image Classification
| Model ID | Classes | Rate Limit |
|----------|---------|------------|
| `@cf/microsoft/resnet-50` | 1000 ImageNet classes | 3000/min |
**Output**: Top-5 predictions with probabilities
---
## Summarization
| Model ID | Best For | Rate Limit |
|----------|----------|------------|
| `@cf/facebook/bart-large-cnn` | News articles, documents | 1500/min |
---
## Image-to-Image (Legacy)
| Model ID | Type | Rate Limit |
|----------|------|------------|
| `@cf/stabilityai/stable-diffusion-v1-5-img2img` | Image-to-Image | 1500/min |
---
## Model Selection Guide
### For Text Generation
**Speed Priority:**
1. `@cf/qwen/qwen1.5-0.5b-chat` (1500/min)
2. `@cf/meta/llama-3.2-1b-instruct` (300/min)
3. `@cf/tinyllama/tinyllama-1.1b-chat-v1.0` (720/min)
**Quality Priority:**
1. `@cf/qwen/qwen1.5-14b-chat-awq` (150/min)
2. `@cf/deepseek-ai/deepseek-r1-distill-qwen-32b` (300/min)
3. `@cf/meta/llama-3.1-8b-instruct` (300/min)
**Balanced:**
1. `@cf/meta/llama-3.1-8b-instruct` (300/min)
2. `@hf/thebloke/mistral-7b-instruct-v0.1-awq` (400/min)
3. `@cf/qwen/qwen1.5-7b-chat-awq` (300/min)
### For Embeddings
**General Purpose RAG:**
- `@cf/baai/bge-base-en-v1.5` (768 dims, 3000/min)
**High Accuracy:**
- `@cf/baai/bge-large-en-v1.5` (1024 dims, 1500/min)
**Fast/Low Storage:**
- `@cf/baai/bge-small-en-v1.5` (384 dims, 3000/min)
### For Image Generation
**Best Quality:**
- `@cf/black-forest-labs/flux-1-schnell`
**General Purpose:**
- `@cf/stabilityai/stable-diffusion-xl-base-1.0`
**Artistic/Stylized:**
- `@cf/lykon/dreamshaper-8-lcm`
**Fast:**
- `@cf/bytedance/stable-diffusion-xl-lightning`
---
## Rate Limits Summary
| Task Type | Default Limit | Exceptions |
|-----------|---------------|------------|
| Text Generation | 300/min | 150-1500/min depending on model |
| Text Embeddings | 3000/min | 1500/min for bge-large |
| Image Generation | 720/min | 1500/min for img2img / inpainting |
| Vision Models | 720/min | — |
| Translation | 720/min | — |
| Classification | 2000/min | 300/min for openhermes |
| Speech Recognition | 720/min | — |
| Object Detection | 3000/min | — |
---
## Pricing (Neurons)
Pricing varies by model. Common examples:
| Model | Input (1M tokens) | Output (1M tokens) |
|-------|-------------------|-------------------|
| Llama 3.2 1B | $0.027 | $0.201 |
| Llama 3.1 8B | $0.088 | $0.606 |
| BGE-base embeddings | $0.005 | N/A |
| Flux image gen | ~$0.011/image | N/A |
**Free Tier**: 10,000 neurons/day
**Paid Tier**: $0.011 per 1,000 neurons
---
## References
- [Official Models Catalog](https://developers.cloudflare.com/workers-ai/models/)
- [Rate Limits](https://developers.cloudflare.com/workers-ai/platform/limits/)
- [Pricing](https://developers.cloudflare.com/workers-ai/platform/pricing/)