Cloudflare Workers AI - Best Practices
Production-tested patterns for building reliable, cost-effective AI applications with Workers AI.
Table of Contents
- Streaming Best Practices
- Error Handling
- Cost Optimization
- Performance Optimization
- Security
- Monitoring & Observability
- Production Checklist
- Common Patterns
Streaming Best Practices
Why Streaming is Essential
❌ Without streaming:
- Buffers entire response in memory
- Higher latency (wait for complete response)
- Risk of Worker timeout (30s default)
- Poor user experience for long content
✅ With streaming:
- Immediate first token
- Lower memory usage
- Better UX (progressive rendering)
- No timeout issues
Implementation
// Always use stream: true for text generation
const stream = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages: [{ role: 'user', content: prompt }],
stream: true, // CRITICAL
});
return new Response(stream, {
headers: { 'content-type': 'text/event-stream' },
});
Client-Side Handling
const response = await fetch('/api/chat', {
method: 'POST',
body: JSON.stringify({ prompt }),
});
const reader = response.body.getReader();
const decoder = new TextDecoder();
while (true) {
const { done, value } = await reader.read();
if (done) break;
const chunk = decoder.decode(value, { stream: true });
// Update UI with chunk (each chunk holds SSE "data:" lines; see the parsing sketch below)
}
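One way to parse those chunks, assuming the default Workers AI SSE format (each event's data is a JSON object with a response field, followed by a final data: [DONE] marker), is a small sketch like this; handleChunk and onToken are illustrative names, not part of any API:
// Hypothetical helper: extract tokens from SSE chunks produced by Workers AI streaming
let buffer = '';
function handleChunk(chunk: string, onToken: (token: string) => void): void {
  buffer += chunk;
  const lines = buffer.split('\n');
  buffer = lines.pop() ?? ''; // keep a possibly incomplete line for the next chunk
  for (const line of lines) {
    if (!line.startsWith('data: ')) continue;
    const data = line.slice('data: '.length).trim();
    if (data === '[DONE]') return;
    try {
      onToken(JSON.parse(data).response ?? '');
    } catch {
      // ignore partial or malformed events
    }
  }
}
In the read loop above, call handleChunk(chunk, (token) => { /* append token to the UI */ }) instead of rendering the raw chunk.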
Error Handling
1. Rate Limit Errors (429)
Pattern: Exponential Backoff
async function runWithRetry(
ai: Ai,
model: string,
inputs: any,
maxRetries = 3
): Promise<any> {
let delay = 1000;
for (let i = 0; i < maxRetries; i++) {
try {
return await ai.run(model, inputs);
} catch (error) {
const message = (error as Error).message.toLowerCase();
if (message.includes('429') || message.includes('rate limit')) {
if (i < maxRetries - 1) {
await new Promise((resolve) => setTimeout(resolve, delay));
delay *= 2; // Exponential backoff: 1s, 2s, 4s
continue;
}
}
throw error;
}
}
}
2. Model Unavailable
Pattern: Fallback Models
const models = [
'@cf/meta/llama-3.1-8b-instruct', // Primary
'@cf/meta/llama-3.2-1b-instruct', // Fallback (faster)
'@cf/qwen/qwen1.5-7b-chat-awq', // Fallback (alternative)
];
async function runWithFallback(ai: Ai, inputs: any): Promise<any> {
for (const model of models) {
try {
return await ai.run(model, inputs);
} catch (error) {
const message = (error as Error).message.toLowerCase();
if (!message.includes('unavailable')) throw error;
// Try next model
}
}
throw new Error('All models unavailable');
}
3. Token Limit Exceeded
Pattern: Input Validation
function estimateTokens(text: string): number {
return Math.ceil(text.length / 4);
}
function validateInput(text: string, maxTokens = 2048): void {
const tokens = estimateTokens(text);
if (tokens > maxTokens) {
throw new Error(
`Input too long: ${tokens} tokens (max: ${maxTokens})`
);
}
}
// Usage
try {
validateInput(userInput);
const response = await env.AI.run(model, { prompt: userInput });
} catch (error) {
return c.json({ error: (error as Error).message }, 400);
}
Cost Optimization
1. Use AI Gateway for Caching
Without AI Gateway:
- Same prompt = new inference = cost
With AI Gateway:
- Same prompt = cached response = free
const response = await env.AI.run(
'@cf/meta/llama-3.1-8b-instruct',
{ messages },
{ gateway: { id: 'my-gateway' } } // Enable caching
);
Savings: 50-90% for repeated queries
2. Choose the Right Model
Cost Comparison (per 1M output tokens):
| Model | Cost | Use Case |
|---|---|---|
| Llama 3.2 1B | $0.201 | Simple tasks, high volume |
| Llama 3.1 8B | $0.606 | General purpose |
| Qwen 1.5 14B | $1.20+ | Complex reasoning |
Strategy: Use smallest model that meets quality requirements
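As a rough illustration of that strategy (the task categories and model assignments below are assumptions, not official guidance):
// Illustrative mapping: smallest model that meets each task's quality bar
type Task = 'classification' | 'general' | 'reasoning';

const MODEL_BY_TASK: Record<Task, string> = {
  classification: '@cf/meta/llama-3.2-1b-instruct', // simple tasks, high volume
  general: '@cf/meta/llama-3.1-8b-instruct', // general purpose
  reasoning: '@cf/qwen/qwen1.5-14b-chat-awq', // complex reasoning
};

// Inside a handler: const response = await env.AI.run(MODEL_BY_TASK['general'], { messages });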
3. Limit Output Length
const response = await env.AI.run(model, {
messages,
max_tokens: 256, // Limit output (default: varies)
});
4. Batch Embeddings
// ❌ Bad: 100 separate requests
for (const text of texts) {
await env.AI.run('@cf/baai/bge-base-en-v1.5', { text: [text] });
}
// ✅ Good: 1 batch request
await env.AI.run('@cf/baai/bge-base-en-v1.5', {
text: texts, // Up to 100 texts per request
});
5. Monitor Neurons Usage
app.use('*', async (c, next) => {
const start = Date.now();
await next();
console.log({
path: c.req.path,
duration: Date.now() - start,
logId: c.env.AI.aiGatewayLogId, // Check dashboard for neurons
});
});
Performance Optimization
1. Use Faster Models When Appropriate
Speed Ranking (fastest to slowest):
1. @cf/qwen/qwen1.5-0.5b-chat (1500/min limit)
2. @cf/meta/llama-3.2-1b-instruct
3. @cf/tinyllama/tinyllama-1.1b-chat-v1.0 (720/min)
4. @hf/thebloke/mistral-7b-instruct-v0.1-awq (400/min)
5. @cf/meta/llama-3.1-8b-instruct
6. @cf/qwen/qwen1.5-14b-chat-awq (150/min)
2. Parallel Requests
// Process multiple tasks in parallel
const [summary, keywords, sentiment] = await Promise.all([
env.AI.run(model, { prompt: `Summarize: ${text}` }),
env.AI.run(model, { prompt: `Extract keywords: ${text}` }),
env.AI.run(model, { prompt: `Sentiment: ${text}` }),
]);
3. Edge Caching for Static Prompts
// Cache AI responses in KV
const cacheKey = `ai:${await hash(prompt)}`; // hash(): stable digest of the prompt, sketched below
let response = await env.CACHE.get(cacheKey);
if (!response) {
const result = await env.AI.run(model, { prompt });
response = result.response;
await env.CACHE.put(cacheKey, response, { expirationTtl: 3600 });
}
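The hash helper above is not something Workers provides; a minimal sketch using the Web Crypto API available in Workers:
// Hypothetical helper: stable SHA-256 digest of the prompt, hex-encoded for use as a KV key
async function hash(text: string): Promise<string> {
  const bytes = new TextEncoder().encode(text);
  const digest = await crypto.subtle.digest('SHA-256', bytes);
  return [...new Uint8Array(digest)].map((b) => b.toString(16).padStart(2, '0')).join('');
}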
Security
1. Never Expose API Keys
❌ Bad:
const openai = new OpenAI({
apiKey: 'sk-1234...', // Hardcoded!
});
✅ Good:
const openai = new OpenAI({
apiKey: env.OPENAI_API_KEY, // Environment variable
});
2. Input Sanitization
function sanitizeInput(text: string): string {
// Remove potential prompt injection attempts
return text
.replace(/\{system\}/gi, '')
.replace(/\{assistant\}/gi, '')
.trim();
}
const prompt = sanitizeInput(userInput);
3. Rate Limiting Per User
import { RateLimiter } from '@/lib/rate-limiter';
const limiter = new RateLimiter({
requests: 10,
window: 60, // 10 requests per minute
});
app.post('/chat', async (c) => {
const userId = c.req.header('x-user-id');
if (!await limiter.check(userId)) {
return c.json({ error: 'Rate limit exceeded' }, 429);
}
// Process request...
});
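@/lib/rate-limiter is an application module, not a platform API. A minimal in-memory sketch that matches the usage above (state is per-isolate, so a production limiter would typically be backed by KV or a Durable Object):
// Sliding-window limiter matching the usage above; state lives in the isolate's memory
class RateLimiter {
  private hits = new Map<string, number[]>();

  constructor(private opts: { requests: number; window: number }) {}

  async check(key: string | undefined): Promise<boolean> {
    if (!key) return false; // no identifier, reject
    const cutoff = Date.now() - this.opts.window * 1000;
    const recent = (this.hits.get(key) ?? []).filter((t) => t > cutoff);
    if (recent.length >= this.opts.requests) return false;
    recent.push(Date.now());
    this.hits.set(key, recent);
    return true;
  }
}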
4. Content Filtering
const BLOCKED_PATTERNS = [
/generate.*exploit/i,
/create.*malware/i,
/hack/i,
];
function isSafePrompt(prompt: string): boolean {
return !BLOCKED_PATTERNS.some((pattern) => pattern.test(prompt));
}
Monitoring & Observability
1. Structured Logging
interface AILog {
timestamp: string;
model: string;
duration: number;
success: boolean;
error?: string;
logId?: string;
}
async function logAIRequest(log: AILog): Promise<void> {
console.log(JSON.stringify(log));
// Or send to logging service (Datadog, Sentry, etc.)
}
2. Error Tracking
app.onError((err, c) => {
console.error({
error: err.message,
stack: err.stack,
path: c.req.path,
timestamp: new Date().toISOString(),
});
return c.json({ error: 'Internal server error' }, 500);
});
3. Performance Metrics
const metrics = {
requests: 0,
errors: 0,
totalDuration: 0,
};
app.use('*', async (c, next) => {
metrics.requests++;
const start = Date.now();
try {
await next();
} catch (error) {
metrics.errors++;
throw error;
} finally {
metrics.totalDuration += Date.now() - start;
}
});
app.get('/metrics', (c) => {
return c.json({
...metrics,
avgDuration: metrics.requests ? metrics.totalDuration / metrics.requests : 0,
errorRate: metrics.requests ? (metrics.errors / metrics.requests) * 100 : 0,
});
});
Production Checklist
Before Deploying
- Streaming enabled for all text generation endpoints
- AI Gateway configured for cost tracking and caching
- Error handling with retry logic for rate limits
- Input validation to prevent token limit errors
- Rate limiting implemented per user
- Monitoring and logging configured
- Model selection optimized for cost/quality balance
- Fallback models configured for high availability
- Security review completed (input sanitization, content filtering)
- Load testing completed with expected traffic
- Cost estimation based on expected usage
- Documentation for API endpoints and error codes
Cost Planning
Estimate your costs:
- Expected requests/day: _____
- Avg tokens per request (input + output): _____
- Model neurons cost: _____
- Daily neurons = requests × (tokens / 1,000) × neurons per 1K tokens
- Daily cost = (daily neurons - 10,000 free) / 1,000 × $0.011
Example:
- 10,000 requests/day
- 500 tokens/request (avg)
- Llama 3.1 8B: ~50 neurons/1K tokens
- Daily neurons: 10,000 × (500 / 1,000) × 50 = 250,000
- Daily cost: (250,000 - 10,000) / 1,000 × $0.011 = $2.64
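The same arithmetic as a helper, a sketch that mirrors the formula above (estimateDailyCostUSD is a hypothetical name; per-model neuron figures are estimates, not billing data):
// Mirrors the estimate above: neurons beyond the 10,000/day free allocation cost $0.011 per 1,000
function estimateDailyCostUSD(
  requestsPerDay: number,
  tokensPerRequest: number,
  neuronsPer1kTokens: number
): number {
  const dailyNeurons = requestsPerDay * (tokensPerRequest / 1000) * neuronsPer1kTokens;
  const billableNeurons = Math.max(0, dailyNeurons - 10_000);
  return (billableNeurons / 1000) * 0.011;
}

// estimateDailyCostUSD(10_000, 500, 50) → (250,000 - 10,000) / 1,000 × $0.011 ≈ $2.64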
Performance Targets
- Time to first token: <500ms
- Avg response time: <2s (streaming)
- Error rate: <1%
- Cache hit rate: >50% (with AI Gateway)
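A client-side sketch for checking the first two targets against the streaming endpoint above (illustrative measurement code, not an official tool):
// Measure time to first token (TTFT) and total streaming time from the browser
const start = performance.now();
const response = await fetch('/api/chat', {
  method: 'POST',
  body: JSON.stringify({ prompt }),
});
const reader = response.body!.getReader();
let ttftMs: number | null = null;
while (true) {
  const { done } = await reader.read();
  if (done) break;
  if (ttftMs === null) ttftMs = performance.now() - start;
}
console.log({ ttftMs, totalMs: performance.now() - start });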
Common Patterns
1. RAG Pattern
// 1. Generate query embedding
const embeddings = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
text: [query],
});
// 2. Search Vectorize
const results = await env.VECTORIZE.query(embeddings.data[0], {
topK: 3,
returnMetadata: 'all', // return stored metadata so matches[i].metadata.text is available
});
// 3. Build context
const context = results.matches.map((m) => m.metadata.text).join('\n\n');
// 4. Generate response with context
const stream = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages: [
{ role: 'system', content: `Context:\n${context}` },
{ role: 'user', content: query },
],
stream: true,
});
2. Multi-Model Consensus
const models = [
'@cf/meta/llama-3.1-8b-instruct',
'@cf/qwen/qwen1.5-7b-chat-awq',
'@hf/thebloke/mistral-7b-instruct-v0.1-awq',
];
const responses = await Promise.all(
models.map((model) => env.AI.run(model, { prompt }))
);
// Combine or compare responses
3. Progressive Enhancement
// Start with fast model, upgrade if needed
let response = await env.AI.run('@cf/meta/llama-3.2-1b-instruct', {
prompt,
});
// Check quality (length, coherence, etc.)
if (response.response.length < 50) {
// Retry with better model
response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
prompt,
});
}