2025-11-29 18:16:46 +08:00

Phase 5: Production Optimization

Objective: Optimize for performance, cost, and reliability

1. Grammar Caching Strategy

The first request that uses a given schema compiles a grammar from it, which adds latency to that request. Subsequent requests reuse the cached grammar, which is kept for 24 hours after its last use.

Cache Invalidation Triggers:

  • Schema structure changes
  • Tool set changes (if using tools + JSON outputs together)
  • 24 hours of non-use
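
One way to notice a schema change before it shows up as a latency spike is to fingerprint the schema and compare it across deploys. The sketch below is a local approximation only: `schema_fingerprint` and the two sample schemas are hypothetical, and the hash changes on any edit to the dict, which may be stricter than what the service treats as a structural change.

```python
import hashlib
import json


def schema_fingerprint(schema: dict) -> str:
    """Return a short, stable hash of a JSON Schema dict."""
    # sort_keys makes the hash independent of dict key ordering
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Hypothetical schemas: v2 adds a "phone" property to v1
v1 = {
    "type": "object",
    "properties": {"name": {"type": "string"}},
    "required": ["name"],
}
v2 = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "phone": {"type": "string"}},
    "required": ["name"],
}

if schema_fingerprint(v1) != schema_fingerprint(v2):
    print("Schema changed: expect a one-time grammar recompile")
```

Logging the fingerprint next to each request makes it easier to correlate a slow first request with a schema rollout.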

Best Practices:

# ✅ Good: Finalize schema before production
CONTACT_SCHEMA = ContactInfo  # Reuse same schema

# ❌ Bad: Dynamic schema generation
def get_schema(include_phone: bool):  # Different schemas = cache misses
    if include_phone:
        class Contact(BaseModel):
            phone: str
            ...
    ...
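
If a few variants are genuinely needed, one option is to define them once at import time and select among them, so each variant is compiled once rather than rebuilt per call. This is a minimal sketch under that assumption; the schemas are written as plain JSON Schema dicts for illustration, and `CONTACT_BASIC`, `CONTACT_WITH_PHONE`, and `SCHEMAS` are hypothetical names.

```python
# Hypothetical schemas, defined once at module level
CONTACT_BASIC = {
    "type": "object",
    "properties": {"name": {"type": "string"}},
    "required": ["name"],
}
CONTACT_WITH_PHONE = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "phone": {"type": "string"}},
    "required": ["name", "phone"],
}

# The keys are the only supported variants; nothing is generated at call time
SCHEMAS = {False: CONTACT_BASIC, True: CONTACT_WITH_PHONE}


def get_schema(include_phone: bool) -> dict:
    """Pick one of the pre-built schemas instead of generating a new one."""
    return SCHEMAS[include_phone]
```

Every call with the same argument returns the same object, so the set of schemas sent in production stays small and fixed.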

2. Token Cost Management

Structured outputs increase the input token count, because the expected output format is passed to the model through the system prompt:

# Monitor token usage
response = client.beta.messages.parse(...)
print(f"Input tokens: {response.usage.input_tokens}")
print(f"Output tokens: {response.usage.output_tokens}")

# Optimize descriptions for token efficiency
# ✅ Good: Concise but clear
name: str = Field(description="Full name")

# ❌ Excessive: Too verbose
name: str = Field(description="The complete full name of the person including first name, middle name if available, and last name")
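
To find verbose descriptions across a whole schema, a small walker over the JSON Schema dict can help. The sketch below is illustrative: `find_long_descriptions`, the 40-character limit, and the sample schema are assumptions, and it only follows `properties`, `items`, and `$defs`. For a Pydantic v2 model, the dict would typically come from `model_json_schema()`.

```python
def find_long_descriptions(schema: dict, max_chars: int = 40, path: str = "") -> list:
    """Return (path, length) for every description longer than max_chars."""
    found = []
    description = schema.get("description")
    if isinstance(description, str) and len(description) > max_chars:
        found.append((path or "<root>", len(description)))
    # Recurse into object properties and shared definitions
    for key in ("properties", "$defs"):
        for name, sub in schema.get(key, {}).items():
            if isinstance(sub, dict):
                prefix = f"{key}.{name}" if key == "$defs" else name
                child = f"{path}.{prefix}" if path else prefix
                found.extend(find_long_descriptions(sub, max_chars, child))
    # Recurse into array item schemas
    items = schema.get("items")
    if isinstance(items, dict):
        found.extend(find_long_descriptions(items, max_chars, f"{path}[]"))
    return found


# Hypothetical schema reusing the two descriptions from the example above
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string", "description": "Full name"},
        "verbose_name": {
            "type": "string",
            "description": "The complete full name of the person including "
                           "first name, middle name if available, and last name",
        },
    },
}

for field_path, length in find_long_descriptions(schema):
    print(f"{field_path}: description is {length} characters")
```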

3. Monitoring

import logging
import time
from dataclasses import dataclass

logger = logging.getLogger(__name__)

@dataclass
class StructuredOutputMetrics:
    latency_ms: float
    input_tokens: int
    output_tokens: int
    cache_hit: bool  # Inferred from latency, not reported by the API
    stop_reason: str

def track_metrics(response, start_time: float) -> StructuredOutputMetrics:
    # start_time must come from time.perf_counter(), taken just before the request
    latency = (time.perf_counter() - start_time) * 1000

    return StructuredOutputMetrics(
        latency_ms=latency,
        input_tokens=response.usage.input_tokens,
        output_tokens=response.usage.output_tokens,
        cache_hit=latency < 500,  # Heuristic: fast = cache hit; tune the threshold for your workload
        stop_reason=response.stop_reason,
    )

# Track in production
start_time = time.perf_counter()
response = client.beta.messages.parse(...)
metrics = track_metrics(response, start_time)
if metrics.latency_ms > 1000:
    logger.warning(f"Slow structured output: {metrics.latency_ms:.0f}ms")
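
Per-request warnings are easier to act on when paired with an aggregate view. The following sketch summarizes a batch of recorded latencies; `summarize_latencies` is a hypothetical helper, the 500 ms threshold mirrors the heuristic above, and the sample values are made up for illustration.

```python
import math


def summarize_latencies(latencies_ms: list, cache_hit_threshold_ms: float = 500) -> dict:
    """Summarize latencies: count, p95 (nearest rank), estimated cache-hit rate."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # Nearest-rank percentile: smallest value with at least 95% of samples at or below it
    rank = math.ceil(0.95 * len(ordered))
    # Same heuristic as above: a fast response is assumed to be a cache hit
    hits = sum(1 for ms in ordered if ms < cache_hit_threshold_ms)
    return {
        "count": len(ordered),
        "p95_ms": ordered[rank - 1],
        "estimated_cache_hit_rate": hits / len(ordered),
    }


# Made-up sample: four fast requests and one slow first request
samples = [120.0, 140.0, 135.0, 1800.0, 150.0]
summary = summarize_latencies(samples)
print(summary)
```

Because the cache-hit figure is inferred from latency alone, it is best read as a trend indicator and not as an exact count.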

Output

Production-optimized implementation with caching and monitoring.