# Phase 5: Production Optimization
**Objective**: Optimize for performance, cost, and reliability
## 1. Grammar Caching Strategy
The first request compiles a grammar from your schema, which adds extra latency. Subsequent requests reuse the cached grammar (24-hour TTL).
**Cache Invalidation Triggers:**
- Schema structure changes (a fingerprint check for this is sketched after this list)
- Tool set changes (if using tools + JSON outputs together)
- 24 hours of non-use
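
To make the first trigger visible in practice, one option is to fingerprint the generated JSON Schema at startup and log it, so a post-deploy cache-miss latency spike can be traced to an intentional schema change. This is a minimal sketch, assuming Pydantic v2 models like the `ContactInfo` schema used in the examples below; the fingerprint is just a convention for tracking schema drift, not part of the API.

```python
import hashlib
import json

from pydantic import BaseModel, Field


class ContactInfo(BaseModel):  # Stand-in for your production schema
    name: str = Field(description="Full name")
    email: str = Field(description="Email address")


def schema_fingerprint(model: type[BaseModel]) -> str:
    """Stable hash of the JSON Schema the grammar is compiled from."""
    canonical = json.dumps(model.model_json_schema(), sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]


# Log this at startup; if it changed since the last deploy, expect one
# slow (cache-miss) structured-output request per schema after rollout.
print(f"ContactInfo schema fingerprint: {schema_fingerprint(ContactInfo)}")
```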
**Best Practices:**
```python
# ✅ Good: Finalize schema before production
CONTACT_SCHEMA = ContactInfo  # Reuse same schema


# ❌ Bad: Dynamic schema generation
def get_schema(include_phone: bool):  # Different schemas = cache misses
    if include_phone:
        class Contact(BaseModel):
            phone: str
            ...
    ...
```
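
Because only the first request per schema pays the grammar-compilation cost, it can also help to warm the cache at service startup (and after any deploy that changes a schema) rather than letting a user request absorb the latency. The sketch below reuses `CONTACT_SCHEMA` from above; the model name and the `output_format` keyword are assumptions about the structured-outputs beta, so match them to the call signature you already use.

```python
import anthropic

client = anthropic.Anthropic()


def warm_structured_output_cache() -> None:
    """Issue one throwaway request per production schema at startup so the
    compiled grammar is cached before real traffic arrives."""
    client.beta.messages.parse(
        model="claude-sonnet-4-5",      # Assumption: use your production model here
        max_tokens=256,
        messages=[{"role": "user", "content": "Warm-up request; output can be discarded."}],
        output_format=CONTACT_SCHEMA,   # Assumption: keyword may differ by SDK version
    )


# Call once from your service's startup hook, per production schema.
warm_structured_output_cache()
```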
## 2. Token Cost Management
Structured outputs add tokens via the system prompt, so monitor usage:
```python
# Monitor token usage
response = client.beta.messages.parse(...)
print(f"Input tokens: {response.usage.input_tokens}")
print(f"Output tokens: {response.usage.output_tokens}")

# Optimize descriptions for token efficiency
# ✅ Good: Concise but clear
name: str = Field(description="Full name")

# ❌ Excessive: Too verbose
name: str = Field(description="The complete full name of the person including first name, middle name if available, and last name")
```
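
Once token counts are being logged, a rough per-request cost estimate makes the overhead of schema injection easier to track. A minimal sketch follows; the per-million-token prices are placeholders, not real pricing, so substitute the current rates for the model you run.

```python
# Placeholder prices in USD per million tokens; substitute your model's actual rates.
INPUT_PRICE_PER_MTOK = 3.00
OUTPUT_PRICE_PER_MTOK = 15.00


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Rough per-request cost computed from response.usage token counts."""
    return (
        input_tokens * INPUT_PRICE_PER_MTOK
        + output_tokens * OUTPUT_PRICE_PER_MTOK
    ) / 1_000_000


cost = estimate_cost_usd(response.usage.input_tokens, response.usage.output_tokens)
print(f"Estimated request cost: ${cost:.6f}")
```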
## 3. Monitoring
```python
import logging
import time
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass
class StructuredOutputMetrics:
    latency_ms: float
    input_tokens: int
    output_tokens: int
    cache_hit: bool  # Inferred from latency
    stop_reason: str


def track_metrics(response, start_time) -> StructuredOutputMetrics:
    latency = (time.time() - start_time) * 1000

    return StructuredOutputMetrics(
        latency_ms=latency,
        input_tokens=response.usage.input_tokens,
        output_tokens=response.usage.output_tokens,
        cache_hit=latency < 500,  # Heuristic: a fast response implies a grammar-cache hit
        stop_reason=response.stop_reason,
    )


# Track in production
metrics = track_metrics(response, start_time)
if metrics.latency_ms > 1000:
    logger.warning(f"Slow structured output: {metrics.latency_ms}ms")
```
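
Per-request metrics become more useful once aggregated. Below is a minimal sketch of a rolling window that surfaces cache-hit rate and p95 latency on top of the `StructuredOutputMetrics` dataclass above; the window size and alert threshold are arbitrary values chosen to illustrate the pattern.

```python
from collections import deque
from statistics import quantiles


class MetricsWindow:
    """Keep the last N requests and expose aggregate signals for dashboards or alerts."""

    def __init__(self, size: int = 200):  # Window size is an arbitrary choice
        self._window: deque[StructuredOutputMetrics] = deque(maxlen=size)

    def record(self, metrics: StructuredOutputMetrics) -> None:
        self._window.append(metrics)

    @property
    def cache_hit_rate(self) -> float:
        if not self._window:
            return 0.0
        return sum(m.cache_hit for m in self._window) / len(self._window)

    @property
    def p95_latency_ms(self) -> float:
        if len(self._window) < 20:  # Too few samples for a meaningful percentile
            return 0.0
        return quantiles([m.latency_ms for m in self._window], n=20)[-1]


# Usage alongside track_metrics()
window = MetricsWindow()
window.record(track_metrics(response, start_time))
if window.cache_hit_rate < 0.5:  # Arbitrary alert threshold
    logger.warning(f"Grammar cache hit rate low: {window.cache_hit_rate:.0%}")
```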
## Output
Production-optimized implementation with caching and monitoring.