---
name: llm-app-architecture
description: Automatically applies when building LLM applications. Ensures proper async patterns for LLM calls, streaming responses, token management, retry logic, and error handling.
category: ai-llm
---

# LLM Application Architecture Patterns

When building applications with LLM APIs (Claude, OpenAI, etc.), follow these patterns for reliable, efficient, and maintainable systems.

**Trigger Keywords**: LLM, AI application, model API, Claude, OpenAI, GPT, language model, LLM call, completion, chat completion, embeddings

**Agent Integration**: Used by `ml-system-architect`, `llm-app-engineer`, `agent-orchestrator-engineer`, `rag-architect`, `performance-and-cost-engineer-llm`

## ✅ Correct Pattern: Async LLM Calls

```python
import httpx
import anthropic


class LLMClient:
    """Async LLM client with proper error handling."""

    def __init__(self, api_key: str, timeout: int = 60):
        self.client = anthropic.AsyncAnthropic(
            api_key=api_key,
            timeout=httpx.Timeout(timeout, connect=5.0)
        )
        self.model = "claude-sonnet-4-20250514"

    async def complete(
        self,
        prompt: str,
        system: str | None = None,
        max_tokens: int = 1024,
        temperature: float = 1.0
    ) -> str:
        """
        Generate completion from LLM.

        Args:
            prompt: User message content
            system: Optional system prompt
            max_tokens: Maximum tokens to generate
            temperature: Sampling temperature (0-1)

        Returns:
            Generated text response

        Raises:
            LLMError: If API call fails
        """
        try:
            message = await self.client.messages.create(
                model=self.model,
                max_tokens=max_tokens,
                temperature=temperature,
                system=system if system else anthropic.NOT_GIVEN,
                messages=[{"role": "user", "content": prompt}]
            )
            return message.content[0].text

        except anthropic.APITimeoutError as e:
            raise LLMTimeoutError("LLM request timed out") from e
        except anthropic.APIConnectionError as e:
            raise LLMConnectionError("Failed to connect to LLM API") from e
        except anthropic.RateLimitError as e:
            raise LLMRateLimitError("Rate limit exceeded") from e
        except anthropic.APIStatusError as e:
            raise LLMError(f"LLM API error: {e.status_code}") from e


# Custom exceptions
class LLMError(Exception):
    """Base LLM error."""
    pass


class LLMTimeoutError(LLMError):
    """LLM request timeout."""
    pass


class LLMConnectionError(LLMError):
    """LLM connection error."""
    pass


class LLMRateLimitError(LLMError):
    """LLM rate limit exceeded."""
    pass
```
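
A minimal usage sketch for `LLMClient`, assuming the key is read from an `ANTHROPIC_API_KEY` environment variable and the client and exception classes above live in the same module:

```python
import asyncio
import os


async def main() -> None:
    client = LLMClient(api_key=os.environ["ANTHROPIC_API_KEY"])
    try:
        answer = await client.complete(
            prompt="List three failure modes of LLM API calls.",
            system="You are a concise technical assistant.",
            max_tokens=256,
            temperature=0.2,  # lower temperature for factual output
        )
        print(answer)
    except LLMRateLimitError:
        print("Rate limited - retry later or queue the request")
    except LLMError as exc:
        print(f"LLM call failed: {exc}")


if __name__ == "__main__":
    asyncio.run(main())
```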
## Streaming Responses

```python
from typing import AsyncIterator


# Method of LLMClient (shown standalone with `self` for brevity)
async def stream_completion(
    self,
    prompt: str,
    system: str | None = None,
    max_tokens: int = 1024
) -> AsyncIterator[str]:
    """
    Stream completion from LLM token by token.

    Yields:
        Individual text chunks as they arrive

    Usage:
        async for chunk in client.stream_completion("Hello"):
            print(chunk, end="", flush=True)
    """
    try:
        async with self.client.messages.stream(
            model=self.model,
            max_tokens=max_tokens,
            system=system if system else anthropic.NOT_GIVEN,
            messages=[{"role": "user", "content": prompt}]
        ) as stream:
            async for text in stream.text_stream:
                yield text

    except anthropic.APIError as e:
        raise LLMError(f"Streaming error: {str(e)}") from e


# Use in FastAPI endpoint
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()


@app.post("/stream")
async def stream_endpoint(prompt: str) -> StreamingResponse:
    """Stream LLM response to client."""
    # settings: application config object providing the API key
    client = LLMClient(api_key=settings.anthropic_api_key)

    async def generate():
        async for chunk in client.stream_completion(prompt):
            yield chunk

    return StreamingResponse(
        generate(),
        media_type="text/plain"
    )
```
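
For completeness, a hedged sketch of consuming the `/stream` endpoint from another service using `httpx`; the base URL and port are illustrative assumptions:

```python
import asyncio

import httpx


async def consume_stream(prompt: str) -> None:
    timeout = httpx.Timeout(60.0, connect=5.0)
    async with httpx.AsyncClient(base_url="http://localhost:8000", timeout=timeout) as http:
        # The endpoint above reads `prompt` as a query parameter.
        async with http.stream("POST", "/stream", params={"prompt": prompt}) as response:
            response.raise_for_status()
            async for chunk in response.aiter_text():
                print(chunk, end="", flush=True)


if __name__ == "__main__":
    asyncio.run(consume_stream("Explain streaming LLM responses in two sentences."))
```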
## Token Counting and Management

```python
from typing import Dict, List

from anthropic import Anthropic, NOT_GIVEN


class TokenCounter:
    """Token counting utilities for LLM calls."""

    def __init__(self, model: str = "claude-sonnet-4-20250514"):
        self.client = Anthropic()
        self.model = model

    def count_tokens(self, text: str) -> int:
        """
        Count tokens in text via the token counting endpoint.

        Note: older SDK versions exposed client.count_tokens(text); current
        versions use client.messages.count_tokens(...) instead.

        Args:
            text: Input text to count

        Returns:
            Number of tokens
        """
        result = self.client.messages.count_tokens(
            model=self.model,
            messages=[{"role": "user", "content": text}]
        )
        return result.input_tokens

    def count_message_tokens(
        self,
        messages: List[Dict[str, str]],
        system: str | None = None
    ) -> int:
        """
        Count tokens for a full message exchange.

        Args:
            messages: List of message dicts with role and content
            system: Optional system prompt

        Returns:
            Total token count including message formatting overhead
        """
        result = self.client.messages.count_tokens(
            model=self.model,
            system=system if system else NOT_GIVEN,
            messages=messages
        )
        return result.input_tokens

    def estimate_cost(
        self,
        input_tokens: int,
        output_tokens: int,
        model: str = "claude-sonnet-4-20250514"
    ) -> float:
        """
        Estimate cost for LLM call.

        Args:
            input_tokens: Input token count
            output_tokens: Output token count
            model: Model name

        Returns:
            Estimated cost in USD
        """
        # Pricing as of 2025 (update as needed)
        pricing = {
            "claude-sonnet-4-20250514": {
                "input": 3.00 / 1_000_000,   # $3/MTok
                "output": 15.00 / 1_000_000  # $15/MTok
            },
            "claude-opus-4-20250514": {
                "input": 15.00 / 1_000_000,
                "output": 75.00 / 1_000_000
            },
            "claude-3-5-haiku-20241022": {
                "input": 0.80 / 1_000_000,
                "output": 4.00 / 1_000_000
            }
        }

        rates = pricing.get(model, pricing["claude-sonnet-4-20250514"])
        return (input_tokens * rates["input"]) + (output_tokens * rates["output"])
```

## Retry Logic with Exponential Backoff

```python
import asyncio
import random
from functools import wraps
from typing import Any, Callable


def retry_with_backoff(
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    exponential_base: float = 2.0,
    jitter: bool = True
):
    """
    Retry decorator with exponential backoff for LLM calls.

    Args:
        max_retries: Maximum number of retry attempts
        base_delay: Initial delay in seconds
        max_delay: Maximum delay in seconds
        exponential_base: Base for exponential calculation
        jitter: Add random jitter to prevent thundering herd
    """
    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        @wraps(func)
        async def wrapper(*args, **kwargs) -> Any:
            last_exception = None

            for attempt in range(max_retries + 1):
                try:
                    return await func(*args, **kwargs)

                except (LLMTimeoutError, LLMConnectionError, LLMRateLimitError) as e:
                    last_exception = e

                    if attempt == max_retries:
                        raise

                    # Calculate delay with exponential backoff
                    delay = min(
                        base_delay * (exponential_base ** attempt),
                        max_delay
                    )

                    # Add jitter
                    if jitter:
                        delay *= (0.5 + random.random() * 0.5)

                    await asyncio.sleep(delay)

            raise last_exception

        return wrapper
    return decorator


class RobustLLMClient(LLMClient):
    """LLM client with automatic retries."""

    @retry_with_backoff(max_retries=3, base_delay=1.0)
    async def complete(self, prompt: str, **kwargs) -> str:
        """Complete with automatic retries."""
        return await super().complete(prompt, **kwargs)
```

## Caching and Prompt Optimization

```python
import hashlib


class CachedLLMClient(LLMClient):
    """LLM client with response caching."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._cache: dict[str, str] = {}

    def _cache_key(self, prompt: str, system: str | None = None) -> str:
        """Generate cache key from prompt and system."""
        content = f"{system or ''}||{prompt}"
        return hashlib.sha256(content.encode()).hexdigest()

    async def complete(
        self,
        prompt: str,
        system: str | None = None,
        use_cache: bool = True,
        **kwargs
    ) -> str:
        """Complete with caching support."""
        if use_cache:
            cache_key = self._cache_key(prompt, system)
            if cache_key in self._cache:
                return self._cache[cache_key]

        response = await super().complete(prompt, system=system, **kwargs)

        if use_cache:
            self._cache[cache_key] = response

        return response

    def clear_cache(self):
        """Clear response cache."""
        self._cache.clear()


# Use Claude prompt caching for repeated system prompts
# Method of LLMClient (shown standalone with `self` for brevity)
async def complete_with_prompt_caching(
    self,
    prompt: str,
    system: str,
    max_tokens: int = 1024
) -> str:
    """
    Use Claude's prompt caching for repeated system prompts.

    Caches the system prompt on Claude's servers for 5 minutes,
    reducing cost for repeated calls with the same system prompt.
    """
    message = await self.client.messages.create(
        model=self.model,
        max_tokens=max_tokens,
        system=[
            {
                "type": "text",
                "text": system,
                "cache_control": {"type": "ephemeral"}  # Cache this
            }
        ],
        messages=[{"role": "user", "content": prompt}]
    )
    return message.content[0].text
```
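
The caching and retry wrappers compose via cooperative inheritance. A sketch, assuming the classes above are in the same module (the class and function names here are illustrative, not part of the pattern itself): the method resolution order routes `complete()` through the cache first, then the retry logic, then the base client.

```python
import os


class ProductionLLMClient(CachedLLMClient, RobustLLMClient):
    """Check the in-memory cache first; on a miss, call the API with retries.

    MRO: ProductionLLMClient -> CachedLLMClient -> RobustLLMClient -> LLMClient
    """
    pass


async def ask(prompt: str) -> str:
    client = ProductionLLMClient(api_key=os.environ["ANTHROPIC_API_KEY"])
    # First call hits the API (with retries); identical repeat calls hit the cache.
    return await client.complete(prompt, system="Answer briefly.")
```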
## Batch Processing

```python
import asyncio
from typing import List


# Method of LLMClient (shown standalone with `self` for brevity)
async def batch_complete(
    self,
    prompts: List[str],
    system: str | None = None,
    max_concurrent: int = 5
) -> List[str]:
    """
    Process multiple prompts concurrently with a concurrency limit.

    Args:
        prompts: List of prompts to process
        system: Optional system prompt
        max_concurrent: Maximum concurrent requests

    Returns:
        List of responses in same order as prompts
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def process_one(prompt: str) -> str:
        async with semaphore:
            return await self.complete(prompt, system=system)

    tasks = [process_one(p) for p in prompts]
    return await asyncio.gather(*tasks)


# Example usage
async def process_documents():
    """Process multiple documents with LLM."""
    client = LLMClient(api_key=settings.anthropic_api_key)

    documents = ["doc1 text", "doc2 text", "doc3 text"]

    # At most 5 requests in flight at a time
    results = await client.batch_complete(
        prompts=documents,
        system="Summarize this document in 2 sentences.",
        max_concurrent=5
    )

    return results
```
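
By default, `asyncio.gather` raises the first exception and the results of prompts that did succeed are lost. A sketch of an error-tolerant variant using `return_exceptions=True` (again shown standalone with `self`; the method name is illustrative):

```python
import asyncio
from typing import List


async def batch_complete_tolerant(
    self,
    prompts: List[str],
    system: str | None = None,
    max_concurrent: int = 5
) -> List[str | BaseException]:
    """Like batch_complete, but a failed prompt yields its exception
    in the result list instead of aborting the whole batch."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def process_one(prompt: str) -> str:
        async with semaphore:
            return await self.complete(prompt, system=system)

    # return_exceptions=True keeps per-prompt failures in the result list.
    results = await asyncio.gather(
        *(process_one(p) for p in prompts),
        return_exceptions=True
    )
    return list(results)
```

Callers can then log, retry, or drop the failed items individually while keeping the successful responses.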
## Observability and Logging

```python
import logging
import uuid
from datetime import datetime, timezone

logger = logging.getLogger(__name__)


class ObservableLLMClient(LLMClient):
    """LLM client with comprehensive logging."""

    async def complete(self, prompt: str, **kwargs) -> str:
        """Complete with observability."""
        request_id = str(uuid.uuid4())
        start_time = datetime.now(timezone.utc)

        # Log request (redact if needed)
        logger.info(
            "LLM request started",
            extra={
                "request_id": request_id,
                "model": self.model,
                "prompt_length": len(prompt),
                "max_tokens": kwargs.get("max_tokens", 1024),
                "temperature": kwargs.get("temperature", 1.0)
            }
        )

        try:
            response = await super().complete(prompt, **kwargs)

            # Count tokens (adds extra count-tokens calls; prefer
            # response.usage when the raw API response is available)
            counter = TokenCounter(model=self.model)
            input_tokens = counter.count_tokens(prompt)
            output_tokens = counter.count_tokens(response)
            cost = counter.estimate_cost(input_tokens, output_tokens, self.model)

            # Log success
            duration = (datetime.now(timezone.utc) - start_time).total_seconds()
            logger.info(
                "LLM request completed",
                extra={
                    "request_id": request_id,
                    "duration_seconds": duration,
                    "input_tokens": input_tokens,
                    "output_tokens": output_tokens,
                    "total_tokens": input_tokens + output_tokens,
                    "estimated_cost_usd": cost,
                    "response_length": len(response)
                }
            )

            return response

        except Exception as e:
            # Log error
            duration = (datetime.now(timezone.utc) - start_time).total_seconds()
            logger.error(
                "LLM request failed",
                extra={
                    "request_id": request_id,
                    "duration_seconds": duration,
                    "error_type": type(e).__name__,
                    "error_message": str(e)
                },
                exc_info=True
            )
            raise
```

## ❌ Anti-Patterns

```python
# ❌ Synchronous API calls in async code
def complete(prompt: str) -> str:
    # Should be async!
    response = anthropic.Anthropic().messages.create(...)
    return response.content[0].text

# ✅ Better: Use async client
async def complete(prompt: str) -> str:
    async_client = anthropic.AsyncAnthropic()
    response = await async_client.messages.create(...)
    return response.content[0].text

# ❌ No timeout
client = anthropic.AsyncAnthropic()  # No timeout!

# ✅ Better: Set reasonable timeout
client = anthropic.AsyncAnthropic(
    timeout=httpx.Timeout(60.0, connect=5.0)
)

# ❌ No error handling
async def complete(prompt: str) -> str:
    response = await client.messages.create(...)  # Can fail!
    return response.content[0].text

# ✅ Better: Handle specific errors
async def complete(prompt: str) -> str:
    try:
        response = await client.messages.create(...)
        return response.content[0].text
    except anthropic.RateLimitError:
        # Handle rate limit
        raise LLMRateLimitError("Rate limit exceeded")
    except anthropic.APITimeoutError:
        # Handle timeout
        raise LLMTimeoutError("Request timed out")

# ❌ No token tracking
async def complete(prompt: str) -> str:
    return await client.messages.create(...)  # No idea of cost!

# ✅ Better: Track tokens and cost
async def complete(prompt: str) -> str:
    response = await client.messages.create(...)
    logger.info(
        "LLM call",
        extra={
            "input_tokens": response.usage.input_tokens,
            "output_tokens": response.usage.output_tokens,
            "cost": estimate_cost(response.usage)
        }
    )
    return response.content[0].text

# ❌ Sequential processing
async def process_many(prompts: List[str]) -> List[str]:
    results = []
    for prompt in prompts:  # Sequential!
        result = await complete(prompt)
        results.append(result)
    return results

# ✅ Better: Concurrent processing with limits
async def process_many(prompts: List[str]) -> List[str]:
    semaphore = asyncio.Semaphore(5)  # Max 5 concurrent

    async def process_one(prompt):
        async with semaphore:
            return await complete(prompt)

    return await asyncio.gather(*[process_one(p) for p in prompts])
```

## Best Practices Checklist

- ✅ Use async/await for all LLM API calls
- ✅ Set reasonable timeouts (30-60 seconds)
- ✅ Implement retry logic with exponential backoff
- ✅ Handle specific API exceptions (rate limit, timeout, connection)
- ✅ Track token usage and estimated cost
- ✅ Log all LLM calls with request IDs
- ✅ Use streaming for long responses
- ✅ Implement prompt caching for repeated system prompts
- ✅ Process multiple requests concurrently with semaphores
- ✅ Redact sensitive data in logs
- ✅ Set max_tokens to prevent runaway costs
- ✅ Use appropriate temperature for the task (0 for deterministic, 1 for creative)

## Auto-Apply

When making LLM API calls:

1. Use async client (AsyncAnthropic, AsyncOpenAI)
2. Add timeout configuration
3. Implement retry logic for transient errors
4. Track tokens and cost
5. Log requests with structured logging
6. Use streaming for real-time responses
7. Handle rate limits gracefully

## Related Skills

- `async-await-checker` - For async/await patterns
- `structured-errors` - For error handling
- `observability-logging` - For logging and tracing
- `pydantic-models` - For request/response validation
- `fastapi-patterns` - For building LLM API endpoints