# Performance Optimization for Simulations **When to use this skill**: When simulations run below target frame rate (typically 60 FPS for PC, 30 FPS for mobile), especially with large agent counts (100+ units), complex AI, physics calculations, or proximity queries. Critical for RTS games, crowd simulations, ecosystem models, traffic systems, and any scenario requiring 1000+ active entities. **What this skill provides**: Systematic methodology for performance optimization using profiling-driven decisions, spatial partitioning patterns, level-of-detail (LOD) systems, time-slicing, caching strategies, data-oriented design, and selective multithreading. Focuses on achieving 60 FPS at scale while maintaining gameplay quality. ## Core Concepts ### The Optimization Hierarchy (Critical Order) **ALWAYS optimize in this order** - each level provides 10-100× improvement: 1. **PROFILE FIRST** (0.5-1 hour investment) - Identify actual bottleneck with profiler - Measure baseline performance - Set target frame time budgets - **Never guess** - 80% of time is usually in 20% of code 2. **Algorithmic Optimizations** (10-100× improvement) - Fix O(n²) → O(n) or O(n log n) - Spatial partitioning for proximity queries - Replace brute-force with smart algorithms - **Biggest wins**, do these FIRST 3. **Level of Detail (LOD)** (2-10× improvement) - Reduce computation for distant/unimportant entities - Smooth transitions (no popping) - Priority-based update frequencies - Behavior LOD + visual LOD 4. **Time-Slicing** (2-5× improvement) - Spread work across multiple frames - Frame time budgets per system - Priority queues for important work - Amortized expensive operations 5. **Caching** (2-10× improvement) - Avoid redundant calculations - LRU eviction + TTL - Proper invalidation - Bounded memory usage 6. **Data-Oriented Design** (1.5-3× improvement) - Cache-friendly memory layouts - Struct of Arrays (SoA) vs Array of Structs (AoS) - Minimize pointer chasing - Batch operations on contiguous data 7. **Multithreading** (1.5-4× improvement) - ONLY if still needed after above - Job systems for data parallelism - Avoid locks and race conditions - Complexity cost is high **Example**: RTS with 1000 units at 10 FPS → 60 FPS - Profile: Vision checks are 80% of frame time - Spatial partitioning: O(n²) → O(n) = 50× faster → 40 FPS - LOD: Distant units update less = 1.5× faster → 60 FPS - Done in 30 minutes vs 2 hours of trial-and-error ### Profiling Methodology **Three-step profiling process**: 1. **Capture Baseline** (before optimization) - Total frame time - Time per major system (AI, physics, rendering, pathfinding) - CPU vs GPU bound - Memory allocations per frame - Cache misses (if profiler supports) 2. **Identify Bottleneck** (80/20 rule) - Sort functions by time spent - Focus on top 3-5 functions (usually 80% of time) - Understand WHY they're slow (algorithm, data layout, cache misses) 3. 
**Validate Improvement** (after each optimization) - Measure same metrics - Calculate speedup ratio - Check for regressions (new bottlenecks) - Iterate until target met **Profiling Tools**: - **Python**: cProfile, line_profiler, memory_profiler, py-spy - **C++**: VTune, perf, Instruments (Mac), Very Sleepy - **Unity**: Unity Profiler, Deep Profile mode - **Unreal**: Unreal Insights, stat commands - **Browser**: Chrome DevTools Performance tab **Example Profiling Output**: ``` Total frame time: 100ms (10 FPS) Function Time % of Frame ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ update_vision_checks() 80ms 80% ← BOTTLENECK update_ai() 10ms 10% update_pathfinding() 5ms 5% update_physics() 3ms 3% render() 2ms 2% Diagnosis: O(n²) vision checks (1000 units × 1000 = 1M checks/frame) Solution: Spatial partitioning → O(n) checks ``` ### Spatial Partitioning **Problem**: Proximity queries are O(n²) when checking every entity against every other - 100 entities = 10,000 checks - 1,000 entities = 1,000,000 checks (death) - 10,000 entities = 100,000,000 checks (impossible) **Solution**: Divide space into regions, only check entities in nearby regions **Spatial Hash Grid** (simplest, fastest for uniform distribution) - Divide world into fixed-size cells (e.g., 50×50 units) - Hash entity position to cell(s) - Query: Check only entities in neighboring cells - Complexity: O(n) to build, O(1) average query - Best for: Mostly uniform entity distribution **Quadtree** (adaptive, good for clustered entities) - Recursively subdivide space into 4 quadrants - Split when cell exceeds threshold (e.g., 10 entities) - Query: Descend tree, check overlapping nodes - Complexity: O(n log n) to build, O(log n) average query - Best for: Entities clustered in areas **Octree** (3D version of quadtree) - Recursively subdivide 3D space into 8 octants - Same benefits as quadtree for 3D worlds - Best for: 3D flight sims, space games, underwater **Decision Framework**: ``` Spatial Partitioning Choice: ├─ 2D WORLD with UNIFORM DISTRIBUTION? │ └─ Use Spatial Hash Grid (simplest, fastest) │ ├─ 2D WORLD with CLUSTERED ENTITIES? │ └─ Use Quadtree (adapts to density) │ ├─ 3D WORLD? │ └─ Use Octree (3D quadtree) │ └─ VERY LARGE WORLD (multiple km²)? 
└─ Use Hierarchical Grid (multiple grids at different scales) ``` **Performance Impact**: - 1000 units: O(n²) = 1,000,000 checks → O(n) = 1,000 checks = **1000× faster** - Typical speedup: 50-100× in practice (accounting for grid overhead) ### Level of Detail (LOD) **Concept**: Reduce computation for entities that don't need full precision **Distance-Based LOD Levels**: - **LOD 0** (0-50 units from camera): Full detail - Full AI decision-making (10 Hz) - Precise pathfinding - Detailed animations - All visual effects - **LOD 1** (50-100 units): Reduced detail - Simplified AI (5 Hz) - Coarse pathfinding (waypoints only) - Simplified animations - Reduced effects - **LOD 2** (100-200 units): Minimal detail - Basic AI (1 Hz) - Straight-line movement - Static pose or simple animation - No effects - **LOD 3** (200+ units): Culled or dormant - State update only (0.2 Hz) - No pathfinding - Billboards or invisible - No physics **Importance-Based LOD** (better than distance alone): ```python def calculate_lod_level(entity, camera, player): # Multiple factors determine importance distance = entity.distance_to(camera) is_player_unit = entity.team == player.team is_in_combat = entity.in_combat is_selected = entity in player.selection # Important entities always get high LOD if is_selected: return 0 # Always full detail if is_player_unit and is_in_combat: return 0 # Player's units in combat = critical # Distance-based for others if distance < 50: return 0 elif distance < 100: return 1 elif distance < 200: return 2 else: return 3 ``` **Smooth LOD Transitions** (avoid popping): - **Hysteresis**: Different thresholds for upgrading vs downgrading - Upgrade LOD at 90 units - Downgrade LOD at 110 units - 20-unit buffer prevents thrashing - **Time delay**: Wait N seconds before downgrading LOD - Prevents rapid flicker at boundary - **Blend animations**: Cross-fade between LOD levels - 0.5-1 second blend **Behavior LOD Examples**: | System | LOD 0 (Full) | LOD 1 (Reduced) | LOD 2 (Minimal) | LOD 3 (Dormant) | |--------|--------------|-----------------|-----------------|-----------------| | **AI** | Behavior tree 10 Hz | Simple FSM 5 Hz | Follow path 1 Hz | State only 0.2 Hz | | **Pathfinding** | Full A* | Hierarchical | Straight line | None | | **Vision** | 360° scan 10 Hz | Forward cone 5 Hz | None | None | | **Physics** | Full collision | Bounding box | None | None | | **Animation** | Full skeleton | 5 bones | Static pose | None | | **Audio** | 3D positioned | 2D ambient | None | None | **Performance Impact**: - 1000 units: 100% at LOD 0 vs 20% at LOD 0 + 80% at LOD 1-3 = **3-5× faster** ### Time-Slicing **Concept**: Spread expensive operations across multiple frames to stay within frame budget **Frame Time Budget** (60 FPS = 16.67ms per frame): ``` Frame Budget (16.67ms total): ├─ Rendering: 6ms (40%) ├─ AI: 4ms (24%) ├─ Physics: 3ms (18%) ├─ Pathfinding: 2ms (12%) └─ Other: 1.67ms (10%) ``` **Time-Slicing Pattern 1: Fixed Budget Per Frame** ```python class TimeSlicedSystem: def __init__(self, budget_ms=2.0): self.budget = budget_ms self.pending_work = [] def add_work(self, work_item, priority=0): # Priority queue: higher priority = processed first heapq.heappush(self.pending_work, (-priority, work_item)) def update(self, dt): start_time = time.time() processed = 0 while self.pending_work and (time.time() - start_time) < self.budget: priority, work_item = heapq.heappop(self.pending_work) work_item.execute() processed += 1 return processed # Usage: Pathfinding pathfinding_system = 
TimeSlicedSystem(budget_ms=2.0)

for unit in units_needing_paths:
    priority = calculate_priority(unit)  # Player units = high priority
    pathfinding_system.add_work(PathfindRequest(unit), priority)

# Each frame: process as many as fit in 2ms budget
paths_found = pathfinding_system.update(dt)
```

**Time-Slicing Pattern 2: Amortized Updates**

```python
class AmortizedUpdateManager:
    def __init__(self, entities, updates_per_frame=200):
        self.entities = entities
        self.updates_per_frame = updates_per_frame
        self.current_index = 0

    def update(self, dt):
        # Update N entities per frame
        for i in range(self.updates_per_frame):
            entity = self.entities[self.current_index]
            entity.expensive_update(dt)
            self.current_index = (self.current_index + 1) % len(self.entities)

# All entities updated every N frames
# 1000 entities / 200 per frame = every 5 frames = 12 Hz at 60 FPS

# Priority-based amortization
def update_with_priority(entities, frame_count):
    for i, entity in enumerate(entities):
        # Distance-based update frequency
        distance = entity.distance_to_camera()
        if distance < 50:
            entity.update()  # Every frame (60 Hz)
        elif distance < 100 and frame_count % 2 == 0:
            entity.update()  # Every 2 frames (30 Hz)
        elif distance < 200 and frame_count % 5 == 0:
            entity.update()  # Every 5 frames (12 Hz)
        elif frame_count % 30 == 0:
            entity.update()  # Every 30 frames (2 Hz)
```

**Time-Slicing Pattern 3: Incremental Processing**

```python
class IncrementalPathfinder:
    """Find path over multiple frames instead of blocking"""
    def __init__(self, max_nodes_per_frame=100):
        self.max_nodes = max_nodes_per_frame
        self.open_set = []
        self.closed_set = set()
        self.current_request = None

    def start_pathfind(self, start, goal):
        self.current_request = PathRequest(start, goal)
        heapq.heappush(self.open_set, (0, start))
        return self.current_request

    def step(self):
        """Process up to max_nodes this frame, return True if done"""
        if not self.current_request:
            return True

        nodes_processed = 0
        while self.open_set and nodes_processed < self.max_nodes:
            _, current = heapq.heappop(self.open_set)  # Entries are (priority, node)
            if current == self.current_request.goal:
                self.current_request.path = reconstruct_path(current)
                self.current_request.complete = True
                return True

            # Expand neighbors...
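            # (Hedged sketch of the elided expansion step - assumes hypothetical
            #  helpers grid_neighbors(), move_cost(), heuristic() plus g_score /
            #  came_from dicts that a full A* implementation would maintain.)
            # for neighbor in grid_neighbors(current):
            #     if neighbor in self.closed_set:
            #         continue
            #     g = g_score[current] + move_cost(current, neighbor)
            #     if g < g_score.get(neighbor, float('inf')):
            #         g_score[neighbor] = g
            #         came_from[neighbor] = current
            #         f = g + heuristic(neighbor, self.current_request.goal)
            #         heapq.heappush(self.open_set, (f, neighbor))
            # self.closed_set.add(current)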
nodes_processed += 1 return False # Not done yet, continue next frame # Usage pathfinder = IncrementalPathfinder(max_nodes_per_frame=100) request = pathfinder.start_pathfind(unit.pos, target.pos) # Each frame while not request.complete: pathfinder.step() # Process 100 nodes, spread over multiple frames ``` **Performance Impact**: - 1000 expensive updates: 1000/frame → 200/frame = **5× faster** - Pathfinding: Blocking 50ms → 2ms budget = stays at 60 FPS ### Caching Strategies **When to Cache**: - Expensive calculations used repeatedly (pathfinding, line-of-sight) - Results that change infrequently (static paths, terrain visibility) - Deterministic results (same input = same output) **Cache Design Pattern**: ```python class PerformanceCache: def __init__(self, max_size=10000, ttl_seconds=60.0): self.cache = {} # key -> CacheEntry self.max_size = max_size self.ttl = ttl_seconds self.access_times = {} # LRU tracking self.insert_times = {} # TTL tracking def get(self, key): current_time = time.time() if key not in self.cache: return None # Check TTL (time-to-live) if current_time - self.insert_times[key] > self.ttl: del self.cache[key] del self.access_times[key] del self.insert_times[key] return None # Update LRU self.access_times[key] = current_time return self.cache[key] def put(self, key, value): current_time = time.time() # Evict if full (LRU eviction) if len(self.cache) >= self.max_size: # Find least recently used lru_key = min(self.access_times, key=self.access_times.get) del self.cache[lru_key] del self.access_times[lru_key] del self.insert_times[lru_key] self.cache[key] = value self.access_times[key] = current_time self.insert_times[key] = current_time def invalidate(self, key): """Explicit invalidation when data changes""" if key in self.cache: del self.cache[key] del self.access_times[key] del self.insert_times[key] def invalidate_region(self, x, y, radius): """Invalidate all cache entries in region (e.g., terrain changed)""" keys_to_remove = [] for key in self.cache: if self._key_in_region(key, x, y, radius): keys_to_remove.append(key) for key in keys_to_remove: self.invalidate(key) # Usage: Path caching path_cache = PerformanceCache(max_size=5000, ttl_seconds=30.0) def get_or_calculate_path(start, goal): # Quantize to grid for cache key (allow slight position variance) key = (round(start.x), round(start.y), round(goal.x), round(goal.y)) cached = path_cache.get(key) if cached: return cached # Cache hit! # Cache miss - calculate path = expensive_pathfinding(start, goal) path_cache.put(key, path) return path # Invalidate when terrain changes def on_building_placed(x, y): path_cache.invalidate_region(x, y, radius=100) ``` **Cache Invalidation Strategies**: 1. **Time-To-Live (TTL)**: Expire after N seconds - Good for: Dynamic environments (traffic, weather) - Example: Path cache with 30 second TTL 2. **Event-Based**: Invalidate on specific events - Good for: Known change triggers (building placed, obstacle moved) - Example: Invalidate paths when wall built 3. 
**Hybrid**: TTL + event-based - Good for: Most scenarios - Example: 60 second TTL OR invalidate on terrain change **Performance Impact**: - Pathfinding with 60% cache hit rate: 40% of requests calculate = **2.5× faster** - Line-of-sight with 80% cache hit rate: 20% of requests calculate = **5× faster** ### Data-Oriented Design (DOD) **Concept**: Organize data for cache-friendly access patterns **Array of Structs (AoS)** - Traditional OOP approach: ```python class Unit: def __init__(self): self.x = 0.0 self.y = 0.0 self.health = 100 self.damage = 10 # ... 20 more fields ... units = [Unit() for _ in range(1000)] # Update positions (cache-unfriendly) for unit in units: unit.x += unit.velocity_x * dt # Load entire Unit struct for each unit unit.y += unit.velocity_y * dt # Only using 2 fields, wasting cache ``` **Struct of Arrays (SoA)** - DOD approach: ```python class UnitSystem: def __init__(self, count): # Separate arrays for each component self.positions_x = [0.0] * count self.positions_y = [0.0] * count self.velocities_x = [0.0] * count self.velocities_y = [0.0] * count self.health = [100] * count self.damage = [10] * count # ... more arrays ... units = UnitSystem(1000) # Update positions (cache-friendly) for i in range(len(units.positions_x)): units.positions_x[i] += units.velocities_x[i] * dt # Sequential memory access units.positions_y[i] += units.velocities_y[i] * dt # Perfect for CPU cache ``` **Why SoA is Faster**: - CPU cache lines are 64 bytes - AoS: Load 1-2 units per cache line (if Unit is 32-64 bytes) - SoA: Load 8-16 floats per cache line (4 bytes each) - **4-8× better cache utilization** = 1.5-3× faster in practice **When to Use SoA**: - Batch operations on many entities (position updates, damage calculations) - Systems that only need 1-2 fields from entity - Performance-critical inner loops **When AoS is Okay**: - Small entity counts (< 100) - Operations needing many fields - Prototyping (DOD is optimization, not default) **ECS Architecture** (combines SoA + component composition): ```python # Components (pure data) class Position: x: float y: float class Velocity: x: float y: float class Health: current: int max: int # Systems (pure logic) class MovementSystem: def update(self, positions, velocities, dt): # Batch process all entities with Position + Velocity for i in range(len(positions)): positions[i].x += velocities[i].x * dt positions[i].y += velocities[i].y * dt class CombatSystem: def update(self, positions, health, attacks): # Only process entities with Position + Health + Attack # ... 
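        # (Hedged sketch of the elided combat pass - assumes a hypothetical
        #  `attacks` component array where attacks[i].target is the array index
        #  of the entity being attacked.)
        # for i in range(len(attacks)):
        #     target = attacks[i].target
        #     dx = positions[target].x - positions[i].x
        #     dy = positions[target].y - positions[i].y
        #     if dx * dx + dy * dy <= attacks[i].range ** 2:
        #         health[target].current -= attacks[i].damage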
# Entity is just an ID
entities = [Entity(id=i) for i in range(1000)]
```

**Performance Impact**:
- Cache-friendly data layout: 1.5-3× faster for batch operations
- ECS architecture: Enables efficient multithreading (no shared mutable state)

### Multithreading (Use Sparingly)

**When to Multithread**:
- ✅ After all other optimizations (if still needed)
- ✅ Embarrassingly parallel work (no dependencies)
- ✅ Long-running tasks (benefit outweighs overhead)
- ✅ Native code (C++, Rust) - avoids GIL

**When NOT to Multithread**:
- ❌ Python CPU-bound code (GIL limits to 1 core)
- ❌ Before trying simpler optimizations
- ❌ Lots of shared mutable state (locking overhead)
- ❌ Small tasks (thread overhead > savings)

**Job System Pattern** (best practice):

```python
from concurrent.futures import ThreadPoolExecutor
import threading

class JobSystem:
    def __init__(self, num_workers=4):
        self.executor = ThreadPoolExecutor(max_workers=num_workers)

    def submit_batch(self, jobs):
        """Submit list of independent jobs, return futures"""
        futures = [self.executor.submit(job.execute) for job in jobs]
        return futures

    def wait_all(self, futures):
        """Wait for all jobs to complete"""
        results = [future.result() for future in futures]
        return results

# Good: Parallel pathfinding (independent tasks)
job_system = JobSystem(num_workers=4)
path_jobs = [PathfindJob(unit.pos, unit.target) for unit in units_needing_paths]
futures = job_system.submit_batch(path_jobs)

# Do other work while pathfinding runs...

# Collect results
paths = job_system.wait_all(futures)
```

**Data Parallelism Pattern** (no shared mutable state):

```python
def update_positions_parallel(positions, velocities, dt, num_workers=4):
    """Update positions in parallel batches"""
    def update_batch(start_idx, end_idx):
        # Each worker gets exclusive slice (no locks needed)
        for i in range(start_idx, end_idx):
            positions[i].x += velocities[i].x * dt
            positions[i].y += velocities[i].y * dt

    # Split work into batches
    batch_size = len(positions) // num_workers
    with ThreadPoolExecutor(max_workers=num_workers) as executor:
        futures = []
        for worker_id in range(num_workers):
            start = worker_id * batch_size
            end = start + batch_size if worker_id < num_workers - 1 else len(positions)
            futures.append(executor.submit(update_batch, start, end))

        # Wait for all workers
        for future in futures:
            future.result()
```

**Common Multithreading Pitfalls**:

1. **Race Conditions** (shared mutable state)

```python
# BAD: Multiple threads modifying same list
for unit in units:
    threading.Thread(target=unit.update, args=(all_units,)).start()
    # Each thread reads/writes all_units = data race!

# GOOD: Read-only shared data
for unit in units:
    # units is read-only for all threads
    # Each unit only modifies itself (exclusive ownership)
    threading.Thread(target=unit.update, args=(units,)).start()
```

2. **False Sharing** (cache line contention)

```python
# BAD: Adjacent array elements on same cache line
shared_counters = [0] * 8  # 8 threads updating 8 counters
# Thread 0 updates counter[0], Thread 1 updates counter[1]
# Both on same 64-byte cache line = cache thrashing!

# GOOD: Pad each counter onto its own cache line
# (This matters for contiguous native memory - C/C++ arrays, NumPy buffers,
#  shared-memory blocks; CPython lists hold pointers to heap objects, so treat
#  this as an illustration of the layout, not a CPython-level fix.)
class PaddedCounter:
    def __init__(self):
        self.value = 0
        self._padding = [0] * 15  # Pad to roughly one 64-byte cache line

shared_counters = [PaddedCounter() for _ in range(8)]
```

3. **Excessive Locking** (defeats parallelism)

```python
# BAD: Single lock for everything
lock = threading.Lock()

def update_unit(unit):
    with lock:  # Only 1 thread can work at a time!
unit.update() # GOOD: Lock-free or fine-grained locking def update_unit(unit): unit.update() # Each unit independent, no lock needed ``` **Performance Impact**: - 4 cores: Ideal speedup = 4×, realistic = 2-3× (overhead, Amdahl's law) - Python: Minimal (GIL), use multiprocessing or native extensions - C++/Rust: Good (2-3× on 4 cores for parallelizable work) ## Decision Frameworks ### Framework 1: Systematic Optimization Process **Use this process EVERY time performance is inadequate**: ``` Step 1: PROFILE (mandatory, do first) ├─ Capture baseline metrics ├─ Identify top 3-5 bottlenecks (80% of time) └─ Understand WHY slow (algorithm, data, cache) Step 2: ALGORITHMIC (10-100× gains) ├─ Is bottleneck O(n²) or worse? │ ├─ Proximity queries? → Spatial partitioning │ ├─ Pathfinding? → Hierarchical, flow fields, or caching │ └─ Sorting? → Better algorithm or less frequent ├─ Is bottleneck doing redundant work? │ └─ Add caching with LRU + TTL └─ Measure improvement, re-profile Step 3: LOD (2-10× gains) ├─ Can distant entities use less detail? │ ├─ Distance-based LOD levels (4 levels) │ ├─ Importance weighting (player units > NPC) │ └─ Smooth transitions (hysteresis, blending) └─ Measure improvement, re-profile Step 4: TIME-SLICING (2-5× gains) ├─ Can work spread across multiple frames? │ ├─ Set frame budget per system (2-4ms typical) │ ├─ Priority queue (important work first) │ └─ Amortized updates (N entities per frame) └─ Measure improvement, re-profile Step 5: DATA-ORIENTED DESIGN (1.5-3× gains) ├─ Is bottleneck cache-unfriendly? │ ├─ Convert AoS → SoA for batch operations │ ├─ Group hot data together │ └─ Minimize pointer chasing └─ Measure improvement, re-profile Step 6: MULTITHREADING (1.5-4× gains, high complexity) ├─ Still below target after above? │ ├─ Identify embarrassingly parallel work │ ├─ Job system for independent tasks │ ├─ Data parallelism (no shared mutable state) │ └─ Avoid locks (lock-free or per-entity ownership) └─ Measure improvement, re-profile Step 7: VALIDATE ├─ Met target frame rate? → Done! ├─ Still slow? → Return to Step 1, find new bottleneck └─ Regression? → Revert and try different approach ``` **Example Application** (1000-unit RTS at 10 FPS): 1. Profile: Vision checks are 80% (80ms/100ms frame) 2. Algorithmic: Add spatial hash grid → 40 FPS (15ms vision checks) 3. LOD: Distant units update at 5 Hz → 55 FPS (11ms vision) 4. Time-slicing: 2ms pathfinding budget → 60 FPS ✅ **Done** 5. (Skip DOD and multithreading - already at target) ### Framework 2: Choosing Spatial Partitioning ``` START: What's my proximity query scenario? ├─ 2D WORLD with UNIFORM ENTITY DISTRIBUTION? │ └─ Use SPATIAL HASH GRID │ - Cell size = 2× query radius (e.g., vision range 50 → cells 100×100) │ - O(n) build, O(1) query │ - Simplest to implement │ - Example: RTS units on open battlefield │ ├─ 2D WORLD with CLUSTERED ENTITIES? │ └─ Use QUADTREE │ - Split threshold = 10-20 entities per node │ - Max depth = 8-10 levels │ - O(n log n) build, O(log n) query │ - Example: City simulation (dense downtown, sparse suburbs) │ ├─ 3D WORLD? │ └─ Use OCTREE │ - Same as quadtree, but 8 children per node │ - Example: Space game, underwater sim │ ├─ VERY LARGE WORLD (> 10 km²)? │ └─ Use HIERARCHICAL GRID │ - Coarse grid (1km cells) + fine grid (50m cells) per coarse cell │ - Example: MMO world, open-world game │ └─ ENTITIES MOSTLY STATIONARY? 
└─ Use STATIC QUADTREE/OCTREE - Build once, query many times - Example: Building placement, static obstacles ``` **Implementation Complexity**: - Spatial Hash Grid: **1-2 hours** (simple) - Quadtree: **3-5 hours** (moderate) - Octree: **4-6 hours** (moderate) - Hierarchical Grid: **6-10 hours** (complex) **Performance Characteristics**: | Method | Build Time | Query Time | Memory | Best For | |--------|------------|------------|--------|----------| | Hash Grid | O(n) | O(1) avg | Low | Uniform distribution | | Quadtree | O(n log n) | O(log n) avg | Medium | Clustered entities | | Octree | O(n log n) | O(log n) avg | Medium | 3D worlds | | Hierarchical | O(n) | O(1) avg | Higher | Massive worlds | ### Framework 3: LOD Level Assignment ``` For each entity, assign LOD level based on: ├─ IMPORTANCE (highest priority) │ ├─ Player-controlled? → LOD 0 (always full detail) │ ├─ Player's team AND in combat? → LOD 0 │ ├─ Selected units? → LOD 0 │ ├─ Quest-critical NPCs? → LOD 0 │ └─ Otherwise, use distance-based... │ ├─ DISTANCE FROM CAMERA (secondary) │ ├─ 0-50 units → LOD 0 (full detail) │ │ - Update: 60 Hz (every frame) │ │ - AI: Full behavior tree │ │ - Pathfinding: Precise A* │ │ - Animation: Full skeleton │ │ │ ├─ 50-100 units → LOD 1 (reduced) │ │ - Update: 30 Hz (every 2 frames) │ │ - AI: Simplified FSM │ │ - Pathfinding: Hierarchical │ │ - Animation: 10 bones │ │ │ ├─ 100-200 units → LOD 2 (minimal) │ │ - Update: 12 Hz (every 5 frames) │ │ - AI: Basic scripted │ │ - Pathfinding: Waypoints │ │ - Animation: Static pose │ │ │ └─ 200+ units → LOD 3 (culled) │ - Update: 2 Hz (every 30 frames) │ - AI: State only (no decisions) │ - Pathfinding: None │ - Animation: None (invisible or billboard) │ └─ SCREEN SIZE (tertiary) ├─ Occluded or < 5 pixels? → LOD 3 (culled) └─ Small on screen? → Bump LOD down 1 level ``` **Hysteresis to Prevent LOD Thrashing**: ```python # Without hysteresis (bad - flickers) lod = 0 if distance < 100 else 1 # Entity at 99-101 units: LOD flip-flops every frame! # With hysteresis (good - stable) if distance < 90: lod = 0 # Upgrade at 90 elif distance > 110: lod = 1 # Downgrade at 110 # else: keep current LOD # 20-unit buffer prevents thrashing ``` ### Framework 4: When to Use Multithreading ``` Should I multithread this system? ├─ ALREADY optimized algorithmic/LOD/caching? │ └─ NO → Do those FIRST (10-100× gains vs 2-4× for threading) │ ├─ WORK IS EMBARRASSINGLY PARALLEL? │ ├─ Independent tasks (pathfinding requests)? → YES, good candidate │ ├─ Lots of shared mutable state? → NO, locking kills performance │ └─ Need results immediately? → NO, adds latency │ ├─ TASK DURATION > 1ms? │ ├─ YES → Threading overhead is small % of work │ └─ NO → Overhead dominates, not worth it │ ├─ PYTHON or NATIVE CODE? │ ├─ Python → Use multiprocessing (avoid GIL) or native extensions │ └─ C++/Rust → ThreadPool or job system works well │ ├─ COMPLEXITY COST JUSTIFIED? │ ├─ Can maintain code with debugging difficulty? → Consider it │ └─ Team inexperienced with threading? → Avoid (bugs are costly) │ └─ EXPECTED SPEEDUP > 1.5×? ├─ 4 cores: Realistic = 2-3× (not 4× due to overhead) ├─ Worth complexity? → Your call └─ Not worth it? → Try other optimizations first ``` **Threading Decision Tree Example**: ``` Scenario: Pathfinding for 100 units ├─ Already using caching? YES (60% hit rate) ├─ Work is parallel? YES (each path independent) ├─ Task duration? 5ms per path (good for threading) ├─ Language? 
Python (GIL problem) │ └─ Solution: Use multiprocessing or native pathfinding library ├─ Complexity justified? 100 paths × 5ms = 500ms → 60ms with 8 workers │ └─ YES, worth it (8× speedup) │ Decision: Use multiprocessing.Pool with 8 workers ``` ### Framework 5: Frame Time Budget Allocation **60 FPS = 16.67ms per frame, 30 FPS = 33.33ms per frame** **Budget Template** (adjust based on game type): ``` 60 FPS Frame Budget (16.67ms total): ├─ Rendering: 6.0ms (40%) │ ├─ Culling: 1.0ms │ ├─ Draw calls: 4.0ms │ └─ Post-processing: 1.0ms │ ├─ AI: 3.5ms (24%) │ ├─ Behavior trees: 2.0ms │ ├─ Sensors/perception: 1.0ms │ └─ Decision-making: 0.5ms │ ├─ Physics: 3.0ms (18%) │ ├─ Broad-phase: 0.5ms │ ├─ Narrow-phase: 1.5ms │ └─ Constraint solving: 1.0ms │ ├─ Pathfinding: 2.0ms (12%) │ ├─ New paths: 1.5ms │ └─ Path following: 0.5ms │ ├─ Gameplay: 1.0ms (6%) │ ├─ Economy updates: 0.3ms │ ├─ Event processing: 0.4ms │ └─ UI updates: 0.3ms │ └─ Buffer: 1.17ms (7%) └─ Unexpected spikes, GC, etc. ``` **Budget by Game Type**: | Game Type | Rendering | AI | Physics | Pathfinding | Gameplay | |-----------|-----------|-----|---------|-------------|----------| | **RTS** | 30% | 30% | 10% | 20% | 10% | | **FPS** | 50% | 15% | 20% | 5% | 10% | | **City Builder** | 35% | 20% | 5% | 15% | 25% | | **Physics Sim** | 30% | 5% | 50% | 5% | 10% | | **Turn-Based** | 60% | 15% | 5% | 10% | 10% | **Enforcement Pattern**: ```python class FrameBudgetMonitor: def __init__(self): self.budgets = { 'rendering': 6.0, 'ai': 3.5, 'physics': 3.0, 'pathfinding': 2.0, 'gameplay': 1.0 } self.measurements = {key: [] for key in self.budgets} def measure(self, system_name, func): start = time.perf_counter() result = func() elapsed_ms = (time.perf_counter() - start) * 1000 self.measurements[system_name].append(elapsed_ms) # Alert if over budget if elapsed_ms > self.budgets[system_name]: print(f"⚠️ {system_name} over budget: {elapsed_ms:.2f}ms / {self.budgets[system_name]:.2f}ms") return result def report(self): print("Frame Time Budget Report:") for system, budget in self.budgets.items(): avg = sum(self.measurements[system]) / len(self.measurements[system]) pct = (avg / budget) * 100 print(f" {system}: {avg:.2f}ms / {budget:.2f}ms ({pct:.0f}%)") # Usage monitor = FrameBudgetMonitor() def game_loop(): monitor.measure('ai', lambda: update_ai(units)) monitor.measure('physics', lambda: update_physics(world)) monitor.measure('pathfinding', lambda: update_pathfinding(units)) monitor.measure('rendering', lambda: render_scene(camera)) if frame_count % 300 == 0: # Every 5 seconds monitor.report() ``` ## Implementation Patterns ### Pattern 1: Spatial Hash Grid for Proximity Queries **Problem**: Checking every unit against every other unit for vision/attack is O(n²) - 1000 units = 1,000,000 checks per frame = death **Solution**: Spatial hash grid divides world into cells, only check nearby cells ```python import math from collections import defaultdict class SpatialHashGrid: """ Spatial partitioning using hash grid for O(1) average query time. 
Best for: Uniform entity distribution, 2D worlds Cell size rule: 2× maximum query radius """ def __init__(self, cell_size=100): self.cell_size = cell_size self.grid = defaultdict(list) # (cell_x, cell_y) -> [entities] def _hash(self, x, y): """Convert world position to cell coordinates""" cell_x = int(math.floor(x / self.cell_size)) cell_y = int(math.floor(y / self.cell_size)) return (cell_x, cell_y) def clear(self): """Clear all entities (call at start of frame)""" self.grid.clear() def insert(self, entity): """Insert entity into grid""" cell = self._hash(entity.x, entity.y) self.grid[cell].append(entity) def query_radius(self, x, y, radius): """ Find all entities within radius of (x, y). Returns: List of entities in range Complexity: O(k) where k = entities in nearby cells (typically 10-50) """ # Calculate which cells to check min_cell_x = int(math.floor((x - radius) / self.cell_size)) max_cell_x = int(math.floor((x + radius) / self.cell_size)) min_cell_y = int(math.floor((y - radius) / self.cell_size)) max_cell_y = int(math.floor((y + radius) / self.cell_size)) candidates = [] # Check all cells in range for cell_x in range(min_cell_x, max_cell_x + 1): for cell_y in range(min_cell_y, max_cell_y + 1): cell = (cell_x, cell_y) candidates.extend(self.grid.get(cell, [])) # Filter by exact distance (candidates may be outside radius) results = [] radius_sq = radius * radius for entity in candidates: dx = entity.x - x dy = entity.y - y dist_sq = dx * dx + dy * dy if dist_sq <= radius_sq: results.append(entity) return results def query_rect(self, min_x, min_y, max_x, max_y): """Find all entities in rectangular region""" min_cell_x = int(math.floor(min_x / self.cell_size)) max_cell_x = int(math.floor(max_x / self.cell_size)) min_cell_y = int(math.floor(min_y / self.cell_size)) max_cell_y = int(math.floor(max_y / self.cell_size)) results = [] for cell_x in range(min_cell_x, max_cell_x + 1): for cell_y in range(min_cell_y, max_cell_y + 1): cell = (cell_x, cell_y) results.extend(self.grid.get(cell, [])) return results # Usage Example class Unit: def __init__(self, x, y, team): self.x = x self.y = y self.team = team self.vision_range = 50 self.attack_range = 20 def game_loop(): units = [Unit(random() * 1000, random() * 1000, random_team()) for _ in range(1000)] # Cell size = 2× max query radius (vision range) spatial_grid = SpatialHashGrid(cell_size=100) while running: # Rebuild grid each frame (units move) spatial_grid.clear() for unit in units: spatial_grid.insert(unit) # Update units for unit in units: # OLD (O(n²)): Check all 1000 units = 1,000,000 checks # enemies = [u for u in units if u.team != unit.team and distance(u, unit) < vision_range] # NEW (O(k)): Check ~10-50 units in nearby cells nearby = spatial_grid.query_radius(unit.x, unit.y, unit.vision_range) enemies = [u for u in nearby if u.team != unit.team] # Attack enemies in range for enemy in enemies: dist_sq = (unit.x - enemy.x)**2 + (unit.y - enemy.y)**2 if dist_sq <= unit.attack_range**2: enemy.health -= unit.damage # Performance: O(n²) → O(n) # 1000 units: 1,000,000 checks → ~30,000 checks (nearby cells only) # Speedup: ~30-50× for vision/attack queries ``` ### Pattern 2: Quadtree for Clustered Entities **When to use**: Entities cluster in specific areas (cities, battlefields) with sparse regions ```python class Quadtree: """ Adaptive spatial partitioning for clustered entity distributions. 
Best for: Non-uniform distribution, entities cluster in areas Automatically subdivides dense regions """ class Node: def __init__(self, x, y, width, height, max_entities=10, max_depth=8): self.x = x self.y = y self.width = width self.height = height self.max_entities = max_entities self.max_depth = max_depth self.entities = [] self.children = None # [NW, NE, SW, SE] when subdivided def is_leaf(self): return self.children is None def contains(self, entity): """Check if entity is within this node's bounds""" return (self.x <= entity.x < self.x + self.width and self.y <= entity.y < self.y + self.height) def subdivide(self): """Split into 4 quadrants""" hw = self.width / 2 # half width hh = self.height / 2 # half height # Create 4 children: NW, NE, SW, SE self.children = [ Quadtree.Node(self.x, self.y, hw, hh, self.max_entities, self.max_depth - 1), # NW Quadtree.Node(self.x + hw, self.y, hw, hh, self.max_entities, self.max_depth - 1), # NE Quadtree.Node(self.x, self.y + hh, hw, hh, self.max_entities, self.max_depth - 1), # SW Quadtree.Node(self.x + hw, self.y + hh, hw, hh, self.max_entities, self.max_depth - 1), # SE ] # Move entities to children for entity in self.entities: for child in self.children: if child.contains(entity): child.insert(entity) break self.entities.clear() def insert(self, entity): """Insert entity into quadtree""" if not self.contains(entity): return False if self.is_leaf(): self.entities.append(entity) # Subdivide if over capacity and can go deeper if len(self.entities) > self.max_entities and self.max_depth > 0: self.subdivide() else: # Insert into appropriate child for child in self.children: if child.insert(entity): break return True def query_radius(self, x, y, radius, results): """Find entities within radius of (x, y)""" # Check if search circle intersects this node closest_x = max(self.x, min(x, self.x + self.width)) closest_y = max(self.y, min(y, self.y + self.height)) dx = x - closest_x dy = y - closest_y dist_sq = dx * dx + dy * dy if dist_sq > radius * radius: return # No intersection if self.is_leaf(): # Check entities in this leaf radius_sq = radius * radius for entity in self.entities: dx = entity.x - x dy = entity.y - y if dx * dx + dy * dy <= radius_sq: results.append(entity) else: # Recurse into children for child in self.children: child.query_radius(x, y, radius, results) def __init__(self, world_width, world_height, max_entities=10, max_depth=8): self.root = Quadtree.Node(0, 0, world_width, world_height, max_entities, max_depth) def insert(self, entity): self.root.insert(entity) def query_radius(self, x, y, radius): results = [] self.root.query_radius(x, y, radius, results) return results # Usage quadtree = Quadtree(world_width=1000, world_height=1000, max_entities=10, max_depth=8) # Insert entities for unit in units: quadtree.insert(unit) # Query enemies_nearby = quadtree.query_radius(player.x, player.y, vision_range=50) # Performance: O(log n) average query # Adapts to entity distribution automatically ``` ### Pattern 3: Distance-Based LOD System **Problem**: All entities update at full frequency, wasting CPU on distant entities **Solution**: Update frequency based on distance from camera/player ```python class LODSystem: """ Level-of-detail system with smooth transitions and importance weighting. 
LOD 0: Full detail (near camera, important entities) LOD 1: Reduced detail (medium distance) LOD 2: Minimal detail (far distance) LOD 3: Dormant (very far, culled) """ # LOD configuration LOD_LEVELS = [ { 'name': 'LOD_0_FULL', 'distance_min': 0, 'distance_max': 50, 'update_hz': 60, # Every frame 'ai_enabled': True, 'pathfinding': 'full', # Precise A* 'animation': 'full', # Full skeleton 'physics': 'full' # Full collision }, { 'name': 'LOD_1_REDUCED', 'distance_min': 50, 'distance_max': 100, 'update_hz': 30, # Every 2 frames 'ai_enabled': True, 'pathfinding': 'hierarchical', 'animation': 'reduced', # 10 bones 'physics': 'bbox' # Bounding box only }, { 'name': 'LOD_2_MINIMAL', 'distance_min': 100, 'distance_max': 200, 'update_hz': 12, # Every 5 frames 'ai_enabled': False, # Scripted only 'pathfinding': 'waypoints', 'animation': 'static', # Static pose 'physics': 'none' }, { 'name': 'LOD_3_CULLED', 'distance_min': 200, 'distance_max': float('inf'), 'update_hz': 2, # Every 30 frames 'ai_enabled': False, 'pathfinding': 'none', 'animation': 'none', 'physics': 'none' } ] def __init__(self, camera, player): self.camera = camera self.player = player self.frame_count = 0 # Hysteresis to prevent LOD thrashing self.hysteresis = 20 # Units of distance buffer def calculate_lod(self, entity): """ Calculate LOD level for entity based on importance and distance. Priority: 1. Importance (player-controlled, in combat, selected) 2. Distance from camera 3. Screen size """ # Important entities always get highest LOD if self._is_important(entity): return 0 # Distance-based LOD distance = self._distance_to_camera(entity) # Current LOD (for hysteresis) current_lod = getattr(entity, 'lod_level', 0) # Determine LOD level with hysteresis for i, lod in enumerate(self.LOD_LEVELS): if i < current_lod: # Upgrading (closer): Use min distance if distance <= lod['distance_max'] - self.hysteresis: return i else: # Downgrading (farther): Use max distance if distance <= lod['distance_max'] + self.hysteresis: return i return len(self.LOD_LEVELS) - 1 def _is_important(self, entity): """Check if entity is important (always highest LOD)""" return (entity.player_controlled or entity.selected or (entity.team == self.player.team and entity.in_combat)) def _distance_to_camera(self, entity): dx = entity.x - self.camera.x dy = entity.y - self.camera.y return math.sqrt(dx * dx + dy * dy) def should_update(self, entity): """Check if entity should update this frame""" lod_level = entity.lod_level lod_config = self.LOD_LEVELS[lod_level] update_hz = lod_config['update_hz'] if update_hz >= 60: return True # Every frame # Calculate frame interval frame_interval = 60 // update_hz # 60 FPS baseline # Offset by entity ID to spread updates across frames return (self.frame_count + entity.id) % frame_interval == 0 def update(self, entities): """Update LOD levels and entities""" self.frame_count += 1 # Update LOD levels (cheap, do every frame) for entity in entities: entity.lod_level = self.calculate_lod(entity) # Update entities based on LOD (expensive, time-sliced) for entity in entities: if self.should_update(entity): lod_config = self.LOD_LEVELS[entity.lod_level] self._update_entity(entity, lod_config) def _update_entity(self, entity, lod_config): """Update entity according to LOD configuration""" if lod_config['ai_enabled']: entity.update_ai() if lod_config['pathfinding'] == 'full': entity.update_pathfinding_full() elif lod_config['pathfinding'] == 'hierarchical': entity.update_pathfinding_hierarchical() elif lod_config['pathfinding'] == 
'waypoints': entity.update_pathfinding_waypoints() if lod_config['animation'] != 'none': entity.update_animation(lod_config['animation']) if lod_config['physics'] == 'full': entity.update_physics_full() elif lod_config['physics'] == 'bbox': entity.update_physics_bbox() # Usage lod_system = LODSystem(camera, player) def game_loop(): lod_system.update(units) # Only entities that should_update() this frame were updated # Performance: 1000 units all at LOD 0 → mixed LOD levels # Typical distribution: 100 LOD0 + 300 LOD1 + 400 LOD2 + 200 LOD3 # Effective updates: 100 + 150 + 80 + 7 = 337 updates/frame # Speedup: 1000 → 337 = 3× faster ``` ### Pattern 4: Time-Sliced Pathfinding with Priority Queue **Problem**: 100 path requests × 5ms each = 500ms frame time (2 FPS) **Solution**: Process paths over multiple frames with priority (player units first) ```python import heapq import time from enum import Enum class PathPriority(Enum): """Priority levels for pathfinding requests""" CRITICAL = 0 # Player-controlled, combat HIGH = 1 # Player's units NORMAL = 2 # Visible units LOW = 3 # Off-screen units class PathRequest: def __init__(self, entity, start, goal, priority): self.entity = entity self.start = start self.goal = goal self.priority = priority self.path = None self.complete = False self.timestamp = time.time() class TimeSlicedPathfinder: """ Pathfinding system with frame time budget and priority queue. Features: - 2ms frame budget (stays at 60 FPS) - Priority queue (important requests first) - Incremental pathfinding (spread work over frames) - Request timeout (abandon old requests) """ def __init__(self, budget_ms=2.0, timeout_seconds=5.0): self.budget = budget_ms / 1000.0 # Convert to seconds self.timeout = timeout_seconds self.pending = [] # Priority queue: (priority, request) self.active_request = None self.pathfinder = AStarPathfinder() # Your pathfinding implementation # Statistics self.stats = { 'requests_submitted': 0, 'requests_completed': 0, 'requests_timeout': 0, 'avg_time_to_completion': 0 } def submit_request(self, entity, start, goal, priority=PathPriority.NORMAL): """Submit pathfinding request with priority""" request = PathRequest(entity, start, goal, priority) heapq.heappush(self.pending, (priority.value, request)) self.stats['requests_submitted'] += 1 return request def update(self, dt): """ Process pathfinding requests within frame budget. 
Returns: Number of paths completed this frame """ start_time = time.perf_counter() completed = 0 while time.perf_counter() - start_time < self.budget: # Get next request if not self.active_request: if not self.pending: break # No more work priority, request = heapq.heappop(self.pending) # Check timeout if time.time() - request.timestamp > self.timeout: self.stats['requests_timeout'] += 1 continue self.active_request = request self.pathfinder.start(request.start, request.goal) # Process active request incrementally # (process up to 100 nodes this frame) done = self.pathfinder.step(max_nodes=100) if done: # Request complete self.active_request.path = self.pathfinder.get_path() self.active_request.complete = True self.active_request.entity.path = self.active_request.path time_to_complete = time.time() - self.active_request.timestamp self._update_avg_time(time_to_complete) self.stats['requests_completed'] += 1 self.active_request = None completed += 1 return completed def _update_avg_time(self, time_to_complete): """Update moving average of completion time""" alpha = 0.1 # Smoothing factor current_avg = self.stats['avg_time_to_completion'] self.stats['avg_time_to_completion'] = ( alpha * time_to_complete + (1 - alpha) * current_avg ) def get_stats(self): """Get performance statistics""" pending_count = len(self.pending) + (1 if self.active_request else 0) return { **self.stats, 'pending_requests': pending_count, 'completion_rate': ( self.stats['requests_completed'] / max(1, self.stats['requests_submitted']) ) } # Usage pathfinder = TimeSlicedPathfinder(budget_ms=2.0) def game_loop(): # Submit pathfinding requests for unit in units_needing_paths: # Determine priority if unit.player_controlled: priority = PathPriority.CRITICAL elif unit.team == player.team: priority = PathPriority.HIGH elif unit.visible: priority = PathPriority.NORMAL else: priority = PathPriority.LOW pathfinder.submit_request(unit, unit.pos, unit.target, priority) # Process paths (stays within 2ms budget) paths_completed = pathfinder.update(dt) # Every 5 seconds, print stats if frame_count % 300 == 0: stats = pathfinder.get_stats() print(f"Pathfinding: {stats['requests_completed']} complete, " f"{stats['pending_requests']} pending, " f"avg time: {stats['avg_time_to_completion']:.2f}s") # Performance: # Without time-slicing: 100 paths × 5ms = 500ms frame (2 FPS) # With time-slicing: 2ms budget per frame = 60 FPS maintained # Paths complete over multiple frames, but high-priority paths finish first ``` ### Pattern 5: LRU Cache with TTL for Pathfinding **Problem**: Recalculating same paths repeatedly wastes CPU **Solution**: Cache paths with LRU eviction and time-to-live ```python import time from collections import OrderedDict class PathCache: """ LRU cache with TTL for pathfinding results. 
Features: - LRU eviction (least recently used) - TTL expiration (paths become stale) - Region invalidation (terrain changes) - Bounded memory (max size) """ def __init__(self, max_size=5000, ttl_seconds=30.0): self.cache = OrderedDict() # Maintains insertion order for LRU self.max_size = max_size self.ttl = ttl_seconds self.insert_times = {} # Statistics self.stats = { 'hits': 0, 'misses': 0, 'evictions': 0, 'expirations': 0, 'invalidations': 0 } def _make_key(self, start, goal): """Create cache key from start/goal positions""" # Quantize to grid (allows position variance within cell) # Cell size = 5 units (units within 5 units share same path) return ( round(start[0] / 5) * 5, round(start[1] / 5) * 5, round(goal[0] / 5) * 5, round(goal[1] / 5) * 5 ) def get(self, start, goal): """ Get cached path if available and not expired. Returns: Path if cached and valid, None otherwise """ key = self._make_key(start, goal) current_time = time.time() if key not in self.cache: self.stats['misses'] += 1 return None # Check TTL if current_time - self.insert_times[key] > self.ttl: # Expired del self.cache[key] del self.insert_times[key] self.stats['expirations'] += 1 self.stats['misses'] += 1 return None # Cache hit - move to end (most recently used) self.cache.move_to_end(key) self.stats['hits'] += 1 return self.cache[key] def put(self, start, goal, path): """Store path in cache""" key = self._make_key(start, goal) current_time = time.time() # Evict if at capacity (LRU) if len(self.cache) >= self.max_size and key not in self.cache: # Remove oldest (first item in OrderedDict) oldest_key = next(iter(self.cache)) del self.cache[oldest_key] del self.insert_times[oldest_key] self.stats['evictions'] += 1 # Store path self.cache[key] = path self.insert_times[key] = current_time # Move to end (most recently used) self.cache.move_to_end(key) def invalidate_region(self, x, y, radius): """ Invalidate all cached paths in region. Call when terrain changes (building placed, wall destroyed, etc.) """ radius_sq = radius * radius keys_to_remove = [] for key in self.cache: start_x, start_y, goal_x, goal_y = key # Check if start or goal in affected region dx_start = start_x - x dy_start = start_y - y dx_goal = goal_x - x dy_goal = goal_y - y if (dx_start * dx_start + dy_start * dy_start <= radius_sq or dx_goal * dx_goal + dy_goal * dy_goal <= radius_sq): keys_to_remove.append(key) for key in keys_to_remove: del self.cache[key] del self.insert_times[key] self.stats['invalidations'] += 1 def get_hit_rate(self): """Calculate cache hit rate""" total = self.stats['hits'] + self.stats['misses'] if total == 0: return 0.0 return self.stats['hits'] / total def get_stats(self): """Get cache statistics""" return { **self.stats, 'size': len(self.cache), 'hit_rate': self.get_hit_rate() } # Usage path_cache = PathCache(max_size=5000, ttl_seconds=30.0) def find_path(start, goal): # Try cache first cached_path = path_cache.get(start, goal) if cached_path: return cached_path # Cache hit! 
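    # (Optional hardening, hedged sketch: if terrain can change between
    #  explicit invalidation events, re-validate a hit before trusting it,
    #  e.g. with a hypothetical is_walkable() check on the final waypoint:
    #  `if cached_path and not is_walkable(cached_path[-1]): cached_path = None`
    #  placed before the early return above.)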
    # Cache miss - calculate path
    path = expensive_pathfinding(start, goal)
    path_cache.put(start, goal, path)
    return path

# Invalidate when terrain changes
def on_building_placed(building):
    # Invalidate paths near building
    path_cache.invalidate_region(building.x, building.y, radius=100)

# Print stats periodically
def print_cache_stats():
    stats = path_cache.get_stats()
    print(f"Path Cache: {stats['size']}/{path_cache.max_size} entries, "
          f"hit rate: {stats['hit_rate']:.1%}, "
          f"{stats['hits']} hits, {stats['misses']} misses")

# Performance:
# 60% hit rate: Only 40% of requests calculate = 2.5× faster
# 80% hit rate: Only 20% of requests calculate = 5× faster
```

### Pattern 6: Job System for Parallel Work

**When to use**: Native code (C++/Rust) with embarrassingly parallel work

```cpp
#include <functional>
#include <thread>
#include <vector>
#include <queue>
#include <mutex>
#include <condition_variable>
#include <atomic>

/**
 * Job system for data-parallel work.
 *
 * Features:
 * - Worker thread pool
 * - Lock-free job submission (mostly)
 * - Wait-for-completion
 * - No shared mutable state (data parallelism)
 */
class JobSystem {
public:
    using Job = std::function<void()>;

    JobSystem(int num_workers = std::thread::hardware_concurrency()) {
        workers.reserve(num_workers);
        for (int i = 0; i < num_workers; ++i) {
            workers.emplace_back([this]() { this->worker_loop(); });
        }
    }

    ~JobSystem() {
        {
            std::unique_lock<std::mutex> lock(queue_mutex);
            shutdown = true;
        }
        queue_cv.notify_all();
        for (auto& worker : workers) {
            worker.join();
        }
    }

    // Submit single job
    void submit(Job job) {
        {
            std::unique_lock<std::mutex> lock(queue_mutex);
            job_queue.push(std::move(job));
        }
        queue_cv.notify_one();
    }

    // Submit batch of jobs and wait for all to complete
    void submit_batch_and_wait(const std::vector<Job>& jobs) {
        std::atomic<int> remaining{static_cast<int>(jobs.size())};
        std::mutex wait_mutex;
        std::condition_variable wait_cv;

        for (const auto& job : jobs) {
            submit([&, job]() {
                job();
                if (--remaining == 0) {
                    wait_cv.notify_one();
                }
            });
        }

        // Wait for all jobs to complete
        std::unique_lock<std::mutex> lock(wait_mutex);
        wait_cv.wait(lock, [&]() { return remaining == 0; });
    }

private:
    void worker_loop() {
        while (true) {
            Job job;
            {
                std::unique_lock<std::mutex> lock(queue_mutex);
                queue_cv.wait(lock, [this]() { return !job_queue.empty() || shutdown; });
                if (shutdown && job_queue.empty()) {
                    return;
                }
                job = std::move(job_queue.front());
                job_queue.pop();
            }
            job();
        }
    }

    std::vector<std::thread> workers;
    std::queue<Job> job_queue;
    std::mutex queue_mutex;
    std::condition_variable queue_cv;
    bool shutdown = false;
};

// Usage Example: Parallel position updates
struct Unit {
    float x, y;
    float vx, vy;

    void update(float dt) {
        x += vx * dt;
        y += vy * dt;
    }
};

void update_units_parallel(std::vector<Unit>& units, float dt, JobSystem& job_system) {
    const int num_workers = 8;
    const int batch_size = static_cast<int>(units.size()) / num_workers;

    std::vector<JobSystem::Job> jobs;
    for (int worker_id = 0; worker_id < num_workers; ++worker_id) {
        int start = worker_id * batch_size;
        int end = (worker_id == num_workers - 1) ? static_cast<int>(units.size())
                                                 : start + batch_size;
        jobs.push_back([&units, dt, start, end]() {
            // Each worker updates exclusive slice (no locks needed)
            for (int i = start; i < end; ++i) {
                units[i].update(dt);
            }
        });
    }

    job_system.submit_batch_and_wait(jobs);
}

// Performance: 4 cores = 2-3× speedup (accounting for overhead)
```

## Common Pitfalls

### Pitfall 1: Premature Optimization (Most Common!)
**Symptoms**: - Jumping to complex solutions (multithreading) before measuring bottleneck - Micro-optimizing (sqrt → squared distance) without profiling - Optimizing code that's 1% of frame time **Why it fails**: - You optimize the wrong thing (80% of time elsewhere) - Complex solutions add bugs without benefit - Time wasted that could go to real bottleneck **Example**: ```python # BAD: Premature micro-optimization # Replaced sqrt with squared distance (saves 0.1ms) # But vision checks are only 1% of frame time! dist_sq = dx*dx + dy*dy if dist_sq < range_sq: # Micro-optimization # ... # GOOD: Profile first, found pathfinding is 80% of frame time # Added path caching (saves 40ms!) cached_path = path_cache.get(start, goal) if cached_path: return cached_path ``` **Solution**: 1. ✅ **Profile FIRST** - measure where time is actually spent 2. ✅ **Focus on top bottleneck** (80/20 rule) 3. ✅ **Measure improvement** - validate optimization helped 4. ✅ **Repeat** - find next bottleneck **Quote**: "Premature optimization is the root of all evil" - Donald Knuth ### Pitfall 2: LOD Popping (Visual Artifacts) **Symptoms**: - Units suddenly appear/disappear at LOD boundaries - Animation quality jumps (smooth → jerky) - Players notice "fake" LOD transitions **Why it fails**: - No hysteresis: Entity at 99-101 units flip-flops between LOD 0/1 every frame - Instant transitions: LOD 0 → LOD 3 in one frame (jarring) - Distance-only: Ignores importance (player's units should always be high detail) **Example**: ```python # BAD: No hysteresis (causes popping) if distance < 100: lod = 0 else: lod = 1 # Entity at 99.5 units: LOD 0 # Entity moves to 100.5 units: LOD 1 # Entity moves to 99.5 units: LOD 0 (flicker!) # GOOD: Hysteresis + importance + blend if is_important(entity): lod = 0 # Always full detail for player units elif distance < 90: lod = 0 # Upgrade at 90 elif distance > 110: lod = 1 # Downgrade at 110 # else: keep current LOD # 20-unit buffer prevents thrashing # Blend between LOD levels over 0.5 seconds blend_factor = (time.time() - lod_transition_start) / 0.5 ``` **Solution**: 1. ✅ **Hysteresis** - different thresholds for upgrade (90) vs downgrade (110) 2. ✅ **Importance weighting** - player units, selected units always high LOD 3. ✅ **Blend transitions** - cross-fade over 0.5-1 second 4. ✅ **Time delay** - wait N seconds before downgrading LOD ### Pitfall 3: Thread Contention and Race Conditions **Symptoms**: - Crashes with "list modified during iteration" - Nondeterministic behavior (works sometimes) - Slower with multithreading than without (due to locking) **Why it fails**: - Multiple threads read/write shared mutable state (data race) - Excessive locking serializes code (defeats parallelism) - False sharing - adjacent data on same cache line thrashes **Example**: ```python # BAD: Race condition (shared mutable list) def update_unit_threaded(unit, all_units): # Thread 1 reads all_units # Thread 2 modifies all_units (adds/removes unit) # Thread 1 crashes: "list changed during iteration" for other in all_units: if collides(unit, other): all_units.remove(other) # RACE! # BAD: Excessive locking (serialized) lock = threading.Lock() def update_unit(unit): with lock: # Only 1 thread works at a time! 
unit.update() # GOOD: Data parallelism (no shared mutable state) def update_units_parallel(units, num_workers=4): batch_size = len(units) // num_workers def update_batch(start, end): # Exclusive ownership - no locks needed for i in range(start, end): units[i].update() # Only modifies units[i] with ThreadPoolExecutor(max_workers=num_workers) as executor: futures = [] for worker_id in range(num_workers): start = worker_id * batch_size end = start + batch_size if worker_id < num_workers - 1 else len(units) futures.append(executor.submit(update_batch, start, end)) # Wait for all for future in futures: future.result() ``` **Solution**: 1. ✅ **Avoid shared mutable state** - each thread owns exclusive data 2. ✅ **Read-only sharing** - threads can read shared data if no writes 3. ✅ **Message passing** - communicate via queues instead of shared memory 4. ✅ **Lock-free algorithms** - atomic operations, compare-and-swap 5. ✅ **Test with thread sanitizer** - detects data races ### Pitfall 4: Cache Invalidation Bugs **Symptoms**: - Units walk through walls (stale paths cached) - Memory leak (cache grows unbounded) - Crashes after long play sessions (out of memory) **Why it fails**: - No invalidation: Cache never updates when terrain changes - No TTL: Old paths stay forever, become invalid - No eviction: Cache grows until memory exhausted **Example**: ```python # BAD: No invalidation, no TTL, unbounded growth cache = {} def get_path(start, goal): key = (start, goal) if key in cache: return cache[key] # May be stale! path = pathfind(start, goal) cache[key] = path # Cache grows forever! return path # Building placed, but cached paths not invalidated def place_building(x, y): buildings.append(Building(x, y)) # BUG: Paths through this area still cached! # GOOD: LRU + TTL + invalidation cache = PathCache(max_size=5000, ttl_seconds=30.0) def get_path(start, goal): cached = cache.get(start, goal) if cached: return cached path = pathfind(start, goal) cache.put(start, goal, path) return path def place_building(x, y): buildings.append(Building(x, y)) cache.invalidate_region(x, y, radius=100) # Clear affected paths ``` **Solution**: 1. ✅ **TTL (time-to-live)** - expire entries after N seconds 2. ✅ **Event-based invalidation** - clear cache when terrain changes 3. ✅ **LRU eviction** - remove least recently used when full 4. ✅ **Bounded size** - set max_size to prevent unbounded growth ### Pitfall 5: Forgetting to Rebuild Spatial Grid **Symptoms**: - Units see enemies that are no longer there - Collision detection misses fast-moving objects - Query results are stale (from previous frame) **Why it fails**: - Entities move every frame, but grid not rebuilt - Grid contains stale positions **Example**: ```python # BAD: Grid built once, never updated spatial_grid = SpatialHashGrid(cell_size=100) for unit in units: spatial_grid.insert(unit) def game_loop(): # Units move for unit in units: unit.x += unit.vx * dt unit.y += unit.vy * dt # Query stale grid (positions from frame 0!) enemies = spatial_grid.query_radius(player.x, player.y, 50) # GOOD: Rebuild grid every frame def game_loop(): # Move units for unit in units: unit.x += unit.vx * dt unit.y += unit.vy * dt # Rebuild spatial grid (fast: O(n)) spatial_grid.clear() for unit in units: spatial_grid.insert(unit) # Query with current positions enemies = spatial_grid.query_radius(player.x, player.y, 50) ``` **Solution**: 1. ✅ **Rebuild every frame** - spatial_grid.clear() + insert all entities 2. ✅ **Or use dynamic structure** - quadtree with update() method 3. 
### Pitfall 5: Forgetting to Rebuild the Spatial Grid

**Symptoms**:
- Units see enemies that are no longer there
- Collision detection misses fast-moving objects
- Query results are stale (from the previous frame)

**Why it fails**:
- Entities move every frame, but the grid is not rebuilt
- The grid contains stale positions

**Example**:
```python
# BAD: Grid built once, never updated
spatial_grid = SpatialHashGrid(cell_size=100)
for unit in units:
    spatial_grid.insert(unit)

def game_loop():
    # Units move
    for unit in units:
        unit.x += unit.vx * dt
        unit.y += unit.vy * dt
    # Query the stale grid (positions from frame 0!)
    enemies = spatial_grid.query_radius(player.x, player.y, 50)

# GOOD: Rebuild the grid every frame
def game_loop():
    # Move units
    for unit in units:
        unit.x += unit.vx * dt
        unit.y += unit.vy * dt

    # Rebuild the spatial grid (fast: O(n))
    spatial_grid.clear()
    for unit in units:
        spatial_grid.insert(unit)

    # Query with current positions
    enemies = spatial_grid.query_radius(player.x, player.y, 50)
```

**Solution**:
1. ✅ **Rebuild every frame** - spatial_grid.clear() + insert all entities
2. ✅ **Or use a dynamic structure** - quadtree with an update() method
3. ✅ **Profile rebuild cost** - should be < 1ms for 1000 entities

### Pitfall 6: Optimization Without Validation

**Symptoms**:
- "Optimized" code runs slower
- A new bottleneck is created elsewhere
- Unsure if the optimization helped

**Why it fails**:
- No before/after measurements
- The optimization moved the bottleneck to a different system
- Assumptions about cost were wrong

**Example**:
```python
# BAD: No measurement
def optimize_pathfinding():
    # Made some changes...
    # Hope it's faster?
    pass

# GOOD: Measure before and after
def optimize_pathfinding():
    # Measure baseline
    t0 = time.perf_counter()
    for i in range(100):
        path = pathfind(start, goal)
    baseline_ms = (time.perf_counter() - t0) * 1000
    print(f"Baseline: {baseline_ms:.2f}ms for 100 paths")

    # Apply optimization...
    add_path_caching()

    # Measure improvement
    t0 = time.perf_counter()
    for i in range(100):
        path = pathfind(start, goal)
    optimized_ms = (time.perf_counter() - t0) * 1000
    print(f"Optimized: {optimized_ms:.2f}ms for 100 paths")

    speedup = baseline_ms / optimized_ms
    print(f"Speedup: {speedup:.1f}×")

# Baseline:  500ms for 100 paths
# Optimized: 200ms for 100 paths
# Speedup:   2.5×
```

**Solution**:
1. ✅ **Measure baseline** before optimization
2. ✅ **Measure improvement** after optimization
3. ✅ **Calculate speedup** - validate that it helped
4. ✅ **Re-profile** - check for new bottlenecks
5. ✅ **Regression test** - ensure gameplay still works

### Pitfall 7: Ignoring Amdahl's Law (Diminishing Returns)

**Concept**: Speedup is limited by the serial portion of the code.

**Amdahl's Law**: `Speedup = 1 / ((1 - P) + P/N)`
- P = portion that can be parallelized (e.g., 0.75 = 75%)
- N = number of cores (e.g., 4)

**Example**:
- 75% of the code is parallelizable, 4 cores
- Speedup = 1 / ((1 - 0.75) + 0.75/4) = 1 / (0.25 + 0.1875) = 2.29×
- **Not 4×!** The serial portion limits speedup

**Why it matters**:
- Multithreading has diminishing returns
- Focus on parallelizing the largest portions first
- Some tasks can't be parallelized (Amdahl's law ceiling)

**Solution**:
1. ✅ **Parallelize the largest bottleneck** first (maximize P)
2. ✅ **Set realistic expectations** (2-3× on 4 cores, not 4×)
3. ✅ **Measure actual speedup** - compare to the theoretical maximum (see the helper below)
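The arithmetic above is worth wrapping in a helper so you can sanity-check expectations for your own core count; `amdahl_speedup` is an illustrative name, not an existing library function.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Theoretical speedup for a given parallelizable fraction and core count."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)

# Reproduces the example above: 75% parallel on 4 cores ≈ 2.29×
print(f"{amdahl_speedup(0.75, 4):.2f}x")

# The ceiling as cores grow is 1 / (1 - P): a 75% parallel workload
# can never exceed 4× total speedup, no matter how many cores you add.
for n in (2, 4, 8, 16, 1_000_000):
    print(n, f"{amdahl_speedup(0.75, n):.2f}x")
```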
### Pitfall 8: Sorting Every Frame (Expensive!)

**Symptoms**:
- 3-5ms spent sorting units by distance
- Sorting is the top function in the profiler

**Why it fails**:
- An O(n log n) sort is expensive for large N
- Entity distances change slowly (an exact sort every frame is unnecessary)

**Example**:
```python
# BAD: Full sort every frame
def update():
    # O(n log n) = 1000 × log(1000) ≈ 10,000 operations
    units_sorted = sorted(units, key=lambda u: distance_to_camera(u))
    # Update closest units
    for unit in units_sorted[:100]:
        unit.update()

# GOOD: Sort every N frames, or use an approximate sort
def update():
    global units_sorted
    # Re-sort every 10 frames only
    if frame_count % 10 == 0:
        units_sorted = sorted(units, key=lambda u: distance_to_camera(u))
    # Use the slightly stale sort (good enough!)
    for unit in units_sorted[:100]:
        unit.update()

# BETTER: Use spatial partitioning (no global sort needed!)
def update():
    # Query only the entities near the camera
    nearby_units = spatial_grid.query_radius(camera.x, camera.y, radius=200)
    # Update nearby units
    for unit in nearby_units:
        unit.update()
```

**Solution**:
1. ✅ **Sort less frequently** - every 5-10 frames is fine
2. ✅ **Approximate sort** - bucketing instead of an exact sort
3. ✅ **Spatial queries** - avoid sorting entirely (use a grid/quadtree)

## Real-World Examples

### Example 1: Unity DOTS (Data-Oriented Technology Stack)

**What it is**: Unity's high-performance ECS (Entity Component System) architecture

**Key optimizations**:
1. **Struct of Arrays (SoA)** - Components stored in contiguous arrays
   - Traditional: `List<T>` of managed components scattered in memory
   - DOTS: `NativeArray<T>` component buffers - cache-friendly
   - Result: 1.5-3× faster for batch operations
2. **Job System** - Data parallelism across CPU cores
   - Each job processes an exclusive slice of entities
   - No locks (data ownership model)
   - Result: 2-4× speedup on 4-8 core CPUs
3. **Burst Compiler** - LLVM-based code generation
   - Generates SIMD instructions (AVX2, SSE)
   - Removes bounds checks, optimizes math
   - Result: 2-10× faster than standard C#

**Performance**: 10,000 entities at 60 FPS (vs 1,000 in traditional Unity)

**When to use**:
- ✅ 1000+ entities needing updates
- ✅ Batch operations (position updates, physics, AI)
- ✅ Performance-critical simulations

**When NOT to use**:
- ❌ Small entity counts (< 100)
- ❌ Gameplay prototyping (ECS is complex)
- ❌ Unique entities with lots of one-off logic

### Example 2: Supreme Commander (RTS with 1000+ Units)

**Challenge**: Support 1000+ units in RTS battles at 30-60 FPS

**Optimizations**:
1. **Flow Fields for Pathfinding** (see the sketch after this example)
   - Pre-compute a direction field from the goal
   - Each unit follows the field (O(1) per unit)
   - Alternative to running A* per unit (O(n log n) each)
   - Result: 100× faster pathfinding for groups
2. **LOD for Unit AI**
   - LOD 0 (< 50 units from camera): Full behavior tree
   - LOD 1 (50-100 units): Simplified FSM
   - LOD 2 (100+ units): Scripted behavior
   - Result: 3-5× fewer AI updates per frame
3. **Spatial Partitioning for Weapons**
   - Grid-based broad phase for weapon targeting
   - Only check units in cells within weapon range
   - Result: O(n²) → O(n) for combat calculations
4. **Time-Sliced Simulation**
   - Economy updates: every 10 frames
   - Unit production: every 5 frames
   - Visual effects: based on distance LOD
   - Result: Consistent frame rate under load

**Performance**: 1000 units at 30 FPS, 500 units at 60 FPS

**Lessons**:
- Flow fields > A* for large unit groups
- LOD is critical for maintaining frame rate at scale
- Spatial partitioning is non-negotiable for 1000+ units
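Flow fields are easy to prototype: one breadth-first search outward from the goal labels every walkable cell with the step that leads back toward the goal, and each unit then reads its cell's direction in O(1). The sketch below is a simplified grid version for illustration only (not Supreme Commander's implementation); it assumes `grid[y][x]` holds 0 for walkable and 1 for blocked.

```python
from collections import deque

def build_flow_field(grid, goal):
    """Map each reachable (x, y) cell to the (dx, dy) step that moves
    one cell closer to `goal`. One BFS serves every unit heading there."""
    h, w = len(grid), len(grid[0])
    flow = {goal: (0, 0)}
    frontier = deque([goal])
    while frontier:
        cx, cy = frontier.popleft()
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = cx + dx, cy + dy
            if 0 <= nx < w and 0 <= ny < h and grid[ny][nx] == 0 and (nx, ny) not in flow:
                flow[(nx, ny)] = (-dx, -dy)   # Step back toward the cell we came from
                frontier.append((nx, ny))
    return flow

def step_unit(unit_pos, flow):
    """O(1) per unit: follow the precomputed field (stay put if unreachable)."""
    dx, dy = flow.get(unit_pos, (0, 0))
    return (unit_pos[0] + dx, unit_pos[1] + dy)

# Usage: one field per destination, shared by every unit ordered there
grid = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0]]
field = build_flow_field(grid, goal=(3, 0))
print(step_unit((0, 2), field))   # Moves one cell closer to (3, 0)
```

Because a single field is shared by the whole group, the per-unit cost is a dictionary lookup, which is why group moves become so much cheaper than running A* per unit.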
### Example 3: Total War (20,000+ Soldiers in Battles)

**Challenge**: Render and simulate 20,000 individual soldiers at 30-60 FPS

**Optimizations**:
1. **Hierarchical LOD**
   - LOD 0 (< 20m): Full skeleton, detailed model
   - LOD 1 (20-50m): Reduced skeleton, simpler model
   - LOD 2 (50-100m): Impostor (textured quad)
   - LOD 3 (100m+): Single pixel or culled
   - Result: 10× fewer vertices rendered
2. **Formation-Based AI**
   - Units in a formation share a single pathfinding result
   - Individual units offset from the formation center
   - Result: 100× fewer pathfinding calculations
3. **Batched Rendering**
   - Instanced rendering for identical soldiers
   - 1 draw call for 100 soldiers (vs 100 draw calls)
   - Result: 10× fewer draw calls
4. **Simplified Physics**
   - Full physics for nearby units (< 20m)
   - Ragdolls for deaths near the camera
   - Simplified collision for distant units
   - Result: 5× fewer physics calculations

**Performance**: 20,000 units at 30-60 FPS (depending on settings)

**Lessons**:
- Visual LOD is as important as simulation LOD
- Formation-based AI avoids redundant pathfinding
- Instanced rendering is critical for large unit counts

### Example 4: Cities Skylines (Traffic Simulation)

**Challenge**: Simulate 10,000+ vehicles with realistic traffic at 30 FPS

**Optimizations**:
1. **Hierarchical Pathfinding**
   - Highway network → arterial roads → local streets
   - Pre-compute high-level paths, refine locally
   - Result: 20× faster pathfinding for long routes
2. **Path Caching**
   - Common routes cached (home → work, work → home)
   - 60-80% cache hit rate
   - Result: 2.5-5× fewer pathfinding calculations
3. **Dynamic Cost Adjustment**
   - Road segments track vehicle density
   - Congested roads have a higher pathfinding cost
   - Vehicles reroute around congestion
   - Result: Emergent traffic patterns
4. **Despawn Distant Vehicles**
   - Vehicles > 500m from the camera are despawned
   - Statistics tracked, respawn when relevant
   - Result: Effective vehicle count reduced 50%

**Performance**: 10,000 active vehicles at 30 FPS

**Lessons**:
- Hierarchical pathfinding is essential for city-scale maps
- Path caching provides huge wins (60%+ hit rate is common)
- Despawning off-screen entities maintains performance

### Example 5: Factorio (Mega-Factory Optimization)

**Challenge**: Simulate 100,000+ entities (belts, inserters, assemblers) at 60 FPS

**Optimizations**:
1. **Update Skipping** (see the sketch after this example)
   - Idle machines don't update (no input/output)
   - The active set is typically 10-20% of total entities
   - Result: 5-10× fewer updates per tick
2. **Chunk-Based Simulation**
   - World divided into 32×32 tile chunks
   - Inactive chunks (no player nearby) update less often
   - Result: Effective world size reduced 80%
3. **Belt Optimization**
   - Items on belts compressed into contiguous arrays
   - Lane-based updates (not per-item)
   - Result: 10× faster belt simulation
4. **Electrical Network Caching**
   - Power grid solved once, cached until the topology changes
   - Only recalculate when the grid is modified
   - Result: 100× fewer electrical calculations

**Performance**: 60 FPS with 100,000+ entities (in optimized factories)

**Lessons**:
- Update skipping (sleeping entities) provides huge wins
- Chunk-based simulation scales to massive worlds
- Cache static calculations (power grid, fluid networks)
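Update skipping can be retrofitted onto most update loops: entities that report they have nothing to do drop out of an active set and are only woken by events (an item arrives, power is restored). The sketch below is an illustrative pattern rather than Factorio's actual code; `Machine`, `has_work()`, and `wake()` are hypothetical names.

```python
class Machine:
    """Hypothetical entity with a cheap 'do I have anything to do?' check."""
    def __init__(self, name):
        self.name = name
        self.pending_input = 0

    def has_work(self):
        return self.pending_input > 0

    def update(self, dt):
        self.pending_input -= 1      # Consume one item per tick (placeholder logic)

class Simulation:
    def __init__(self, machines):
        self.machines = machines
        self.active = {m for m in machines if m.has_work()}  # Idle machines stay asleep

    def wake(self, machine):
        """Called by events (item inserted, power restored) - never by polling."""
        self.active.add(machine)

    def tick(self, dt):
        # Only the active set is touched; sleeping machines cost nothing this tick
        finished = []
        for machine in self.active:
            machine.update(dt)
            if not machine.has_work():
                finished.append(machine)
        for machine in finished:
            self.active.remove(machine)   # Put idle machines back to sleep
```

With the 10-20% active ratio quoted above, the per-tick cost falls roughly in proportion; correctness then hinges on never missing a `wake()` call when an idle machine's inputs change.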
## Cross-References

### Within Bravos/Simulation-Tactics

**This skill applies to ALL other simulation skills**:
- **traffic-and-pathfinding** ← Optimize pathfinding with caching, time-slicing
- **ai-and-agent-simulation** ← LOD for AI, time-sliced behavior trees
- **physics-simulation-patterns** ← Spatial partitioning for collision, broad-phase
- **ecosystem-simulation** ← LOD for distant populations, time-sliced updates
- **weather-and-time** ← Particle budgets, LOD for effects
- **economic-simulation-patterns** ← Time-slicing for economy updates

**Related skills in this skillpack**:
- **spatial-partitioning** (planned) - Deep dive into quadtrees, octrees, grids
- **ecs-architecture** (planned) - Data-oriented design, component systems

### External Skillpacks

**Yzmir/Performance-Optimization** (if it exists):
- Profiling tools and methodology
- Memory optimization (pooling, allocators)
- Cache optimization (data layouts)

**Yzmir/Algorithms-and-Data-Structures** (if it exists):
- Spatial data structures (quadtree, k-d tree, BVH)
- Priority queues (for time-slicing)
- LRU cache implementation

**Axiom/Game-Engine-Patterns** (if it exists):
- Update loop patterns
- Frame time management
- Object pooling

## Testing Checklist

Use this checklist to verify that optimization is complete and correct:

### 1. Profiling
- [ ] Captured baseline performance (frame time, FPS)
- [ ] Identified top 3-5 bottlenecks (80% of time)
- [ ] Understood WHY each bottleneck is slow (algorithm, data, cache)
- [ ] Documented baseline metrics for comparison

### 2. Algorithmic Optimization
- [ ] Checked for O(n²) algorithms (proximity queries, collisions)
- [ ] Applied spatial partitioning where appropriate (grid, quadtree)
- [ ] Validated that spatial queries return correct results
- [ ] Measured improvement (should be 10-100×)

### 3. Level of Detail (LOD)
- [ ] Defined LOD levels (typically 4: full, reduced, minimal, culled)
- [ ] Implemented distance-based LOD assignment
- [ ] Added importance weighting (player units, selected units)
- [ ] Implemented hysteresis to prevent LOD thrashing
- [ ] Verified no visual popping artifacts
- [ ] Measured improvement (should be 2-10×)

### 4. Time-Slicing
- [ ] Set a frame time budget per system (e.g., 2ms for pathfinding)
- [ ] Implemented a priority queue (important work first)
- [ ] Verified the budget is respected (doesn't exceed the limit)
- [ ] Checked that high-priority work completes quickly
- [ ] Measured improvement (should be 2-5×)

### 5. Caching
- [ ] Identified redundant calculations to cache
- [ ] Implemented a cache with LRU eviction
- [ ] Added TTL (time-to-live) expiration
- [ ] Implemented invalidation triggers (terrain changes, etc.)
- [ ] Verified cache hit rate (aim for 60-80%)
- [ ] Checked for stale-data bugs (units walking through walls)
- [ ] Measured improvement (should be 2-10×)

### 6. Data-Oriented Design (if applicable)
- [ ] Identified batch operations on many entities
- [ ] Converted AoS → SoA for hot data
- [ ] Verified the memory layout is cache-friendly
- [ ] Measured improvement (should be 1.5-3×)
### 7. Multithreading (if needed)
- [ ] Verified all simpler optimizations were done first
- [ ] Identified embarrassingly parallel work
- [ ] Implemented a job system or data parallelism
- [ ] Verified no race conditions (test with a thread sanitizer)
- [ ] Checked that the performance gain justifies the complexity
- [ ] Measured improvement (should be 1.5-4×)

### 8. Validation
- [ ] Met target frame rate (60 FPS or 30 FPS)
- [ ] Verified no gameplay regressions (units behave correctly)
- [ ] Checked no visual artifacts (LOD popping, etc.)
- [ ] Tested at target entity count (e.g., 1000 units)
- [ ] Tested edge cases (10,000 units, worst-case scenarios)
- [ ] Documented final performance metrics
- [ ] Calculated total speedup (baseline → optimized)

### 9. Before/After Comparison

| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| **Frame Time** | ___ms | ___ms | ___× faster |
| **FPS** | ___ | ___ | ___ |
| **Bottleneck System Time** | ___ms | ___ms | ___× faster |
| **Entity Count (at target FPS)** | ___ | ___ | ___× more |
| **Memory Usage** | ___MB | ___MB | ___ |

### 10. Regression Tests
- [ ] Units still path correctly (no walking through walls)
- [ ] AI behavior unchanged (same decisions)
- [ ] Combat calculations correct (same damage)
- [ ] No crashes or exceptions
- [ ] No memory leaks (long play-session test)
- [ ] Deterministic results (same input → same output)

**Remember**:
1. **Profile FIRST** - measure before guessing
2. **Algorithmic optimization** provides the biggest wins (10-100×)
3. **LOD and time-slicing** are essential for 1000+ entities
4. **Multithreading is a LAST resort** - the complexity cost is high
5. **Validate improvement** - measure before/after, check for regressions

**Success criteria**: Target frame rate achieved (60 FPS) with the desired entity count (1000+) and no gameplay compromises.