# RL Evaluation: Rigorous Methodology for Agent Assessment RL evaluation is uniquely challenging due to high variance, temporal instability, environment overfitting, and sample efficiency considerations. Without rigorous methodology, you will: - Draw conclusions from statistical noise - Report results that don't generalize - Deploy agents that fail in production - Waste resources on false improvements This skill provides systematic evaluation protocols that ensure statistical validity, generalization measurement, and deployment-ready assessment. ## When to Use This Skill Use this skill when: - ✅ Evaluating RL agent performance - ✅ Comparing multiple RL algorithms - ✅ Reporting results for publication or deployment - ✅ Making algorithm selection decisions - ✅ Assessing readiness for production deployment - ✅ Debugging training (need accurate performance estimates) DO NOT use for: - ❌ Quick sanity checks during development (use informal evaluation) - ❌ Monitoring training progress (use running averages) - ❌ Initial hyperparameter sweeps (use coarse evaluation) **When in doubt:** If the evaluation result will inform a decision (publish, deploy, choose algorithm), use this skill. ## Core Principles ### Principle 1: Statistical Rigor is Non-Negotiable **Reality:** RL has inherently high variance. Single runs are meaningless. **Enforcement:** - Minimum 5-10 random seeds for any performance claim - Report mean ± std or 95% confidence intervals - Statistical significance testing when comparing algorithms - Never report single-seed results as representative ### Principle 2: Train/Test Discipline Prevents Overfitting **Reality:** Agents exploit environment quirks. Training performance ≠ generalization. **Enforcement:** - Separate train/test environment instances - Different random seeds for train/eval - Test on distribution shifts (new instances, physics, appearances) - Report both training and generalization performance ### Principle 3: Sample Efficiency Matters **Reality:** Final performance ignores cost. Samples are often expensive. **Enforcement:** - Report sample efficiency curves (reward vs steps) - Include "reward at X steps" for multiple budgets - Consider deployment constraints - Compare at SAME sample budget, not just asymptotic ### Principle 4: Evaluation Mode Must Match Deployment **Reality:** Stochastic vs deterministic evaluation changes results by 10-30%. **Enforcement:** - Specify evaluation mode (stochastic/deterministic) - Match evaluation to deployment scenario - Report both if ambiguous - Explain choice in methodology ### Principle 5: Offline RL Requires Special Care **Reality:** Cannot accurately evaluate offline RL without online rollouts. **Enforcement:** - Acknowledge evaluation limitations - Use conservative metrics (in-distribution performance) - Quantify uncertainty - Staged deployment (offline → small online trial → full) ## Statistical Evaluation Protocol ### Multi-Seed Evaluation (MANDATORY) **Minimum Requirements:** - **Exploration/research**: 5-10 seeds minimum - **Publication**: 10-20 seeds - **Production deployment**: 20-50 seeds (depending on variance) **Protocol:** ```python import numpy as np from scipy import stats def evaluate_multi_seed(algorithm, env_name, seeds, total_steps): """ Evaluate algorithm across multiple random seeds. Args: algorithm: RL algorithm class env_name: Environment name seeds: List of random seeds total_steps: Training steps per seed Returns: Dictionary with statistics """ final_rewards = [] sample_efficiency_curves = [] for seed in seeds: # Train agent env = gym.make(env_name, seed=seed) agent = algorithm(env, seed=seed) # Track performance during training eval_points = np.linspace(0, total_steps, num=20, dtype=int) curve = [] for step in eval_points: agent.train(steps=step) reward = evaluate_deterministic(agent, env, episodes=10) curve.append((step, reward)) sample_efficiency_curves.append(curve) final_rewards.append(curve[-1][1]) # Final performance final_rewards = np.array(final_rewards) return { 'mean': np.mean(final_rewards), 'std': np.std(final_rewards), 'median': np.median(final_rewards), 'min': np.min(final_rewards), 'max': np.max(final_rewards), 'iqr': (np.percentile(final_rewards, 75) - np.percentile(final_rewards, 25)), 'confidence_interval_95': stats.t.interval( 0.95, len(final_rewards) - 1, loc=np.mean(final_rewards), scale=stats.sem(final_rewards) ), 'all_seeds': final_rewards, 'curves': sample_efficiency_curves } # Usage results = evaluate_multi_seed( algorithm=PPO, env_name="HalfCheetah-v3", seeds=range(10), # 10 seeds total_steps=1_000_000 ) print(f"Performance: {results['mean']:.1f} ± {results['std']:.1f}") print(f"95% CI: [{results['confidence_interval_95'][0]:.1f}, " f"{results['confidence_interval_95'][1]:.1f}]") print(f"Median: {results['median']:.1f}") print(f"Range: [{results['min']:.1f}, {results['max']:.1f}]") ``` **Reporting Template:** ``` Algorithm: PPO Environment: HalfCheetah-v3 Seeds: 10 Total Steps: 1M Final Performance: - Mean: 4,523 ± 387 - Median: 4,612 - 95% CI: [4,246, 4,800] - Range: [3,812, 5,201] Sample Efficiency: - Reward at 100k steps: 1,234 ± 156 - Reward at 500k steps: 3,456 ± 289 - Reward at 1M steps: 4,523 ± 387 ``` ### Statistical Significance Testing **When comparing algorithms:** ```python def compare_algorithms(results_A, results_B, alpha=0.05): """ Compare two algorithms with statistical rigor. Args: results_A: Array of final rewards for algorithm A (multiple seeds) results_B: Array of final rewards for algorithm B (multiple seeds) alpha: Significance level (default 0.05) Returns: Dictionary with comparison statistics """ # T-test for difference in means t_statistic, p_value = stats.ttest_ind(results_A, results_B) # Effect size (Cohen's d) pooled_std = np.sqrt((np.std(results_A)**2 + np.std(results_B)**2) / 2) cohens_d = (np.mean(results_A) - np.mean(results_B)) / pooled_std # Bootstrap confidence interval for difference def bootstrap_diff(n_bootstrap=10000): diffs = [] for _ in range(n_bootstrap): sample_A = np.random.choice(results_A, size=len(results_A)) sample_B = np.random.choice(results_B, size=len(results_B)) diffs.append(np.mean(sample_A) - np.mean(sample_B)) return np.percentile(diffs, [2.5, 97.5]) ci_diff = bootstrap_diff() return { 'mean_A': np.mean(results_A), 'mean_B': np.mean(results_B), 'difference': np.mean(results_A) - np.mean(results_B), 'p_value': p_value, 'significant': p_value < alpha, 'cohens_d': cohens_d, 'ci_difference': ci_diff, 'conclusion': ( f"Algorithm A is {'significantly' if p_value < alpha else 'NOT significantly'} " f"better than B (p={p_value:.4f})" ) } # Usage ppo_results = np.array([4523, 4612, 4201, 4789, 4456, 4390, 4678, 4234, 4567, 4498]) sac_results = np.array([4678, 4890, 4567, 4923, 4712, 4645, 4801, 4556, 4734, 4689]) comparison = compare_algorithms(ppo_results, sac_results) print(comparison['conclusion']) print(f"Effect size (Cohen's d): {comparison['cohens_d']:.3f}") print(f"95% CI for difference: [{comparison['ci_difference'][0]:.1f}, " f"{comparison['ci_difference'][1]:.1f}]") ``` **Interpreting Effect Size (Cohen's d):** - d < 0.2: Negligible difference - 0.2 ≤ d < 0.5: Small effect - 0.5 ≤ d < 0.8: Medium effect - d ≥ 0.8: Large effect **Red Flag:** If p-value < 0.05 but Cohen's d < 0.2, the difference is statistically significant but practically negligible. Don't claim "better" without practical significance. ### Power Analysis: How Many Seeds Needed? ```python def required_seeds_for_precision(std_estimate, mean_estimate, desired_precision=0.1, confidence=0.95): """ Calculate number of seeds needed for desired precision. Args: std_estimate: Estimated standard deviation (from pilot runs) mean_estimate: Estimated mean performance desired_precision: Desired precision as fraction of mean (0.1 = ±10%) confidence: Confidence level (0.95 = 95% CI) Returns: Required number of seeds """ # Z-score for confidence level z = stats.norm.ppf(1 - (1 - confidence) / 2) # Desired margin of error margin = desired_precision * mean_estimate # Required sample size n = (z * std_estimate / margin) ** 2 return int(np.ceil(n)) # Example: You ran 3 pilot seeds pilot_results = [4500, 4200, 4700] std_est = np.std(pilot_results) # 250 mean_est = np.mean(pilot_results) # 4467 # How many seeds for ±10% precision at 95% confidence? n_required = required_seeds_for_precision(std_est, mean_est, desired_precision=0.1) print(f"Need {n_required} seeds for ±10% precision") # ~12 seeds # How many for ±5% precision? n_tight = required_seeds_for_precision(std_est, mean_est, desired_precision=0.05) print(f"Need {n_tight} seeds for ±5% precision") # ~47 seeds ``` **Practical Guidelines:** - Quick comparison: 5 seeds (±20% precision) - Standard evaluation: 10 seeds (±10% precision) - Publication: 20 seeds (±7% precision) - Production deployment: 50+ seeds (±5% precision) ## Train/Test Discipline ### Environment Instance Separation **CRITICAL:** Never evaluate on the same environment instances used for training. ```python # WRONG: Single environment for both training and evaluation env = gym.make("CartPole-v1", seed=42) agent.train(env) performance = evaluate(agent, env) # BIASED! # CORRECT: Separate environments train_env = gym.make("CartPole-v1", seed=42) eval_env = gym.make("CartPole-v1", seed=999) # Different seed agent.train(train_env) performance = evaluate(agent, eval_env) # Unbiased ``` ### Train/Test Split for Custom Environments **For environments with multiple instances (levels, objects, configurations):** ```python def create_train_test_split(all_instances, test_ratio=0.2, seed=42): """ Split environment instances into train and test sets. Args: all_instances: List of environment configurations test_ratio: Fraction for test set (default 0.2) seed: Random seed for reproducibility Returns: (train_instances, test_instances) """ np.random.seed(seed) n_test = int(len(all_instances) * test_ratio) indices = np.random.permutation(len(all_instances)) test_indices = indices[:n_test] train_indices = indices[n_test:] train_instances = [all_instances[i] for i in train_indices] test_instances = [all_instances[i] for i in test_indices] return train_instances, test_instances # Example: Maze environments all_mazes = [MazeLayout(seed=i) for i in range(100)] train_mazes, test_mazes = create_train_test_split(all_mazes, test_ratio=0.2) print(f"Training on {len(train_mazes)} mazes") # 80 print(f"Testing on {len(test_mazes)} mazes") # 20 # Train only on training set agent.train(train_mazes) # Evaluate on BOTH train and test (measure generalization gap) train_performance = evaluate(agent, train_mazes) test_performance = evaluate(agent, test_mazes) generalization_gap = train_performance - test_performance print(f"Train: {train_performance:.1f}") print(f"Test: {test_performance:.1f}") print(f"Generalization gap: {generalization_gap:.1f}") # Red flag: If gap > 20% of train performance, agent is overfitting if generalization_gap > 0.2 * train_performance: print("WARNING: Significant overfitting detected!") ``` ### Randomization Protocol **Ensure independent randomization for train/eval:** ```python class EvaluationProtocol: def __init__(self, env_name, train_seed=42, eval_seed=999): """ Proper train/eval environment management. Args: env_name: Gym environment name train_seed: Seed for training environment eval_seed: Seed for evaluation environment (DIFFERENT) """ self.env_name = env_name self.train_seed = train_seed self.eval_seed = eval_seed # Separate environments self.train_env = gym.make(env_name) self.train_env.seed(train_seed) self.train_env.action_space.seed(train_seed) self.train_env.observation_space.seed(train_seed) self.eval_env = gym.make(env_name) self.eval_env.seed(eval_seed) self.eval_env.action_space.seed(eval_seed) self.eval_env.observation_space.seed(eval_seed) def train_step(self, agent): """Training step on training environment.""" return agent.step(self.train_env) def evaluate(self, agent, episodes=100): """Evaluation on SEPARATE evaluation environment.""" rewards = [] for _ in range(episodes): state = self.eval_env.reset() episode_reward = 0 done = False while not done: action = agent.act_deterministic(state) state, reward, done, _ = self.eval_env.step(action) episode_reward += reward rewards.append(episode_reward) return np.mean(rewards), np.std(rewards) # Usage protocol = EvaluationProtocol("HalfCheetah-v3", train_seed=42, eval_seed=999) # Training agent = SAC() for step in range(1_000_000): protocol.train_step(agent) if step % 10_000 == 0: mean_reward, std_reward = protocol.evaluate(agent, episodes=10) print(f"Step {step}: {mean_reward:.1f} ± {std_reward:.1f}") ``` ## Sample Efficiency Metrics ### Sample Efficiency Curves **Report performance at multiple sample budgets, not just final:** ```python def compute_sample_efficiency_curve(agent_class, env_name, seed, max_steps, eval_points=20): """ Compute sample efficiency curve (reward vs steps). Args: agent_class: RL algorithm class env_name: Environment name seed: Random seed max_steps: Maximum training steps eval_points: Number of evaluation points Returns: List of (steps, reward) tuples """ env = gym.make(env_name, seed=seed) agent = agent_class(env, seed=seed) eval_steps = np.logspace(3, np.log10(max_steps), num=eval_points, dtype=int) # [1000, 1500, 2200, ..., max_steps] (logarithmic spacing) curve = [] current_step = 0 for target_step in eval_steps: # Train until target_step steps_to_train = target_step - current_step agent.train(steps=steps_to_train) current_step = target_step # Evaluate reward = evaluate_deterministic(agent, env, episodes=10) curve.append((target_step, reward)) return curve # Compare sample efficiency of multiple algorithms algorithms = [PPO, SAC, TD3] env_name = "HalfCheetah-v3" max_steps = 1_000_000 for algo in algorithms: # Average across 5 seeds all_curves = [] for seed in range(5): curve = compute_sample_efficiency_curve(algo, env_name, seed, max_steps) all_curves.append(curve) # Aggregate steps = [point[0] for point in all_curves[0]] rewards_at_step = [[curve[i][1] for curve in all_curves] for i in range(len(steps))] mean_rewards = [np.mean(rewards) for rewards in rewards_at_step] std_rewards = [np.std(rewards) for rewards in rewards_at_step] # Report at specific budgets for i, step in enumerate([100_000, 500_000, 1_000_000]): idx = steps.index(step) print(f"{algo.__name__} at {step} steps: " f"{mean_rewards[idx]:.1f} ± {std_rewards[idx]:.1f}") ``` **Sample Output:** ``` PPO at 100k steps: 1,234 ± 156 PPO at 500k steps: 3,456 ± 289 PPO at 1M steps: 4,523 ± 387 SAC at 100k steps: 891 ± 178 SAC at 500k steps: 3,789 ± 245 SAC at 1M steps: 4,912 ± 312 TD3 at 100k steps: 756 ± 134 TD3 at 500k steps: 3,234 ± 298 TD3 at 1M steps: 4,678 ± 276 ``` **Analysis:** - PPO is most sample-efficient early (1,234 at 100k) - SAC has best final performance (4,912 at 1M) - If sample budget is 100k → PPO is best choice - If sample budget is 1M → SAC is best choice ### Area Under Curve (AUC) Metric **Single metric for sample efficiency:** ```python def compute_auc(curve): """ Compute area under sample efficiency curve. Args: curve: List of (steps, reward) tuples Returns: AUC value (higher = more sample efficient) """ steps = np.array([point[0] for point in curve]) rewards = np.array([point[1] for point in curve]) # Trapezoidal integration auc = np.trapz(rewards, steps) return auc # Compare algorithms by AUC for algo in algorithms: all_aucs = [] for seed in range(5): curve = compute_sample_efficiency_curve(algo, env_name, seed, max_steps) auc = compute_auc(curve) all_aucs.append(auc) print(f"{algo.__name__} AUC: {np.mean(all_aucs):.2e} ± {np.std(all_aucs):.2e}") ``` **Note:** AUC is sensitive to evaluation point spacing. Use consistent evaluation points across algorithms. ## Generalization Testing ### Distribution Shift Evaluation **Test on environment variations to measure robustness:** ```python def evaluate_generalization(agent, env_name, shifts): """ Evaluate agent on distribution shifts. Args: agent: Trained RL agent env_name: Base environment name shifts: Dictionary of shift types and parameters Returns: Dictionary of performance on each shift """ results = {} # Baseline (no shift) baseline_env = gym.make(env_name) baseline_perf = evaluate(agent, baseline_env, episodes=50) results['baseline'] = baseline_perf # Test shifts for shift_name, shift_params in shifts.items(): shifted_env = apply_shift(env_name, shift_params) shift_perf = evaluate(agent, shifted_env, episodes=50) results[shift_name] = shift_perf # Compute degradation degradation = (baseline_perf - shift_perf) / baseline_perf results[f'{shift_name}_degradation'] = degradation return results # Example: Robotic grasping shifts = { 'lighting_dim': {'lighting_scale': 0.5}, 'lighting_bright': {'lighting_scale': 1.5}, 'camera_angle_15deg': {'camera_rotation': 15}, 'table_height_+5cm': {'table_height_offset': 0.05}, 'object_mass_+50%': {'mass_scale': 1.5}, 'object_friction_-30%': {'friction_scale': 0.7} } gen_results = evaluate_generalization(agent, "RobotGrasp-v1", shifts) print(f"Baseline: {gen_results['baseline']:.2%} success") for shift_name in shifts.keys(): perf = gen_results[shift_name] deg = gen_results[f'{shift_name}_degradation'] print(f"{shift_name}: {perf:.2%} success ({deg:.1%} degradation)") # Red flag: If any degradation > 50%, agent is brittle ``` ### Zero-Shot Transfer Evaluation **Test on completely new environments:** ```python def zero_shot_transfer(agent, train_env_name, test_env_names): """ Evaluate zero-shot transfer to related environments. Args: agent: Agent trained on train_env_name train_env_name: Training environment test_env_names: List of related test environments Returns: Transfer performance dictionary """ results = {} # Source performance source_env = gym.make(train_env_name) source_perf = evaluate(agent, source_env, episodes=50) results['source'] = source_perf # Target performances for target_env_name in test_env_names: target_env = gym.make(target_env_name) target_perf = evaluate(agent, target_env, episodes=50) results[target_env_name] = target_perf # Transfer efficiency transfer_ratio = target_perf / source_perf results[f'{target_env_name}_transfer_ratio'] = transfer_ratio return results # Example: Locomotion transfer agent_trained_on_cheetah = train(PPO, "HalfCheetah-v3") transfer_results = zero_shot_transfer( agent_trained_on_cheetah, train_env_name="HalfCheetah-v3", test_env_names=["Hopper-v3", "Walker2d-v3", "Ant-v3"] ) print(f"Source (HalfCheetah): {transfer_results['source']:.1f}") for env in ["Hopper-v3", "Walker2d-v3", "Ant-v3"]: perf = transfer_results[env] ratio = transfer_results[f'{env}_transfer_ratio'] print(f"{env}: {perf:.1f} ({ratio:.1%} of source)") ``` ### Robustness to Adversarial Perturbations **Test against worst-case scenarios:** ```python def adversarial_evaluation(agent, env, perturbation_types, perturbation_magnitudes): """ Evaluate robustness to adversarial perturbations. Args: agent: RL agent to evaluate env: Environment perturbation_types: List of perturbation types perturbation_magnitudes: List of magnitudes to test Returns: Robustness curve for each perturbation type """ results = {} for perturb_type in perturbation_types: results[perturb_type] = [] for magnitude in perturbation_magnitudes: # Apply perturbation perturbed_env = add_perturbation(env, perturb_type, magnitude) # Evaluate perf = evaluate(agent, perturbed_env, episodes=20) results[perturb_type].append((magnitude, perf)) return results # Example: Vision-based control perturbation_types = ['gaussian_noise', 'occlusion', 'brightness'] magnitudes = [0.0, 0.1, 0.2, 0.3, 0.5] robustness = adversarial_evaluation( agent, env, perturbation_types, magnitudes ) for perturb_type, curve in robustness.items(): print(f"\n{perturb_type}:") for magnitude, perf in curve: print(f" Magnitude {magnitude}: {perf:.1f} reward") ``` ## Evaluation Protocols ### Stochastic vs Deterministic Evaluation **Decision Tree:** ``` Is the policy inherently deterministic? ├─ YES (DQN, DDPG without noise) │ └─ Use deterministic evaluation └─ NO (PPO, SAC, stochastic policies) ├─ Will deployment use stochastic policy? │ ├─ YES (dialogue, exploration needed) │ │ └─ Use stochastic evaluation │ └─ NO (control, deterministic deployment) │ └─ Use deterministic evaluation └─ Unsure? └─ Report BOTH stochastic and deterministic ``` **Implementation:** ```python class EvaluationMode: @staticmethod def deterministic(agent, env, episodes=100): """ Deterministic evaluation (use mean/argmax action). """ rewards = [] for _ in range(episodes): state = env.reset() episode_reward = 0 done = False while not done: # Use mean action (no sampling) if hasattr(agent, 'act_deterministic'): action = agent.act_deterministic(state) else: action = agent.policy.mean(state) # Or argmax for discrete state, reward, done, _ = env.step(action) episode_reward += reward rewards.append(episode_reward) return np.mean(rewards), np.std(rewards) @staticmethod def stochastic(agent, env, episodes=100): """ Stochastic evaluation (sample from policy). """ rewards = [] for _ in range(episodes): state = env.reset() episode_reward = 0 done = False while not done: # Sample from policy distribution action = agent.policy.sample(state) state, reward, done, _ = env.step(action) episode_reward += reward rewards.append(episode_reward) return np.mean(rewards), np.std(rewards) @staticmethod def report_both(agent, env, episodes=100): """ Report both evaluation modes for transparency. """ det_mean, det_std = EvaluationMode.deterministic(agent, env, episodes) sto_mean, sto_std = EvaluationMode.stochastic(agent, env, episodes) return { 'deterministic': {'mean': det_mean, 'std': det_std}, 'stochastic': {'mean': sto_mean, 'std': sto_std}, 'difference': det_mean - sto_mean } # Usage sac_agent = SAC(env) sac_agent.train(steps=1_000_000) eval_results = EvaluationMode.report_both(sac_agent, env, episodes=100) print(f"Deterministic: {eval_results['deterministic']['mean']:.1f} " f"± {eval_results['deterministic']['std']:.1f}") print(f"Stochastic: {eval_results['stochastic']['mean']:.1f} " f"± {eval_results['stochastic']['std']:.1f}") print(f"Difference: {eval_results['difference']:.1f}") ``` **Interpretation:** - If difference < 5% of mean: Evaluation mode doesn't matter much - If difference > 15% of mean: Evaluation mode significantly affects results - Must clearly specify which mode used - Ensure fair comparison across algorithms (same mode) ### Episode Count Selection **How many evaluation episodes needed?** ```python def required_eval_episodes(env, agent, desired_sem, max_episodes=1000): """ Determine number of evaluation episodes for desired standard error. Args: env: Environment agent: Agent to evaluate desired_sem: Desired standard error of mean max_episodes: Maximum episodes to test Returns: Required number of episodes """ # Run initial episodes to estimate variance initial_episodes = min(20, max_episodes) rewards = [] for _ in range(initial_episodes): state = env.reset() episode_reward = 0 done = False while not done: action = agent.act_deterministic(state) state, reward, done, _ = env.step(action) episode_reward += reward rewards.append(episode_reward) # Estimate standard deviation std_estimate = np.std(rewards) # Required episodes: n = (std / desired_sem)^2 required = int(np.ceil((std_estimate / desired_sem) ** 2)) return min(required, max_episodes) # Usage agent = PPO(env) agent.train(steps=1_000_000) # Want standard error < 10 reward units n_episodes = required_eval_episodes(env, agent, desired_sem=10) print(f"Need {n_episodes} episodes for SEM < 10") # Evaluate with required episodes final_eval = evaluate(agent, env, episodes=n_episodes) ``` **Rule of Thumb:** - Quick check: 10 episodes - Standard evaluation: 50-100 episodes - Publication/deployment: 100-200 episodes - High-variance environments: 500+ episodes ### Evaluation Frequency During Training **How often to evaluate during training?** ```python def adaptive_evaluation_schedule(total_steps, early_freq=1000, late_freq=10000, transition_step=100000): """ Create adaptive evaluation schedule. Early training: Frequent evaluations (detect divergence early) Late training: Infrequent evaluations (policy more stable) Args: total_steps: Total training steps early_freq: Evaluation frequency in early training late_freq: Evaluation frequency in late training transition_step: Step to transition from early to late Returns: List of evaluation timesteps """ eval_steps = [] # Early phase current_step = 0 while current_step < transition_step: eval_steps.append(current_step) current_step += early_freq # Late phase while current_step < total_steps: eval_steps.append(current_step) current_step += late_freq # Always evaluate at end if eval_steps[-1] != total_steps: eval_steps.append(total_steps) return eval_steps # Usage schedule = adaptive_evaluation_schedule( total_steps=1_000_000, early_freq=1_000, # Every 1k steps for first 100k late_freq=10_000, # Every 10k steps after 100k transition_step=100_000 ) print(f"Total evaluations: {len(schedule)}") print(f"First 10 eval steps: {schedule[:10]}") print(f"Last 10 eval steps: {schedule[-10:]}") # Training loop agent = PPO(env) for step in range(1_000_000): agent.train_step() if step in schedule: eval_perf = evaluate(agent, eval_env, episodes=10) log(step, eval_perf) ``` **Guidelines:** - Evaluation is expensive (10-100 episodes × episode length) - Early training: Evaluate frequently to detect divergence - Late training: Evaluate less frequently (policy stabilizes) - Don't evaluate every step (wastes compute) - Save checkpoints at evaluation steps (for later analysis) ## Offline RL Evaluation ### The Offline RL Evaluation Problem **CRITICAL:** You cannot accurately evaluate offline RL policies without online rollouts. **Why:** - Learned Q-values are only accurate for data distribution - Policy wants to visit out-of-distribution states - Q-values for OOD states are extrapolated (unreliable) - Dataset doesn't contain policy's trajectories **What to do:** ```python class OfflineRLEvaluation: """ Conservative offline RL evaluation protocol. """ @staticmethod def in_distribution_performance(offline_dataset, policy): """ Evaluate policy on dataset trajectories (lower bound). This measures: "How well does policy match best trajectories in dataset?" NOT "How good is the policy?" """ returns = [] for trajectory in offline_dataset: # Check if policy would generate this trajectory policy_match = True for (state, action) in trajectory: policy_action = policy(state) if not actions_match(policy_action, action): policy_match = False break if policy_match: returns.append(trajectory.return) if len(returns) == 0: return None # Policy doesn't match any dataset trajectory return np.mean(returns) @staticmethod def behavioral_cloning_baseline(offline_dataset): """ Train behavior cloning on dataset (baseline). Offline RL should outperform BC, otherwise it's not learning. """ bc_policy = BehaviorCloning(offline_dataset) bc_policy.train() return bc_policy @staticmethod def model_based_evaluation(offline_dataset, policy, model): """ Use learned dynamics model for evaluation (if available). CAUTION: Model errors compound. Short rollouts only. """ # Train dynamics model on dataset model.train(offline_dataset) # Generate short rollouts (5-10 steps) rollout_returns = [] for _ in range(100): state = sample_initial_state(offline_dataset) rollout_return = 0 for step in range(10): # Short rollouts only action = policy(state) next_state, reward = model.predict(state, action) rollout_return += reward state = next_state rollout_returns.append(rollout_return) # Heavy discount for model uncertainty uncertainty = model.get_uncertainty(offline_dataset) adjusted_return = np.mean(rollout_returns) * (1 - uncertainty) return adjusted_return @staticmethod def state_coverage_metric(offline_dataset, policy, num_rollouts=100): """ Measure how much policy stays in-distribution. Low coverage → policy goes OOD → evaluation unreliable """ # Get dataset state distribution dataset_states = get_all_states(offline_dataset) # Simulate policy rollouts policy_states = [] for _ in range(num_rollouts): trajectory = simulate_with_model(policy) # Needs model policy_states.extend(trajectory.states) # Compute coverage (fraction of policy states near dataset states) coverage = compute_coverage(policy_states, dataset_states) return coverage @staticmethod def full_offline_evaluation(offline_dataset, policy): """ Comprehensive offline evaluation (still conservative). """ results = {} # 1. In-distribution performance results['in_dist_perf'] = OfflineRLEvaluation.in_distribution_performance( offline_dataset, policy ) # 2. Compare to behavior cloning bc_policy = OfflineRLEvaluation.behavioral_cloning_baseline(offline_dataset) results['bc_baseline'] = evaluate(bc_policy, offline_dataset) # 3. Model-based evaluation (if model available) # model = train_dynamics_model(offline_dataset) # results['model_eval'] = OfflineRLEvaluation.model_based_evaluation( # offline_dataset, policy, model # ) # 4. State coverage # results['coverage'] = OfflineRLEvaluation.state_coverage_metric( # offline_dataset, policy # ) return results # Usage offline_dataset = load_offline_dataset("d4rl-halfcheetah-medium-v0") offline_policy = CQL(offline_dataset) offline_policy.train() eval_results = OfflineRLEvaluation.full_offline_evaluation( offline_dataset, offline_policy ) print("Offline Evaluation (CONSERVATIVE):") print(f"In-distribution performance: {eval_results['in_dist_perf']}") print(f"BC baseline: {eval_results['bc_baseline']}") print("\nWARNING: These are lower bounds. True performance unknown without online evaluation.") ``` ### Staged Deployment for Offline RL **Best practice: Gradually introduce online evaluation** ```python def staged_offline_to_online_deployment(offline_policy, env): """ Staged deployment: Offline → Small online trial → Full deployment Stage 1: Offline evaluation (conservative) Stage 2: Small online trial (safety-constrained) Stage 3: Full online evaluation Stage 4: Deployment """ results = {} # Stage 1: Offline evaluation print("Stage 1: Offline evaluation") offline_perf = offline_evaluation(offline_policy) results['offline'] = offline_perf if offline_perf < minimum_threshold: print("Failed offline evaluation. Stop.") return results # Stage 2: Small online trial (100 episodes) print("Stage 2: Small online trial (100 episodes)") online_trial_perf = evaluate(offline_policy, env, episodes=100) results['small_trial'] = online_trial_perf # Check degradation degradation = (offline_perf - online_trial_perf) / offline_perf if degradation > 0.3: # >30% degradation print(f"WARNING: {degradation:.1%} performance drop in online trial") print("Policy may be overfitting to offline data. Investigate.") return results # Stage 3: Full online evaluation (1000 episodes) print("Stage 3: Full online evaluation (1000 episodes)") online_full_perf = evaluate(offline_policy, env, episodes=1000) results['full_online'] = online_full_perf # Stage 4: Deployment decision if online_full_perf > deployment_threshold: print("Passed all stages. Ready for deployment.") results['deploy'] = True else: print("Failed online evaluation. Do not deploy.") results['deploy'] = False return results ``` ## Common Pitfalls ### Pitfall 1: Single Seed Reporting **Symptom:** Reporting one training run as "the result" **Why it's wrong:** RL has high variance. Single seed is noise. **Detection:** - Paper shows single training curve - No variance/error bars - No mention of multiple seeds **Fix:** Minimum 5-10 seeds, report mean ± std ### Pitfall 2: Cherry-Picking Results **Symptom:** Running many experiments, reporting best **Why it's wrong:** Creates false positives (p-hacking) **Detection:** - Results seem too good - No mention of failed runs - "We tried many seeds and picked a representative one" **Fix:** Report ALL runs. Pre-register experiments. ### Pitfall 3: Evaluating on Training Set **Symptom:** Agent evaluated on same environment instances used for training **Why it's wrong:** Measures memorization, not generalization **Detection:** - No mention of train/test split - Same random seed for training and evaluation - Perfect performance on specific instances **Fix:** Separate train/test environments with different seeds ### Pitfall 4: Ignoring Sample Efficiency **Symptom:** Comparing algorithms only on final performance **Why it's wrong:** Final performance ignores cost to achieve it **Detection:** - No sample efficiency curves - No "reward at X steps" metrics - Only asymptotic performance reported **Fix:** Report sample efficiency curves, compare at multiple budgets ### Pitfall 5: Conflating Train and Eval Performance **Symptom:** Using training episode returns as evaluation **Why it's wrong:** Training uses exploration, evaluation should not **Detection:** - "Training reward" used for algorithm comparison - No separate evaluation protocol - Same environment instance for both **Fix:** Separate training (with exploration) and evaluation (without) ### Pitfall 6: Insufficient Evaluation Episodes **Symptom:** Evaluating with 5-10 episodes **Why it's wrong:** High variance → unreliable estimates **Detection:** - Large error bars - Inconsistent results across runs - SEM > 10% of mean **Fix:** 50-100 episodes minimum, power analysis for exact number ### Pitfall 7: Reporting Peak Instead of Final **Symptom:** Selecting best checkpoint during training **Why it's wrong:** Peak is overfitting to evaluation variance **Detection:** - "Best performance during training" reported - Early stopping based on eval performance - No mention of final performance **Fix:** Report final performance, or use validation set for model selection ### Pitfall 8: No Generalization Testing **Symptom:** Only evaluating on single environment configuration **Why it's wrong:** Doesn't measure robustness to distribution shift **Detection:** - No mention of distribution shifts - Only one environment configuration tested - No transfer/zero-shot evaluation **Fix:** Test on held-out environments, distribution shifts, adversarial cases ### Pitfall 9: Inconsistent Evaluation Mode **Symptom:** Comparing stochastic and deterministic evaluations **Why it's wrong:** Evaluation mode affects results by 10-30% **Detection:** - No mention of evaluation mode - Comparing algorithms with different modes - Unclear if sampling or mean action used **Fix:** Specify evaluation mode, ensure consistency across comparisons ### Pitfall 10: Offline RL Without Online Validation **Symptom:** Deploying offline RL policy based on Q-values alone **Why it's wrong:** Q-values extrapolate OOD, unreliable **Detection:** - No online rollouts before deployment - Claiming performance based on learned values - Ignoring distribution shift **Fix:** Staged deployment (offline → small online trial → full deployment) ## Red Flags | Red Flag | Implication | Action | |----------|-------------|--------| | Only one training curve shown | Single seed, cherry-picked | Demand multi-seed results | | No error bars or confidence intervals | No variance accounting | Require statistical rigor | | "We picked a representative seed" | Cherry-picking | Reject, require all seeds | | No train/test split mentioned | Likely overfitting | Check evaluation protocol | | No sample efficiency curves | Ignoring cost | Request curves or AUC | | Evaluation mode not specified | Unclear methodology | Ask: stochastic or deterministic? | | < 20 evaluation episodes | High variance | Require more episodes | | Only final performance reported | Missing sample efficiency | Request performance at multiple steps | | No generalization testing | Narrow evaluation | Request distribution shift tests | | Offline RL with no online validation | Unreliable estimates | Require online trial | | Results too good to be true | Probably cherry-picked or overfitting | Deep investigation | | p-value reported without effect size | Statistically significant but practically irrelevant | Check Cohen's d | ## Rationalization Table | Rationalization | Why It's Wrong | Counter | |-----------------|----------------|---------| | "RL papers commonly use single seed, so it's acceptable" | Common ≠ correct. Field is improving standards. | "Newer venues require multi-seed. Improve rigor." | | "Our algorithm is deterministic, variance is low" | Algorithm determinism ≠ environment/initialization determinism | "Environment randomness still causes variance." | | "We don't have compute for 10 seeds" | Then don't make strong performance claims | "Report 3-5 seeds with caveats, or wait for compute." | | "Evaluation on training set is faster" | Speed < correctness | "Fast wrong answer is worse than slow right answer." | | "We care about final performance, not sample efficiency" | Depends on application, often sample efficiency matters | "Clarify deployment constraints. Samples usually matter." | | "Stochastic/deterministic doesn't matter" | 10-30% difference is common | "Specify mode, ensure fair comparison." | | "10 eval episodes is enough" | Standard error likely > 10% of mean | "Compute SEM, use power analysis." | | "Our environment is simple, doesn't need generalization testing" | Deployment is rarely identical to training | "Test at least 2-3 distribution shifts." | | "Offline RL Q-values are accurate" | Only for in-distribution, not OOD | "Q-values extrapolate. Need online validation." | | "We reported the best run, but all were similar" | Then report all and show they're similar | "Show mean ± std to prove similarity." | ## Decision Trees ### Decision Tree 1: How Many Seeds? ``` What is the use case? ├─ Quick internal comparison │ └─ 3-5 seeds (caveat: preliminary results) ├─ Algorithm selection for production │ └─ 10-20 seeds ├─ Publication │ └─ 10-20 seeds (depends on venue) └─ Safety-critical deployment └─ 20-50 seeds (need tight confidence intervals) ``` ### Decision Tree 2: Evaluation Mode? ``` Is policy inherently deterministic? ├─ YES (DQN, deterministic policies) │ └─ Deterministic evaluation └─ NO (PPO, SAC, stochastic policies) ├─ Will deployment use stochastic policy? │ ├─ YES │ │ └─ Stochastic evaluation │ └─ NO │ └─ Deterministic evaluation └─ Unsure? └─ Report BOTH, explain trade-offs ``` ### Decision Tree 3: How Many Evaluation Episodes? ``` What is variance estimate? ├─ Unknown │ └─ Start with 20 episodes, estimate variance, use power analysis ├─ Known (σ) │ ├─ Low variance (σ < 0.1 * μ) │ │ └─ 20-50 episodes sufficient │ ├─ Medium variance (0.1 * μ ≤ σ < 0.3 * μ) │ │ └─ 50-100 episodes │ └─ High variance (σ ≥ 0.3 * μ) │ └─ 100-500 episodes (or use variance reduction techniques) ``` ### Decision Tree 4: Generalization Testing? ``` Is environment parameterized or procedurally generated? ├─ YES (multiple instances possible) │ ├─ Use train/test split (80/20) │ └─ Report both train and test performance └─ NO (single environment) ├─ Can you create distribution shifts? │ ├─ YES (modify dynamics, observations, etc.) │ │ └─ Test on 3-5 distribution shifts │ └─ NO │ └─ At minimum, use different random seed for eval ``` ### Decision Tree 5: Offline RL Evaluation? ``` Can you do online rollouts? ├─ YES │ └─ Use staged deployment (offline → small trial → full online) ├─ NO (completely offline) │ ├─ Use conservative offline metrics │ ├─ Compare to behavior cloning baseline │ ├─ Clearly state limitations │ └─ Do NOT claim actual performance, only lower bounds └─ PARTIAL (limited online budget) └─ Use model-based evaluation + small online trial ``` ## Workflow ### Standard Evaluation Workflow ``` 1. Pre-Experiment Planning ☐ Define evaluation protocol BEFORE running experiments ☐ Select number of seeds (minimum 5-10) ☐ Define train/test split if applicable ☐ Specify evaluation mode (stochastic/deterministic) ☐ Define sample budgets for efficiency curves ☐ Pre-register experiments (commit to protocol) 2. Training Phase ☐ Train on training environments ONLY ☐ Use separate eval environments with different seeds ☐ Evaluate at regular intervals (adaptive schedule) ☐ Save checkpoints at evaluation points ☐ Log both training and evaluation performance 3. Evaluation Phase ☐ Final evaluation on test set (never seen during training) ☐ Use sufficient episodes (50-100 minimum) ☐ Evaluate across all seeds ☐ Compute statistics (mean, std, CI, median, IQR) ☐ Test generalization (distribution shifts, zero-shot transfer) 4. Analysis Phase ☐ Compute sample efficiency metrics (AUC, reward at budgets) ☐ Statistical significance testing if comparing algorithms ☐ Check effect size (Cohen's d), not just p-value ☐ Identify failure cases and edge cases ☐ Measure robustness to perturbations 5. Reporting Phase ☐ Report all seeds, not selected subset ☐ Include mean ± std or 95% CI ☐ Show sample efficiency curves ☐ Report both training and generalization performance ☐ Specify evaluation mode ☐ Include negative results and failure analysis ☐ Provide reproducibility details (seeds, hyperparameters) ``` ### Checklist for Publication/Deployment ``` Statistical Rigor: ☐ Minimum 10 seeds ☐ Mean ± std or 95% CI reported ☐ Statistical significance testing (if comparing algorithms) ☐ Effect size reported (Cohen's d) Train/Test Discipline: ☐ Separate train/test environments ☐ Different random seeds for train/eval ☐ No evaluation on training data ☐ Generalization gap reported (train vs test performance) Comprehensive Metrics: ☐ Final performance ☐ Sample efficiency curves ☐ Performance at multiple sample budgets ☐ Evaluation mode specified (stochastic/deterministic) Generalization: ☐ Tested on distribution shifts ☐ Zero-shot transfer evaluation (if applicable) ☐ Robustness to perturbations Methodology: ☐ Sufficient evaluation episodes (50-100+) ☐ Evaluation protocol clearly described ☐ Reproducibility details provided ☐ Negative results included Offline RL (if applicable): ☐ Conservative offline metrics used ☐ Online validation included (or limitations clearly stated) ☐ Comparison to behavior cloning baseline ☐ Distribution shift acknowledged ``` ## Integration with rl-debugging RL evaluation and debugging are closely related: **Use rl-debugging when:** - Evaluation reveals poor performance - Need to diagnose WHY agent fails - Debugging training issues **Use rl-evaluation when:** - Agent seems to work, need to measure HOW WELL - Comparing multiple algorithms - Preparing for deployment **Combined workflow:** 1. Train agent 2. Evaluate (rl-evaluation skill) 3. If performance poor → Debug (rl-debugging skill) 4. Fix issues 5. Re-evaluate 6. Repeat until satisfactory 7. Final rigorous evaluation for deployment ## Summary RL evaluation is NOT just "run the agent and see what happens." It requires: 1. **Statistical rigor**: Multi-seed, confidence intervals, significance testing 2. **Train/test discipline**: Separate environments, no overfitting 3. **Comprehensive metrics**: Sample efficiency, generalization, robustness 4. **Appropriate protocols**: Evaluation mode, episode count, frequency 5. **Offline RL awareness**: Conservative estimates, staged deployment Without rigorous evaluation: - You will draw wrong conclusions from noise - You will deploy agents that fail in production - You will waste resources on false improvements - You will make scientifically invalid claims With rigorous evaluation: - Reliable performance estimates - Valid algorithm comparisons - Deployment-ready agents - Reproducible research **When in doubt:** More seeds, more episodes, more generalization tests. **END OF SKILL**