Vectorized Environments in Stable Baselines3
This document provides comprehensive information about vectorized environments in Stable Baselines3 for efficient parallel training.
Overview
Vectorized environments stack multiple independent environment instances into a single environment that processes actions and observations in batches. Instead of interacting with one environment at a time, you interact with n environments simultaneously.
Benefits:
- Speed: Parallel execution significantly accelerates training
- Sample efficiency: Collect more diverse experiences faster
- Required for: Frame stacking and normalization wrappers
- Better for: On-policy algorithms (PPO, A2C)
VecEnv Types
DummyVecEnv
Executes environments sequentially on the current Python process.
from stable_baselines3.common.vec_env import DummyVecEnv
# Method 1: Using make_vec_env
from stable_baselines3.common.env_util import make_vec_env
env = make_vec_env("CartPole-v1", n_envs=4, vec_env_cls=DummyVecEnv)
# Method 2: Manual creation
import gymnasium as gym

def make_env():
    def _init():
        return gym.make("CartPole-v1")
    return _init

env = DummyVecEnv([make_env() for _ in range(4)])
When to use:
- Lightweight environments (CartPole, simple grids)
- When multiprocessing overhead > computation time
- Debugging (easier to trace errors)
- Single-threaded environments
Performance: No actual parallelism (sequential execution).
SubprocVecEnv
Executes each environment in a separate process, enabling true parallelism.
from stable_baselines3.common.vec_env import SubprocVecEnv
from stable_baselines3.common.env_util import make_vec_env
env = make_vec_env("CartPole-v1", n_envs=8, vec_env_cls=SubprocVecEnv)
When to use:
- Computationally expensive environments (physics simulations, 3D games)
- When environment computation time justifies multiprocessing overhead
- When you need true parallel execution
Important: Requires wrapping code in if __name__ == "__main__": when using forkserver or spawn:
if __name__ == "__main__":
    env = make_vec_env("CartPole-v1", n_envs=8, vec_env_cls=SubprocVecEnv)
    model = PPO("MlpPolicy", env)
    model.learn(total_timesteps=100000)
Performance: True parallelism across CPU cores.
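If you need to control the multiprocessing start method explicitly (e.g. "spawn" on platforms where the default causes issues), it can be forwarded to SubprocVecEnv. A minimal sketch using make_vec_env's vec_env_kwargs parameter to pass SubprocVecEnv's start_method argument; treat the choice of "spawn" as illustrative:
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv

if __name__ == "__main__":
    # Forward start_method to the SubprocVecEnv constructor
    env = make_vec_env(
        "CartPole-v1",
        n_envs=8,
        vec_env_cls=SubprocVecEnv,
        vec_env_kwargs={"start_method": "spawn"},
    )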
Quick Setup with make_vec_env
The easiest way to create vectorized environments:
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv
# Basic usage
env = make_vec_env("CartPole-v1", n_envs=4)
# With SubprocVecEnv
env = make_vec_env("CartPole-v1", n_envs=8, vec_env_cls=SubprocVecEnv)
# With custom environment kwargs
env = make_vec_env(
    "MyEnv-v0",
    n_envs=4,
    env_kwargs={"difficulty": "hard", "max_steps": 500},
)
# With custom seed
env = make_vec_env("CartPole-v1", n_envs=4, seed=42)
API Differences from Standard Gym
Vectorized environments have a different API than standard Gym environments:
reset()
Standard Gym:
obs, info = env.reset()
VecEnv:
obs = env.reset() # Returns only observations (numpy array)
# Access info via env.reset_infos (list of dicts)
infos = env.reset_infos
step()
Standard Gym:
obs, reward, terminated, truncated, info = env.step(action)
VecEnv:
obs, rewards, dones, infos = env.step(actions)
# Returns 4-tuple instead of 5-tuple
# dones = terminated | truncated
# actions is an array of shape (n_envs,) or (n_envs, action_dim)
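If you still need to distinguish termination from truncation, SB3 stores that information in the info dicts. A short sketch relying on the "TimeLimit.truncated" key that SB3's VecEnvs set when an episode ends by truncation rather than termination:
obs, rewards, dones, infos = env.step(actions)
for i, done in enumerate(dones):
    if done:
        truncated = infos[i].get("TimeLimit.truncated", False)
        terminated = not truncated  # done but not truncated => true termination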
Auto-reset
VecEnv automatically resets environments when episodes end:
import numpy as np

obs = env.reset()  # Shape: (n_envs, obs_dim)
for _ in range(1000):
    # action_space is that of a single sub-environment, so sample one action per env
    actions = np.array([env.action_space.sample() for _ in range(env.num_envs)])
    obs, rewards, dones, infos = env.step(actions)
    # If dones[i] is True, env i was automatically reset
    # The final observation before the reset is available in infos[i]["terminal_observation"]
Terminal Observations
When an episode ends, access the true final observation:
obs, rewards, dones, infos = env.step(actions)
for i, done in enumerate(dones):
    if done:
        # obs[i] is already the observation after the automatic reset
        # The true terminal observation is stored in the info dict
        terminal_obs = infos[i]["terminal_observation"]
        print(f"Episode ended with terminal observation: {terminal_obs}")
Training with Vectorized Environments
On-Policy Algorithms (PPO, A2C)
On-policy algorithms benefit greatly from vectorization:
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv
# Create vectorized environment
env = make_vec_env("CartPole-v1", n_envs=8, vec_env_cls=SubprocVecEnv)
# Train
model = PPO("MlpPolicy", env, verbose=1, n_steps=128)
model.learn(total_timesteps=100000)
# With n_envs=8 and n_steps=128:
# - Collects 8*128=1024 steps per rollout
# - Updates after every 1024 steps
Rule of thumb: Use 4-16 parallel environments for on-policy methods.
Off-Policy Algorithms (SAC, TD3, DQN)
Off-policy algorithms can use vectorization but benefit less:
from stable_baselines3 import SAC
from stable_baselines3.common.env_util import make_vec_env
# Use fewer environments (1-4)
env = make_vec_env("Pendulum-v1", n_envs=4)
# Set gradient_steps=-1 for efficiency
model = SAC(
    "MlpPolicy",
    env,
    verbose=1,
    train_freq=1,
    gradient_steps=-1,  # Do 1 gradient step per env step (4 total with 4 envs)
)
model.learn(total_timesteps=50000)
Rule of thumb: Use 1-4 parallel environments for off-policy methods.
Wrappers for Vectorized Environments
VecNormalize
Normalizes observations and rewards using running statistics.
from stable_baselines3.common.vec_env import VecNormalize
env = make_vec_env("Pendulum-v1", n_envs=4)
# Wrap with normalization
env = VecNormalize(
    env,
    norm_obs=True,     # Normalize observations
    norm_reward=True,  # Normalize rewards
    clip_obs=10.0,     # Clip normalized observations
    clip_reward=10.0,  # Clip normalized rewards
    gamma=0.99,        # Discount factor for reward normalization
)
# Train
model = PPO("MlpPolicy", env)
model.learn(total_timesteps=50000)
# Save model AND normalization statistics
model.save("ppo_pendulum")
env.save("vec_normalize.pkl")
# Load for evaluation
env = make_vec_env("Pendulum-v1", n_envs=1)
env = VecNormalize.load("vec_normalize.pkl", env)
env.training = False # Don't update stats during evaluation
env.norm_reward = False # Don't normalize rewards during evaluation
model = PPO.load("ppo_pendulum", env=env)
When to use:
- Continuous control tasks (especially MuJoCo)
- When observation scales vary widely
- When rewards have high variance
Important:
- Statistics are NOT saved with the model - save them separately
- Disable training (statistics updates) and reward normalization during evaluation
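If you need the unnormalized values during training (e.g. for logging), VecNormalize keeps them available. A minimal sketch using its get_original_obs() and get_original_reward() helpers; the environment and random-action sampling here are illustrative:
import numpy as np
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecNormalize

env = VecNormalize(make_vec_env("Pendulum-v1", n_envs=4), norm_obs=True, norm_reward=True)
obs = env.reset()
actions = np.array([env.action_space.sample() for _ in range(env.num_envs)])
obs, rewards, dones, infos = env.step(actions)
raw_obs = env.get_original_obs()         # Unnormalized observations from the last step
raw_rewards = env.get_original_reward()  # Unnormalized rewards from the last step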
VecFrameStack
Stacks observations from multiple consecutive frames.
from stable_baselines3.common.vec_env import VecFrameStack
env = make_vec_env("PongNoFrameskip-v4", n_envs=8)
# Stack 4 frames
env = VecFrameStack(env, n_stack=4)
# Observations now stack the last 4 frames along the channel (last) axis,
# e.g. (n_envs, height, width, 4 * n_channels) for image observations
model = PPO("CnnPolicy", env)
model.learn(total_timesteps=1000000)
When to use:
- Atari games (stack 4 frames)
- Environments where velocity information is needed
- Partial observability problems
VecVideoRecorder
Records videos of agent behavior.
from stable_baselines3.common.vec_env import VecVideoRecorder
env = make_vec_env("CartPole-v1", n_envs=1)
# Record videos
env = VecVideoRecorder(
    env,
    video_folder="./videos/",
    record_video_trigger=lambda x: x % 2000 == 0,  # Record every 2000 steps
    video_length=200,  # Max video length (in steps)
    name_prefix="training",
)
model = PPO("MlpPolicy", env)
model.learn(total_timesteps=10000)
Output: MP4 videos in ./videos/ directory.
VecCheckNan
Checks for NaN or infinite values in observations and rewards.
from stable_baselines3.common.vec_env import VecCheckNan
env = make_vec_env("CustomEnv-v0", n_envs=4)
# Add NaN checking (useful for debugging)
env = VecCheckNan(env, raise_exception=True, warn_once=True)
model = PPO("MlpPolicy", env)
model.learn(total_timesteps=10000)
When to use:
- Debugging custom environments
- Catching numerical instabilities
- Validating environment implementation
VecTransposeImage
Transposes image observations from (height, width, channels) to (channels, height, width).
from stable_baselines3.common.vec_env import VecTransposeImage
env = make_vec_env("PongNoFrameskip-v4", n_envs=4)
# Convert HWC to CHW format
env = VecTransposeImage(env)
model = PPO("CnnPolicy", env)
When to use:
- When the environment returns images in HWC format (SB3's CNN policies expect CHW)
- Note: recent SB3 versions apply this wrapper automatically when they detect channels-last image observations, so manual wrapping is rarely required
Advanced Usage
Custom VecEnv
Create a custom vectorized environment by subclassing an existing VecEnv:
from stable_baselines3.common.vec_env import DummyVecEnv
import gymnasium as gym
class CustomVecEnv(DummyVecEnv):
    def step_wait(self):
        # Custom logic before/after stepping
        obs, rewards, dones, infos = super().step_wait()
        # Modify observations/rewards/etc.
        return obs, rewards, dones, infos
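For illustration, such a subclass is used exactly like DummyVecEnv; a short usage sketch for the CustomVecEnv defined above:
env = CustomVecEnv([lambda: gym.make("CartPole-v1") for _ in range(4)])
obs = env.reset()  # Same VecEnv API: reset() returns only the batched observations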
Environment Method Calls
Call methods on wrapped environments:
env = make_vec_env("MyEnv-v0", n_envs=4)
# Call method on all environments
env.env_method("set_difficulty", "hard")
# Call method on specific environment
env.env_method("reset_level", indices=[0, 2])
# Get attribute from all environments
levels = env.get_attr("current_level")
Setting Attributes
# Set attribute on all environments
env.set_attr("difficulty", "hard")
# Set attribute on specific environments
env.set_attr("max_steps", 1000, indices=[1, 3])
Performance Optimization
Choosing Number of Environments
On-Policy (PPO, A2C):
# General rule: 4-16 environments
# More environments = faster data collection
n_envs = 8
env = make_vec_env("CartPole-v1", n_envs=n_envs)
# Adjust n_steps to maintain same rollout length
# Total steps per rollout = n_envs * n_steps
model = PPO("MlpPolicy", env, n_steps=128) # 8*128 = 1024 steps/rollout
Off-Policy (SAC, TD3, DQN):
# General rule: 1-4 environments
# More doesn't help as much (replay buffer provides diversity)
n_envs = 4
env = make_vec_env("Pendulum-v1", n_envs=n_envs)
model = SAC("MlpPolicy", env, gradient_steps=-1) # 1 grad step per env step
CPU Core Utilization
import multiprocessing
# Use one less than total cores (leave one for Python main process)
n_cpus = multiprocessing.cpu_count() - 1
env = make_vec_env("MyEnv-v0", n_envs=n_cpus, vec_env_cls=SubprocVecEnv)
Memory Considerations
# Large replay buffer + many environments = high memory usage
# Reduce buffer size if memory constrained
model = SAC(
    "MlpPolicy",
    env,
    buffer_size=100_000,  # Reduced from 1M
)
Common Issues
Issue: "Can't pickle local object"
Cause: SubprocVecEnv requires picklable environments.
Solution: Define the environment-creation function at module level (outside any class or function):
# Bad: make_env is defined inside another function and cannot be pickled
def train():
    def make_env():
        return gym.make("CartPole-v1")
    env = SubprocVecEnv([make_env for _ in range(4)])

# Good: make_env is defined at module level
def make_env():
    return gym.make("CartPole-v1")

if __name__ == "__main__":
    env = SubprocVecEnv([make_env for _ in range(4)])
Issue: Different behavior between single and vectorized env
Cause: Auto-reset in vectorized environments.
Solution: Handle terminal observations correctly:
obs, rewards, dones, infos = env.step(actions)
for i, done in enumerate(dones):
    if done:
        terminal_obs = infos[i]["terminal_observation"]
        # Process terminal_obs if needed
Issue: Slower with SubprocVecEnv than DummyVecEnv
Cause: Environment too lightweight (multiprocessing overhead > computation).
Solution: Use DummyVecEnv for simple environments:
# For CartPole, use DummyVecEnv
env = make_vec_env("CartPole-v1", n_envs=8, vec_env_cls=DummyVecEnv)
Issue: Training crashes with SubprocVecEnv
Cause: Environment not properly isolated or has shared state.
Solution:
- Ensure environment has no shared global state
- Wrap code in if __name__ == "__main__":
- Use DummyVecEnv for debugging
Best Practices
1. Use appropriate VecEnv type:
   - DummyVecEnv: Simple environments (CartPole, basic grids)
   - SubprocVecEnv: Complex environments (MuJoCo, Unity, 3D games)
2. Adjust hyperparameters for vectorization:
   - Divide eval_freq and save_freq by n_envs in callbacks (see the sketch after this list)
   - Maintain the same n_steps * n_envs for on-policy algorithms
3. Save normalization statistics:
   - Always save VecNormalize stats with the model
   - Disable training during evaluation
4. Monitor memory usage:
   - More environments = more memory
   - Reduce buffer size if needed
5. Test with DummyVecEnv first:
   - Easier debugging
   - Ensure the environment works before parallelizing
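As noted in item 2, callback frequencies such as eval_freq count calls to env.step(), and each call advances all n_envs environments at once, so divide by n_envs to keep the same schedule in environment steps. A minimal sketch using EvalCallback; the target frequency of 10,000 environment steps is illustrative:
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import EvalCallback
from stable_baselines3.common.env_util import make_vec_env

n_envs = 8
env = make_vec_env("CartPole-v1", n_envs=n_envs)
eval_env = make_vec_env("CartPole-v1", n_envs=1)

# Evaluate roughly every 10,000 environment steps: divide by n_envs
eval_callback = EvalCallback(eval_env, eval_freq=max(10_000 // n_envs, 1))

model = PPO("MlpPolicy", env)
model.learn(total_timesteps=100_000, callback=eval_callback)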
Examples
Basic Training Loop
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv
# Create vectorized environment
env = make_vec_env("CartPole-v1", n_envs=8, vec_env_cls=SubprocVecEnv)
# Train
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100000)
# Evaluate
obs = env.reset()
for _ in range(1000):
    action, _states = model.predict(obs, deterministic=True)
    obs, rewards, dones, infos = env.step(action)
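For a more rigorous evaluation than the manual loop above, the evaluate_policy helper handles the episode bookkeeping; a short sketch, assuming a separate single-environment evaluation env:
from stable_baselines3.common.evaluation import evaluate_policy

eval_env = make_vec_env("CartPole-v1", n_envs=1)
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)
print(f"Mean reward: {mean_reward:.1f} +/- {std_reward:.1f}")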
With Normalization
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecNormalize
# Create and normalize
env = make_vec_env("Pendulum-v1", n_envs=4)
env = VecNormalize(env, norm_obs=True, norm_reward=True)
# Train
model = PPO("MlpPolicy", env)
model.learn(total_timesteps=50000)
# Save both
model.save("model")
env.save("vec_normalize.pkl")
# Load for evaluation
eval_env = make_vec_env("Pendulum-v1", n_envs=1)
eval_env = VecNormalize.load("vec_normalize.pkl", eval_env)
eval_env.training = False
eval_env.norm_reward = False
model = PPO.load("model", env=eval_env)
Additional Resources
- Official SB3 VecEnv Guide: https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html
- VecEnv API Reference: https://stable-baselines3.readthedocs.io/en/master/common/vec_env.html
- Multiprocessing Best Practices: https://docs.python.org/3/library/multiprocessing.html