Vectorized Environments in Stable Baselines3

This document provides comprehensive information about vectorized environments in Stable Baselines3 for efficient parallel training.

Overview

Vectorized environments stack multiple independent environment instances into a single environment that processes actions and observations in batches. Instead of interacting with one environment at a time, you interact with n environments simultaneously.

Benefits:

  • Speed: Parallel data collection significantly accelerates training
  • Diversity: Experience gathered from several environment copies at once is less correlated
  • Required for: VecEnv-only wrappers such as VecFrameStack and VecNormalize
  • Especially beneficial for: On-policy algorithms (PPO, A2C)
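
For instance (a minimal sketch), four CartPole instances yield batched observations, rewards, and done flags:

import numpy as np
from stable_baselines3.common.env_util import make_vec_env

env = make_vec_env("CartPole-v1", n_envs=4)
obs = env.reset()
print(obs.shape)  # (4, 4): one 4-dimensional observation per environment

# Sample one action per environment
actions = np.array([env.action_space.sample() for _ in range(env.num_envs)])
obs, rewards, dones, infos = env.step(actions)
print(rewards.shape, dones.shape)  # (4,) and (4,)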

VecEnv Types

DummyVecEnv

Executes environments sequentially on the current Python process.

from stable_baselines3.common.vec_env import DummyVecEnv

# Method 1: Using make_vec_env
from stable_baselines3.common.env_util import make_vec_env

env = make_vec_env("CartPole-v1", n_envs=4, vec_env_cls=DummyVecEnv)

# Method 2: Manual creation
import gymnasium as gym

def make_env():
    def _init():
        return gym.make("CartPole-v1")
    return _init

env = DummyVecEnv([make_env() for _ in range(4)])

When to use:

  • Lightweight environments (CartPole, simple grids)
  • When multiprocessing overhead > computation time
  • Debugging (easier to trace errors)
  • Single-threaded environments

Performance: No actual parallelism (sequential execution).

SubprocVecEnv

Executes each environment in a separate process, enabling true parallelism.

from stable_baselines3.common.vec_env import SubprocVecEnv
from stable_baselines3.common.env_util import make_vec_env

env = make_vec_env("CartPole-v1", n_envs=8, vec_env_cls=SubprocVecEnv)

When to use:

  • Computationally expensive environments (physics simulations, 3D games)
  • When environment computation time justifies multiprocessing overhead
  • When you need true parallel execution

Important: The training entry point must be wrapped in if __name__ == "__main__": when using the forkserver or spawn start methods (the SB3 defaults):

if __name__ == "__main__":
    env = make_vec_env("CartPole-v1", n_envs=8, vec_env_cls=SubprocVecEnv)
    model = PPO("MlpPolicy", env)
    model.learn(total_timesteps=100000)

Performance: True parallelism across CPU cores.
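
The multiprocessing start method can be selected explicitly through vec_env_kwargs; a minimal sketch (forkserver is the default where available, otherwise spawn):

env = make_vec_env(
    "CartPole-v1",
    n_envs=8,
    vec_env_cls=SubprocVecEnv,
    vec_env_kwargs={"start_method": "spawn"},
)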

Quick Setup with make_vec_env

The easiest way to create vectorized environments:

from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv

# Basic usage
env = make_vec_env("CartPole-v1", n_envs=4)

# With SubprocVecEnv
env = make_vec_env("CartPole-v1", n_envs=8, vec_env_cls=SubprocVecEnv)

# With custom environment kwargs
env = make_vec_env(
    "MyEnv-v0",
    n_envs=4,
    env_kwargs={"difficulty": "hard", "max_steps": 500}
)

# With custom seed
env = make_vec_env("CartPole-v1", n_envs=4, seed=42)

API Differences from Standard Gym

Vectorized environments have a different API than standard Gym environments:

reset()

Standard Gym:

obs, info = env.reset()

VecEnv:

obs = env.reset()  # Returns only observations (numpy array)
# Access info via env.reset_infos (list of dicts)
infos = env.reset_infos

step()

Standard Gym:

obs, reward, terminated, truncated, info = env.step(action)

VecEnv:

obs, rewards, dones, infos = env.step(actions)
# Returns 4-tuple instead of 5-tuple
# dones = terminated | truncated
# actions is an array of shape (n_envs,) or (n_envs, action_dim)
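
Because the 4-tuple folds terminated and truncated into a single dones flag, SB3 records the distinction in the info dicts; a sketch of how to recover it:

obs, rewards, dones, infos = env.step(actions)
for i in range(env.num_envs):
    if dones[i]:
        # SB3 sets "TimeLimit.truncated" when the episode ended only due to a time limit
        truncated = infos[i].get("TimeLimit.truncated", False)
        terminated = not truncated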

Auto-reset

VecEnv automatically resets environments when episodes end:

import numpy as np

obs = env.reset()  # Shape: (n_envs, obs_dim)
for _ in range(1000):
    # env.action_space is the space of a single sub-environment,
    # so sample one action per environment
    actions = np.array([env.action_space.sample() for _ in range(env.num_envs)])
    obs, rewards, dones, infos = env.step(actions)
    # If dones[i] is True, env i was automatically reset
    # The final observation before the reset is in infos[i]["terminal_observation"]

Terminal Observations

When an episode ends, access the true final observation:

obs, rewards, dones, infos = env.step(actions)

for i, done in enumerate(dones):
    if done:
        # The obs[i] is already the reset observation
        # True terminal observation is in info
        terminal_obs = infos[i]["terminal_observation"]
        print(f"Episode ended with terminal observation: {terminal_obs}")

Training with Vectorized Environments

On-Policy Algorithms (PPO, A2C)

On-policy algorithms benefit greatly from vectorization:

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv

# Create vectorized environment
env = make_vec_env("CartPole-v1", n_envs=8, vec_env_cls=SubprocVecEnv)

# Train
model = PPO("MlpPolicy", env, verbose=1, n_steps=128)
model.learn(total_timesteps=100000)

# With n_envs=8 and n_steps=128:
# - Collects 8*128=1024 steps per rollout
# - Updates after every 1024 steps

Rule of thumb: Use 4-16 parallel environments for on-policy methods.

Off-Policy Algorithms (SAC, TD3, DQN)

Off-policy algorithms can use vectorization but benefit less:

from stable_baselines3 import SAC
from stable_baselines3.common.env_util import make_vec_env

# Use fewer environments (1-4)
env = make_vec_env("Pendulum-v1", n_envs=4)

# Set gradient_steps=-1 for efficiency
model = SAC(
    "MlpPolicy",
    env,
    verbose=1,
    train_freq=1,
    gradient_steps=-1,  # Do 1 gradient step per env step (4 total with 4 envs)
)
model.learn(total_timesteps=50000)

Rule of thumb: Use 1-4 parallel environments for off-policy methods.

Wrappers for Vectorized Environments

VecNormalize

Normalizes observations and rewards using running statistics.

from stable_baselines3.common.vec_env import VecNormalize

env = make_vec_env("Pendulum-v1", n_envs=4)

# Wrap with normalization
env = VecNormalize(
    env,
    norm_obs=True,        # Normalize observations
    norm_reward=True,     # Normalize rewards
    clip_obs=10.0,        # Clip normalized observations
    clip_reward=10.0,     # Clip normalized rewards
    gamma=0.99,           # Discount factor for reward normalization
)

# Train
model = PPO("MlpPolicy", env)
model.learn(total_timesteps=50000)

# Save model AND normalization statistics
model.save("ppo_pendulum")
env.save("vec_normalize.pkl")

# Load for evaluation
env = make_vec_env("Pendulum-v1", n_envs=1)
env = VecNormalize.load("vec_normalize.pkl", env)
env.training = False  # Don't update stats during evaluation
env.norm_reward = False  # Don't normalize rewards during evaluation

model = PPO.load("ppo_pendulum", env=env)

When to use:

  • Continuous control tasks (especially MuJoCo)
  • When observation scales vary widely
  • When rewards have high variance

Important:

  • Statistics are NOT saved with model - save separately
  • Disable training and reward normalization during evaluation
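
If you need the raw values during training (e.g. for logging), VecNormalize keeps them available; a minimal sketch:

# Last unnormalized observations and rewards seen by the wrapper
raw_obs = env.get_original_obs()
raw_reward = env.get_original_reward()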

VecFrameStack

Stacks observations from multiple consecutive frames.

from stable_baselines3.common.vec_env import VecFrameStack

env = make_vec_env("PongNoFrameskip-v4", n_envs=8)

# Stack 4 frames
env = VecFrameStack(env, n_stack=4)

# Now observations have shape: (n_envs, n_stack, height, width)
model = PPO("CnnPolicy", env)
model.learn(total_timesteps=1000000)

When to use:

  • Atari games (stack 4 frames)
  • Environments where velocity information is needed
  • Partial observability problems

VecVideoRecorder

Records videos of agent behavior.

from stable_baselines3.common.vec_env import VecVideoRecorder

env = make_vec_env("CartPole-v1", n_envs=1)

# Record videos
env = VecVideoRecorder(
    env,
    video_folder="./videos/",
    record_video_trigger=lambda x: x % 2000 == 0,  # Record every 2000 steps
    video_length=200,  # Max video length
    name_prefix="training"
)

model = PPO("MlpPolicy", env)
model.learn(total_timesteps=10000)
env.close()  # Closing the env flushes the last recording to disk

Output: MP4 videos in the ./videos/ directory.

VecCheckNan

Checks for NaN or infinite values in observations and rewards.

from stable_baselines3.common.vec_env import VecCheckNan

env = make_vec_env("CustomEnv-v0", n_envs=4)

# Add NaN checking (useful for debugging)
env = VecCheckNan(env, raise_exception=True, warn_once=True)

model = PPO("MlpPolicy", env)
model.learn(total_timesteps=10000)

When to use:

  • Debugging custom environments
  • Catching numerical instabilities
  • Validating environment implementation

VecTransposeImage

Transposes image observations from (height, width, channels) to (channels, height, width).

from stable_baselines3.common.vec_env import VecTransposeImage

env = make_vec_env("PongNoFrameskip-v4", n_envs=4)

# Convert HWC to CHW format
env = VecTransposeImage(env)

model = PPO("CnnPolicy", env)

When to use:

  • When the environment returns images in HWC format (SB3's CNN policies expect CHW)
  • Note: SB3 applies this wrapper automatically when it detects image observations, so manual wrapping is rarely needed

Advanced Usage

Custom VecEnv

Create custom vectorized environment:

from stable_baselines3.common.vec_env import DummyVecEnv
import gymnasium as gym

class CustomVecEnv(DummyVecEnv):
    def step_wait(self):
        # Custom logic before/after stepping
        obs, rewards, dones, infos = super().step_wait()
        # Modify observations/rewards/etc
        return obs, rewards, dones, infos
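
As a concrete (hypothetical) illustration, a subclass that scales every reward before it reaches the agent:

import gymnasium as gym
from stable_baselines3.common.vec_env import DummyVecEnv

class RewardScaleVecEnv(DummyVecEnv):
    # Hypothetical example: scale rewards by 0.1
    def step_wait(self):
        obs, rewards, dones, infos = super().step_wait()
        return obs, rewards * 0.1, dones, infos

env = RewardScaleVecEnv([lambda: gym.make("CartPole-v1") for _ in range(4)])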

Environment Method Calls

Call methods on wrapped environments:

env = make_vec_env("MyEnv-v0", n_envs=4)

# Call method on all environments
env.env_method("set_difficulty", "hard")

# Call method on specific environment
env.env_method("reset_level", indices=[0, 2])

# Get attribute from all environments
levels = env.get_attr("current_level")

Setting Attributes

# Set attribute on all environments
env.set_attr("difficulty", "hard")

# Set attribute on specific environments
env.set_attr("max_steps", 1000, indices=[1, 3])

Performance Optimization

Choosing Number of Environments

On-Policy (PPO, A2C):

# General rule: 4-16 environments
# More environments = faster data collection
n_envs = 8
env = make_vec_env("CartPole-v1", n_envs=n_envs)

# Adjust n_steps to maintain same rollout length
# Total steps per rollout = n_envs * n_steps
model = PPO("MlpPolicy", env, n_steps=128)  # 8*128 = 1024 steps/rollout

Off-Policy (SAC, TD3, DQN):

# General rule: 1-4 environments
# More doesn't help as much (replay buffer provides diversity)
n_envs = 4
env = make_vec_env("Pendulum-v1", n_envs=n_envs)

model = SAC("MlpPolicy", env, gradient_steps=-1)  # 1 grad step per env step

CPU Core Utilization

import multiprocessing

# Use one less than total cores (leave one for Python main process)
n_cpus = multiprocessing.cpu_count() - 1
env = make_vec_env("MyEnv-v0", n_envs=n_cpus, vec_env_cls=SubprocVecEnv)

Memory Considerations

# Large replay buffer + many environments = high memory usage
# Reduce buffer size if memory constrained
model = SAC(
    "MlpPolicy",
    env,
    buffer_size=100_000,  # Reduced from 1M
)

Common Issues

Issue: "Can't pickle local object"

Cause: SubprocVecEnv requires picklable environments.

Solution: Define environment creation outside class/function:

import gymnasium as gym
from stable_baselines3.common.vec_env import SubprocVecEnv

# Bad: make_env is a local function and cannot be pickled by spawn/forkserver
def train():
    def make_env():
        return gym.make("CartPole-v1")
    env = SubprocVecEnv([make_env for _ in range(4)])

# Good: module-level factory function
def make_env():
    return gym.make("CartPole-v1")

if __name__ == "__main__":
    env = SubprocVecEnv([make_env for _ in range(4)])

Issue: Different behavior between single and vectorized env

Cause: Auto-reset in vectorized environments.

Solution: Handle terminal observations correctly:

obs, rewards, dones, infos = env.step(actions)
for i, done in enumerate(dones):
    if done:
        terminal_obs = infos[i]["terminal_observation"]
        # Process terminal_obs if needed

Issue: Slower with SubprocVecEnv than DummyVecEnv

Cause: Environment too lightweight (multiprocessing overhead > computation).

Solution: Use DummyVecEnv for simple environments:

# For CartPole, use DummyVecEnv
env = make_vec_env("CartPole-v1", n_envs=8, vec_env_cls=DummyVecEnv)

Issue: Training crashes with SubprocVecEnv

Cause: Environment not properly isolated or has shared state.

Solution:

  • Ensure environment has no shared global state
  • Wrap code in if __name__ == "__main__":
  • Use DummyVecEnv for debugging

Best Practices

  1. Use appropriate VecEnv type:

    • DummyVecEnv: Simple environments (CartPole, basic grids)
    • SubprocVecEnv: Complex environments (MuJoCo, Unity, 3D games)
  2. Adjust hyperparameters for vectorization:

    • Divide eval_freq and save_freq by n_envs in callbacks (see the callback sketch after this list)
    • Keep n_steps * n_envs roughly constant for on-policy algorithms
  3. Save normalization statistics:

    • Always save VecNormalize stats with model
    • Disable training during evaluation
  4. Monitor memory usage:

    • More environments = more memory
    • Reduce buffer size if needed
  5. Test with DummyVecEnv first:

    • Easier debugging
    • Ensure environment works before parallelizing
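
For point 2, a sketch of adjusting eval_freq in EvalCallback (eval_freq is counted in calls to env.step(), and each call advances n_envs steps):

from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import EvalCallback
from stable_baselines3.common.env_util import make_vec_env

n_envs = 8
env = make_vec_env("CartPole-v1", n_envs=n_envs)
eval_env = make_vec_env("CartPole-v1", n_envs=1)

# Evaluate roughly every 10,000 environment steps
eval_callback = EvalCallback(eval_env, eval_freq=max(10_000 // n_envs, 1))

model = PPO("MlpPolicy", env)
model.learn(total_timesteps=100_000, callback=eval_callback)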

Examples

Basic Training Loop

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv

# Create vectorized environment
env = make_vec_env("CartPole-v1", n_envs=8, vec_env_cls=SubprocVecEnv)

# Train
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100000)

# Evaluate
obs = env.reset()
for _ in range(1000):
    action, _states = model.predict(obs, deterministic=True)
    obs, rewards, dones, infos = env.step(action)

With Normalization

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecNormalize

# Create and normalize
env = make_vec_env("Pendulum-v1", n_envs=4)
env = VecNormalize(env, norm_obs=True, norm_reward=True)

# Train
model = PPO("MlpPolicy", env)
model.learn(total_timesteps=50000)

# Save both
model.save("model")
env.save("vec_normalize.pkl")

# Load for evaluation
eval_env = make_vec_env("Pendulum-v1", n_envs=1)
eval_env = VecNormalize.load("vec_normalize.pkl", eval_env)
eval_env.training = False
eval_env.norm_reward = False

model = PPO.load("model", env=eval_env)
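
To close the loop, the loaded model can be evaluated with evaluate_policy (a sketch):

from stable_baselines3.common.evaluation import evaluate_policy

mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10)
print(f"Mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")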

Additional Resources