Vectorized Environments in Stable Baselines3
This document provides comprehensive information about vectorized environments in Stable Baselines3 for efficient parallel training.
Overview
Vectorized environments stack multiple independent environment instances into a single environment that processes actions and observations in batches. Instead of interacting with one environment at a time, you interact with n environments simultaneously.
Benefits:
- Speed: Parallel execution significantly accelerates training
- Sample efficiency: Collect more diverse experiences faster
- Required for: Frame stacking and normalization wrappers
- Better for: On-policy algorithms (PPO, A2C)
VecEnv Types
DummyVecEnv
Executes environments sequentially on the current Python process.
from stable_baselines3.common.vec_env import DummyVecEnv
# Method 1: Using make_vec_env
from stable_baselines3.common.env_util import make_vec_env
env = make_vec_env("CartPole-v1", n_envs=4, vec_env_cls=DummyVecEnv)
# Method 2: Manual creation
import gymnasium as gym

def make_env():
    def _init():
        return gym.make("CartPole-v1")
    return _init

env = DummyVecEnv([make_env() for _ in range(4)])
When to use:
- Lightweight environments (CartPole, simple grids)
- When multiprocessing overhead > computation time
- Debugging (easier to trace errors)
- Single-threaded environments
Performance: No actual parallelism (sequential execution).
SubprocVecEnv
Executes each environment in a separate process, enabling true parallelism.
from stable_baselines3.common.vec_env import SubprocVecEnv
from stable_baselines3.common.env_util import make_vec_env
env = make_vec_env("CartPole-v1", n_envs=8, vec_env_cls=SubprocVecEnv)
When to use:
- Computationally expensive environments (physics simulations, 3D games)
- When environment computation time justifies multiprocessing overhead
- When you need true parallel execution
Important: Requires wrapping code in if __name__ == "__main__": when using forkserver or spawn:
if __name__ == "__main__":
    env = make_vec_env("CartPole-v1", n_envs=8, vec_env_cls=SubprocVecEnv)
    model = PPO("MlpPolicy", env)
    model.learn(total_timesteps=100000)
Performance: True parallelism across CPU cores.
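If you need to control the multiprocessing start method explicitly (e.g. "spawn" on platforms where the default causes issues), it can be forwarded to SubprocVecEnv. A minimal sketch using make_vec_env's vec_env_kwargs parameter to pass SubprocVecEnv's start_method argument; treat the choice of "spawn" as illustrative:
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv

if __name__ == "__main__":
    # Forward start_method to the SubprocVecEnv constructor
    env = make_vec_env(
        "CartPole-v1",
        n_envs=8,
        vec_env_cls=SubprocVecEnv,
        vec_env_kwargs={"start_method": "spawn"},
    )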
Quick Setup with make_vec_env
The easiest way to create vectorized environments:
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv
# Basic usage
env = make_vec_env("CartPole-v1", n_envs=4)
# With SubprocVecEnv
env = make_vec_env("CartPole-v1", n_envs=8, vec_env_cls=SubprocVecEnv)
# With custom environment kwargs
env = make_vec_env(
    "MyEnv-v0",
    n_envs=4,
    env_kwargs={"difficulty": "hard", "max_steps": 500},
)
# With custom seed
env = make_vec_env("CartPole-v1", n_envs=4, seed=42)
API Differences from Standard Gym
Vectorized environments have a different API than standard Gym environments:
reset()
Standard Gym:
obs, info = env.reset()
VecEnv:
obs = env.reset() # Returns only observations (numpy array)
# Access info via env.reset_infos (list of dicts)
infos = env.reset_infos
step()
Standard Gym:
obs, reward, terminated, truncated, info = env.step(action)
VecEnv:
obs, rewards, dones, infos = env.step(actions)
# Returns 4-tuple instead of 5-tuple
# dones = terminated | truncated
# actions is an array of shape (n_envs,) or (n_envs, action_dim)
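If you still need to distinguish termination from truncation, SB3 stores that information in the info dicts. A short sketch relying on the "TimeLimit.truncated" key that SB3's VecEnvs set when an episode ends by truncation rather than termination:
obs, rewards, dones, infos = env.step(actions)
for i, done in enumerate(dones):
    if done:
        truncated = infos[i].get("TimeLimit.truncated", False)
        terminated = not truncated  # done but not truncated => true termination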
Auto-reset
VecEnv automatically resets environments when episodes end:
import numpy as np

obs = env.reset()  # Shape: (n_envs, obs_dim)
for _ in range(1000):
    # action_space is that of a single sub-environment, so sample one action per env
    actions = np.array([env.action_space.sample() for _ in range(env.num_envs)])
    obs, rewards, dones, infos = env.step(actions)
    # If dones[i] is True, env i was automatically reset
    # The final observation before the reset is available in infos[i]["terminal_observation"]
Terminal Observations
When an episode ends, access the true final observation:
obs, rewards, dones, infos = env.step(actions)
for i, done in enumerate(dones):
    if done:
        # obs[i] is already the observation after the automatic reset
        # The true terminal observation is stored in the info dict
        terminal_obs = infos[i]["terminal_observation"]
        print(f"Episode ended with terminal observation: {terminal_obs}")
Training with Vectorized Environments
On-Policy Algorithms (PPO, A2C)
On-policy algorithms benefit greatly from vectorization:
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv
# Create vectorized environment
env = make_vec_env("CartPole-v1", n_envs=8, vec_env_cls=SubprocVecEnv)
# Train
model = PPO("MlpPolicy", env, verbose=1, n_steps=128)
model.learn(total_timesteps=100000)
# With n_envs=8 and n_steps=128:
# - Collects 8*128=1024 steps per rollout
# - Updates after every 1024 steps
Rule of thumb: Use 4-16 parallel environments for on-policy methods.
Off-Policy Algorithms (SAC, TD3, DQN)
Off-policy algorithms can use vectorization but benefit less:
from stable_baselines3 import SAC
from stable_baselines3.common.env_util import make_vec_env
# Use fewer environments (1-4)
env = make_vec_env("Pendulum-v1", n_envs=4)
# Set gradient_steps=-1 for efficiency
model = SAC(
    "MlpPolicy",
    env,
    verbose=1,
    train_freq=1,
    gradient_steps=-1,  # Do 1 gradient step per env step (4 total with 4 envs)
)
model.learn(total_timesteps=50000)
Rule of thumb: Use 1-4 parallel environments for off-policy methods.
Wrappers for Vectorized Environments
VecNormalize
Normalizes observations and rewards using running statistics.
from stable_baselines3.common.vec_env import VecNormalize
env = make_vec_env("Pendulum-v1", n_envs=4)
# Wrap with normalization
env = VecNormalize(
    env,
    norm_obs=True,     # Normalize observations
    norm_reward=True,  # Normalize rewards
    clip_obs=10.0,     # Clip normalized observations
    clip_reward=10.0,  # Clip normalized rewards
    gamma=0.99,        # Discount factor for reward normalization
)
# Train
model = PPO("MlpPolicy", env)
model.learn(total_timesteps=50000)
# Save model AND normalization statistics
model.save("ppo_pendulum")
env.save("vec_normalize.pkl")
# Load for evaluation
env = make_vec_env("Pendulum-v1", n_envs=1)
env = VecNormalize.load("vec_normalize.pkl", env)
env.training = False # Don't update stats during evaluation
env.norm_reward = False # Don't normalize rewards during evaluation
model = PPO.load("ppo_pendulum", env=env)
When to use:
- Continuous control tasks (especially MuJoCo)
- When observation scales vary widely
- When rewards have high variance
Important:
- Statistics are NOT saved with the model - save them separately
- Disable training (statistics updates) and reward normalization during evaluation
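If you need the unnormalized values during training (e.g. for logging), VecNormalize keeps them available. A minimal sketch using its get_original_obs() and get_original_reward() helpers; the environment and random-action sampling here are illustrative:
import numpy as np
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecNormalize

env = VecNormalize(make_vec_env("Pendulum-v1", n_envs=4), norm_obs=True, norm_reward=True)
obs = env.reset()
actions = np.array([env.action_space.sample() for _ in range(env.num_envs)])
obs, rewards, dones, infos = env.step(actions)
raw_obs = env.get_original_obs()         # Unnormalized observations from the last step
raw_rewards = env.get_original_reward()  # Unnormalized rewards from the last step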
VecFrameStack
Stacks observations from multiple consecutive frames.
from stable_baselines3.common.vec_env import VecFrameStack
env = make_vec_env("PongNoFrameskip-v4", n_envs=8)
# Stack 4 frames
env = VecFrameStack(env, n_stack=4)
# Observations now stack the last 4 frames along the channel (last) axis,
# e.g. (n_envs, height, width, 4 * n_channels) for image observations
model = PPO("CnnPolicy", env)
model.learn(total_timesteps=1000000)
When to use:
- Atari games (stack 4 frames)
- Environments where velocity information is needed
- Partial observability problems
VecVideoRecorder
Records videos of agent behavior.
from stable_baselines3.common.vec_env import VecVideoRecorder
env = make_vec_env("CartPole-v1", n_envs=1)
# Record videos
env = VecVideoRecorder(
    env,
    video_folder="./videos/",
    record_video_trigger=lambda x: x % 2000 == 0,  # Record every 2000 steps
    video_length=200,  # Max video length (in steps)
    name_prefix="training",
)
model = PPO("MlpPolicy", env)
model.learn(total_timesteps=10000)
Output: MP4 videos in ./videos/ directory.
VecCheckNan
Checks for NaN or infinite values in observations and rewards.
from stable_baselines3.common.vec_env import VecCheckNan
env = make_vec_env("CustomEnv-v0", n_envs=4)
# Add NaN checking (useful for debugging)
env = VecCheckNan(env, raise_exception=True, warn_once=True)
model = PPO("MlpPolicy", env)
model.learn(total_timesteps=10000)
When to use:
- Debugging custom environments
- Catching numerical instabilities
- Validating environment implementation
VecTransposeImage
Transposes image observations from (height, width, channels) to (channels, height, width).
from stable_baselines3.common.vec_env import VecTransposeImage
env = make_vec_env("PongNoFrameskip-v4", n_envs=4)
# Convert HWC to CHW format
env = VecTransposeImage(env)
model = PPO("CnnPolicy", env)
When to use:
- When the environment returns images in HWC format (SB3's CNN policies expect CHW)
- Note: recent SB3 versions apply this wrapper automatically when they detect channels-last image observations, so manual wrapping is rarely required
Advanced Usage
Custom VecEnv
Create a custom vectorized environment by subclassing an existing VecEnv:
from stable_baselines3.common.vec_env import DummyVecEnv
import gymnasium as gym
class CustomVecEnv(DummyVecEnv):
    def step_wait(self):
        # Custom logic before/after stepping
        obs, rewards, dones, infos = super().step_wait()
        # Modify observations/rewards/etc.
        return obs, rewards, dones, infos
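For illustration, such a subclass is used exactly like DummyVecEnv; a short usage sketch for the CustomVecEnv defined above:
env = CustomVecEnv([lambda: gym.make("CartPole-v1") for _ in range(4)])
obs = env.reset()  # Same VecEnv API: reset() returns only the batched observations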
Environment Method Calls
Call methods on wrapped environments:
env = make_vec_env("MyEnv-v0", n_envs=4)
# Call method on all environments
env.env_method("set_difficulty", "hard")
# Call method on specific environment
env.env_method("reset_level", indices=[0, 2])
# Get attribute from all environments
levels = env.get_attr("current_level")
Setting Attributes
# Set attribute on all environments
env.set_attr("difficulty", "hard")
# Set attribute on specific environments
env.set_attr("max_steps", 1000, indices=[1, 3])
Performance Optimization
Choosing Number of Environments
On-Policy (PPO, A2C):
# General rule: 4-16 environments
# More environments = faster data collection
n_envs = 8
env = make_vec_env("CartPole-v1", n_envs=n_envs)
# Adjust n_steps to maintain same rollout length
# Total steps per rollout = n_envs * n_steps
model = PPO("MlpPolicy", env, n_steps=128) # 8*128 = 1024 steps/rollout
Off-Policy (SAC, TD3, DQN):
# General rule: 1-4 environments
# More doesn't help as much (replay buffer provides diversity)
n_envs = 4
env = make_vec_env("Pendulum-v1", n_envs=n_envs)
model = SAC("MlpPolicy", env, gradient_steps=-1) # 1 grad step per env step
CPU Core Utilization
import multiprocessing
# Use one less than total cores (leave one for Python main process)
n_cpus = multiprocessing.cpu_count() - 1
env = make_vec_env("MyEnv-v0", n_envs=n_cpus, vec_env_cls=SubprocVecEnv)
Memory Considerations
# Large replay buffer + many environments = high memory usage
# Reduce buffer size if memory constrained
model = SAC(
    "MlpPolicy",
    env,
    buffer_size=100_000,  # Reduced from 1M
)
Common Issues
Issue: "Can't pickle local object"
Cause: SubprocVecEnv requires picklable environments.
Solution: Define the environment-creation function at module level (outside any class or function):
# Bad: make_env is defined inside another function and cannot be pickled
def train():
    def make_env():
        return gym.make("CartPole-v1")
    env = SubprocVecEnv([make_env for _ in range(4)])

# Good: make_env is defined at module level
def make_env():
    return gym.make("CartPole-v1")

if __name__ == "__main__":
    env = SubprocVecEnv([make_env for _ in range(4)])
Issue: Different behavior between single and vectorized env
Cause: Auto-reset in vectorized environments.
Solution: Handle terminal observations correctly:
obs, rewards, dones, infos = env.step(actions)
for i, done in enumerate(dones):
    if done:
        terminal_obs = infos[i]["terminal_observation"]
        # Process terminal_obs if needed
Issue: Slower with SubprocVecEnv than DummyVecEnv
Cause: Environment too lightweight (multiprocessing overhead > computation).
Solution: Use DummyVecEnv for simple environments:
# For CartPole, use DummyVecEnv
env = make_vec_env("CartPole-v1", n_envs=8, vec_env_cls=DummyVecEnv)
Issue: Training crashes with SubprocVecEnv
Cause: Environment not properly isolated or has shared state.
Solution:
- Ensure environment has no shared global state
- Wrap code in if __name__ == "__main__":
- Use DummyVecEnv for debugging
Best Practices
1. Use appropriate VecEnv type:
   - DummyVecEnv: Simple environments (CartPole, basic grids)
   - SubprocVecEnv: Complex environments (MuJoCo, Unity, 3D games)
2. Adjust hyperparameters for vectorization:
   - Divide eval_freq and save_freq by n_envs in callbacks (see the sketch after this list)
   - Maintain the same n_steps * n_envs for on-policy algorithms
3. Save normalization statistics:
   - Always save VecNormalize stats with the model
   - Disable training during evaluation
4. Monitor memory usage:
   - More environments = more memory
   - Reduce buffer size if needed
5. Test with DummyVecEnv first:
   - Easier debugging
   - Ensure the environment works before parallelizing
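As noted in item 2, callback frequencies such as eval_freq count calls to env.step(), and each call advances all n_envs environments at once, so divide by n_envs to keep the same schedule in environment steps. A minimal sketch using EvalCallback; the target frequency of 10,000 environment steps is illustrative:
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import EvalCallback
from stable_baselines3.common.env_util import make_vec_env

n_envs = 8
env = make_vec_env("CartPole-v1", n_envs=n_envs)
eval_env = make_vec_env("CartPole-v1", n_envs=1)

# Evaluate roughly every 10,000 environment steps: divide by n_envs
eval_callback = EvalCallback(eval_env, eval_freq=max(10_000 // n_envs, 1))

model = PPO("MlpPolicy", env)
model.learn(total_timesteps=100_000, callback=eval_callback)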
Examples
Basic Training Loop
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv
# Create vectorized environment
env = make_vec_env("CartPole-v1", n_envs=8, vec_env_cls=SubprocVecEnv)
# Train
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100000)
# Evaluate
obs = env.reset()
for _ in range(1000):
    action, _states = model.predict(obs, deterministic=True)
    obs, rewards, dones, infos = env.step(action)
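For a more rigorous evaluation than the manual loop above, the evaluate_policy helper handles the episode bookkeeping; a short sketch, assuming a separate single-environment evaluation env:
from stable_baselines3.common.evaluation import evaluate_policy

eval_env = make_vec_env("CartPole-v1", n_envs=1)
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)
print(f"Mean reward: {mean_reward:.1f} +/- {std_reward:.1f}")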
With Normalization
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecNormalize
# Create and normalize
env = make_vec_env("Pendulum-v1", n_envs=4)
env = VecNormalize(env, norm_obs=True, norm_reward=True)
# Train
model = PPO("MlpPolicy", env)
model.learn(total_timesteps=50000)
# Save both
model.save("model")
env.save("vec_normalize.pkl")
# Load for evaluation
eval_env = make_vec_env("Pendulum-v1", n_envs=1)
eval_env = VecNormalize.load("vec_normalize.pkl", eval_env)
eval_env.training = False
eval_env.norm_reward = False
model = PPO.load("model", env=eval_env)
Additional Resources
- Official SB3 VecEnv Guide: https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html
- VecEnv API Reference: https://stable-baselines3.readthedocs.io/en/master/common/vec_env.html
- Multiprocessing Best Practices: https://docs.python.org/3/library/multiprocessing.html