# PPO BipedalWalker-v3 [A100 SOTA / Score: 314.47]
This model is a state-of-the-art (SOTA) agent for BipedalWalker-v3, trained using Stable Baselines3 on an NVIDIA A100 GPU.
Achieving a mean reward of 314.47 with an extremely low standard deviation (+/- 1.41), this agent demonstrates a highly optimized running gait that comfortably exceeds the 300-point "solved" threshold.
## Performance Highlights
| Metric | Score | Description |
|---|---|---|
| Mean Reward | 314.47 | Far exceeds the "solved" threshold of 300. |
| Reward Std. | +/- 1.41 | Extremely consistent behavior across evaluation episodes. |
| Training Speed | ~4,000 FPS | Trained on NVIDIA A100 using 16 parallel environments. |
## Agent Behavior
(Please check the Files tab for the replay video if not displayed automatically)
Observation: The agent has converged on a low-center-of-mass "gliding" gait. This strategy minimizes head oscillation penalties and torque usage, allowing it to complete episodes rapidly (approx. 1000 steps) without falling. This is a common optimal policy for high-scoring agents in this specific environment.
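If the embedded video is unavailable, a replay can be regenerated locally. The snippet below is a sketch and not part of this repository: it assumes the model archive has been downloaded from the Hub, and it uses gymnasium's `RecordVideo` wrapper to write an MP4 of one deterministic episode.

```python
import gymnasium as gym
from gymnasium.wrappers import RecordVideo
from stable_baselines3 import PPO

# Assumes the model .zip has been downloaded locally from the Hub (path is illustrative)
model = PPO.load("ppo-BipedalWalker-v3-A100-SOTA")

# Wrap an rgb_array environment so frames can be captured to disk
env = RecordVideo(
    gym.make("BipedalWalker-v3", render_mode="rgb_array"),
    video_folder="replays",           # output folder for the .mp4
    episode_trigger=lambda ep: True,  # record every episode
)

obs, _ = env.reset()
done = False
while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, _, terminated, truncated, _ = env.step(action)
    done = terminated or truncated

env.close()  # flushes the recorded video to disk
```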
## Training Configuration (A100 Optimized)
The hyperparameters were tuned to saturate the A100's compute capacity with a large batch size, while keeping updates conservative enough for precise motor control.
```python
# Key hyperparameters for the A100 run
n_envs = 16            # Parallel environments to feed the GPU
n_steps = 2048         # Rollout length per environment
batch_size = 16384     # 1024 * 16: large minibatch for stable updates
learning_rate = 3e-4
ent_coef = 0.001       # Low entropy bonus for precision control
clip_range = 0.18
gamma = 0.99
gae_lambda = 0.95
policy = "MlpPolicy"   # net_arch = [256, 256]
device = "cuda"        # NVIDIA A100-SXM4-80GB
```
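For context, here is a minimal sketch of how these settings could be plugged into a Stable Baselines3 training run. It is an assumption about the setup, not the exact script used for this model: the vectorization helper (`make_vec_env`), the timestep budget, and the save path are illustrative.

```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

# 16 parallel copies of the environment to keep the GPU fed
vec_env = make_vec_env("BipedalWalker-v3", n_envs=16)

model = PPO(
    "MlpPolicy",
    vec_env,
    n_steps=2048,
    batch_size=16384,
    learning_rate=3e-4,
    ent_coef=0.001,
    clip_range=0.18,
    gamma=0.99,
    gae_lambda=0.95,
    policy_kwargs=dict(net_arch=[256, 256]),
    device="cuda",
    verbose=1,
)

model.learn(total_timesteps=5_000_000)  # illustrative budget, not the exact one used
model.save("ppo-BipedalWalker-v3-A100-SOTA")
```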
## Usage
```python
import gymnasium as gym
from stable_baselines3 import PPO

# Load the SOTA model
# (download the .zip from the Hub first, e.g. via huggingface_sb3.load_from_hub)
model = PPO.load("beachcities/ppo-BipedalWalker-v3-A100-SOTA")

# Create environment
env = gym.make("BipedalWalker-v3", render_mode="human")

# Enjoy the perfect run
obs, _ = env.reset()
done = False
while not done:
    # Use deterministic=True for best performance
    action, _ = model.predict(obs, deterministic=True)
    obs, _, terminated, truncated, _ = env.step(action)
    done = terminated or truncated

env.close()
```
## Evaluation Results
- mean_reward on BipedalWalker-v3 (self-reported): 314.47 +/- 1.41
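These figures should be reproducible with Stable Baselines3's `evaluate_policy` helper. The snippet below is a sketch under the same loading assumption as the Usage section; the 10-episode count is an assumption, and results may vary slightly with environment seeding.

```python
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

# Assumes the model .zip has been downloaded locally from the Hub
model = PPO.load("ppo-BipedalWalker-v3-A100-SOTA")
eval_env = gym.make("BipedalWalker-v3")

# Average deterministic return over several episodes
mean_reward, std_reward = evaluate_policy(
    model, eval_env, n_eval_episodes=10, deterministic=True
)
print(f"mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")
```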