Simulation Training Environment
This chapter documents the training environment used for Asimov locomotion, including policy rate, observation timing, and actuator delay.
1. Training environment structure
The training stack is organized around a MuJoCo-based environment with:
- 200 Hz physics integration
- 200 Hz IO-side state handling
- 50 Hz policy execution
- delayed actuator and observation paths
- asymmetric actor-critic observations
- passive toe dynamics in the plant model
The environment is intentionally designed to avoid an idealized control path.
In this chapter, IO-side state handling means the observation and actuator-side update loop. It carries raw motor-state timing, observation delay, and related control-path computations before the policy runs.
Representative physics settings inherited by the legs stack include:
| Setting | Value |
|---|---|
| physics timestep | 5 ms |
| policy decimation | 4 |
| policy rate | 50 Hz |
| solver iterations | 10 |
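As a minimal sketch of how these settings interact, the loop below steps MuJoCo physics four times per policy query, so each 50 Hz action is held and reprocessed by the 200 Hz IO-side path. The `policy` and `get_obs` callables and the control wiring are illustrative placeholders, not the stack's actual API.

```python
import mujoco

PHYSICS_DT = 0.005   # 5 ms physics timestep -> 200 Hz integration
DECIMATION = 4       # physics substeps per policy step -> 50 Hz policy rate

def policy_step(model, data, policy, get_obs):
    """One 50 Hz policy step spanning four 200 Hz physics substeps."""
    action = policy(get_obs(data))  # policy runs once per DECIMATION substeps
    for _ in range(DECIMATION):
        # The IO-side path reprocesses the held action every substep,
        # so delay, saturation, etc. happen at 200 Hz.
        data.ctrl[:] = action
        mujoco.mj_step(model, data)
```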
2. Observation timing and grouping
Joint observations are grouped to reflect the real CAN polling order. This means the policy does not receive all joint states as equally fresh data.
| Observation group | Typical freshness |
|---|---|
| group 1 | oldest |
| group 2 | intermediate |
| group 3 | freshest |
Representative delay settings are:
| Group | Delay range |
|---|---|
| group 1 | 0-2 steps |
| group 2 | 0-1 steps |
| group 3 | 0 steps |
This grouped delay structure is one of the main sim2real features of the stack.
In implementation terms, the oldest joint group reflects the earliest CAN reads, the middle group reflects intermediate bus timing, and the freshest group reflects the last motor-state packets available in the loop.
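A minimal sketch of this grouping, assuming the delay is drawn per policy step in whole steps; the buffer API and group-to-joint mapping are illustrative, not the stack's actual implementation.

```python
from collections import deque

import numpy as np

# Delay ranges per group, in policy steps, following the table above.
GROUP_DELAY_RANGES = {1: (0, 2), 2: (0, 1), 3: (0, 0)}

class GroupedDelayBuffer:
    """History buffer that serves a randomly stale copy of one joint group."""

    def __init__(self, max_delay: int):
        self.history = deque(maxlen=max_delay + 1)

    def push(self, joint_state: np.ndarray) -> None:
        self.history.appendleft(joint_state.copy())

    def read(self, delay_range: tuple, rng: np.random.Generator) -> np.ndarray:
        delay = rng.integers(delay_range[0], delay_range[1] + 1)
        delay = min(delay, len(self.history) - 1)  # clamp right after reset
        return self.history[delay]
```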
3. Observation noise
The observation model includes moderate noise terms that reflect sensor and estimation uncertainty.
Representative values include:
| Quantity | Noise |
|---|---|
| IMU angular velocity | +/-0.01 rad/s |
| projected gravity | +/-0.05 (unitless) |
| joint position | +/-0.01 rad |
| joint velocity | +/-0.1 rad/s |
The goal is not to flood the policy with noise. The goal is to expose it to realistic sensing quality.
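One plausible realization applies the bounds above as additive uniform noise; the distribution itself is not specified here, so treat it as an assumption of this sketch.

```python
import numpy as np

# Representative noise bounds from the table above.
NOISE_BOUNDS = {
    "ang_vel": 0.01,     # rad/s, IMU angular velocity
    "gravity": 0.05,     # unitless, projected gravity direction
    "joint_pos": 0.01,   # rad
    "joint_vel": 0.1,    # rad/s
}

def corrupt(obs: np.ndarray, bound: float, rng: np.random.Generator) -> np.ndarray:
    """Add bounded noise, e.g. corrupt(q, NOISE_BOUNDS["joint_pos"], rng)."""
    return obs + rng.uniform(-bound, bound, size=obs.shape)
```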
4. Built-in versus IO-rate actuator computation
One practical lesson from early experiments was that a pristine built-in actuator path can expose the policy to cleaner data than the real system will ever produce.
In contrast, the final control path intentionally allowed the policy to experience:
- IO-rate computation at 200 Hz
- slower policy output at 50 Hz
- numerical roughness from the actual action and observation update path
This was important because the policy needed to tolerate the same class of stale, imperfect signals it would see during deployment.
5. Actuator interface
The action interface is a joint-space command interface over the 12 actuated leg joints.
Important characteristics:
- DC motor actuator model with per-joint parameters
- explicit actuator delay
- torque-speed saturation
- friction model
- policy actions applied through a slower policy loop than the physics loop
Actuator delay settings and the full motor model are documented in Deep Dive: System Identification.
The action scaling rule preserved in the stack is:
action_scale = 0.30 * effort / stiffness
Another important implementation detail is that actuator damping comes from the controller path rather than from fixed XML damping on actuated joints. Default XML damping and friction-loss terms are removed for those joints to avoid double-counting dissipation.
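Both rules translate directly into code; a sketch, where `effort`, `stiffness`, and `leg_dof_ids` are placeholders for the per-joint motor parameters and the actuated DoF indices.

```python
import numpy as np

def action_scale(effort: np.ndarray, stiffness: np.ndarray) -> np.ndarray:
    """Per-joint scaling: action_scale = 0.30 * effort / stiffness."""
    return 0.30 * effort / stiffness

def strip_xml_dissipation(model, leg_dof_ids) -> None:
    """Zero XML damping and friction loss on the actuated DoFs.

    Damping is supplied by the controller path instead, so leaving the
    XML terms in place would double-count dissipation.
    """
    model.dof_damping[leg_dof_ids] = 0.0
    model.dof_frictionloss[leg_dof_ids] = 0.0
```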
6. Contact and toe observables
The environment tracks foot contact state, contact forces, foot air time, and toe position/velocity. These signals are used by the critic and reward system even when not exposed to the deployable actor. The toe model and its role in training are described in Deep Dive: System Identification.
Representative contact thresholds:
| Quantity | Threshold |
|---|---|
| foot contact observation | 5 N vertical-force threshold |
| reward-side contact helper | 10 N threshold |
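A sketch of the two thresholded helpers, assuming `foot_force_z` holds per-foot vertical contact force in newtons; tying air-time tracking to the reward-side threshold is an assumption of this sketch.

```python
import numpy as np

OBS_CONTACT_THRESHOLD_N = 5.0      # observation-side contact flag
REWARD_CONTACT_THRESHOLD_N = 10.0  # reward-side contact helper

def contact_obs(foot_force_z: np.ndarray) -> np.ndarray:
    """Boolean contact flag used on the observation side."""
    return foot_force_z > OBS_CONTACT_THRESHOLD_N

def update_air_time(air_time: np.ndarray, foot_force_z: np.ndarray,
                    dt: float) -> np.ndarray:
    """Accumulate per-foot time since last contact; reset on touchdown."""
    in_contact = foot_force_z > REWARD_CONTACT_THRESHOLD_N
    return np.where(in_contact, 0.0, air_time + dt)
```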
7. Commands and operating envelope
The nominal command envelope for the legs locomotion stack is conservative:
| Command | Range |
|---|---|
| lin_vel_x | (-0.8, 0.8) |
| lin_vel_y | (-0.6, 0.6) |
| ang_vel_z | (-0.6, 0.6) |
This operating envelope is appropriate for early sim2real transfer and hardware bring-up.
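A minimal command sampler over this envelope; uniform resampling and the units (m/s, rad/s) are assumptions consistent with typical velocity-command stacks.

```python
import numpy as np

COMMAND_RANGES = {
    "lin_vel_x": (-0.8, 0.8),  # m/s (assumed)
    "lin_vel_y": (-0.6, 0.6),  # m/s (assumed)
    "ang_vel_z": (-0.6, 0.6),  # rad/s (assumed)
}

def sample_command(rng: np.random.Generator) -> dict:
    """Draw one velocity command uniformly from the envelope."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in COMMAND_RANGES.items()}
```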
8. Disturbances, resets, and play mode
The training environment also includes controlled perturbations and reset variation.
Representative settings include:
| Item | Representative setting |
|---|---|
| push disturbance timing | every 1-3 s |
| push magnitude | approximately +/-0.5 m/s in planar velocity, +/-1.5 rad/s in pitch rate |
| reset yaw perturbation | +/-pi rad |
| reset pitch perturbation | +/-0.15 rad |
| reset roll perturbation | +/-0.1 rad |
| bad-orientation termination | approximately 45 deg tilt |
Play mode disables policy corruption and push disturbance while preserving the rest of the deployment-relevant control path.
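These settings translate into straightforward sampling rules; a sketch, with all names illustrative and the pitch kick treated as an angular-rate perturbation.

```python
import numpy as np

def next_push_time(t: float, rng: np.random.Generator) -> float:
    """Schedule the next push 1-3 s ahead."""
    return t + rng.uniform(1.0, 3.0)

def apply_push(base_lin_vel, base_ang_vel, rng: np.random.Generator) -> None:
    base_lin_vel[:2] += rng.uniform(-0.5, 0.5, size=2)  # planar kick, m/s
    base_ang_vel[1] += rng.uniform(-1.5, 1.5)           # pitch-rate kick, rad/s

def sample_reset_orientation(rng: np.random.Generator):
    """Roll/pitch/yaw perturbations applied at episode reset (radians)."""
    roll = rng.uniform(-0.1, 0.1)
    pitch = rng.uniform(-0.15, 0.15)
    yaw = rng.uniform(-np.pi, np.pi)
    return roll, pitch, yaw
```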
9. PPO configuration
The training setup uses PPO with a conventional configuration.
| Parameter | Value |
|---|---|
| learning rate | 1e-3 |
| gamma | 0.99 |
| lambda | 0.95 |
| clip parameter | 0.2 |
| entropy coefficient | 0.01 |
| learning epochs | 5 |
| mini-batches | 4 |
| desired KL | 0.01 |
| max grad norm | 1.0 |
| rollout length | 24 |
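Collected as a plain config object for reference (field names are illustrative, not tied to a particular PPO library):

```python
from dataclasses import dataclass

@dataclass
class PPOConfig:
    learning_rate: float = 1e-3
    gamma: float = 0.99          # discount factor
    lam: float = 0.95            # GAE lambda
    clip_param: float = 0.2
    entropy_coef: float = 0.01
    num_learning_epochs: int = 5
    num_mini_batches: int = 4
    desired_kl: float = 0.01
    max_grad_norm: float = 1.0
    rollout_length: int = 24     # steps per environment per update
```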
The optimizer configuration is not the main differentiator of the stack. The environment fidelity and observation design are more important to transfer quality.