Simulation Training Environment
This chapter documents the training environment used for Asimov locomotion, including policy rate, observation timing, and actuator delay.
1. Training environment structure
The training stack is organized around a MuJoCo-based environment with:
- 200 Hz physics integration
- 200 Hz IO-side state handling
- 50 Hz policy execution
- delayed actuator and observation paths
- asymmetric actor-critic observations
- passive toe dynamics in the plant model
The environment is intentionally designed to avoid an idealized control path.
In this chapter, IO-side state handling means the observation and actuator-side update loop. It carries raw motor-state timing, observation delay, and related control-path computations before the policy runs.
Representative physics settings inherited by the legs stack include:
| Setting | Value |
|---|---|
| physics timestep | 5 ms |
| policy decimation | 4 |
| policy rate | 50 Hz |
| solver iterations | 10 |
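As a minimal sketch of how these settings interact, the loop below steps MuJoCo physics four times per policy query, so each 50 Hz action is held and reprocessed by the 200 Hz IO-side path. The `policy` and `get_obs` callables and the control wiring are illustrative placeholders, not the stack's actual API.

```python
import mujoco

PHYSICS_DT = 0.005   # 5 ms physics timestep -> 200 Hz integration
DECIMATION = 4       # physics substeps per policy step -> 50 Hz policy rate

def policy_step(model, data, policy, get_obs):
    """One 50 Hz policy step spanning four 200 Hz physics substeps."""
    action = policy(get_obs(data))  # policy runs once per DECIMATION substeps
    for _ in range(DECIMATION):
        # The IO-side path reprocesses the held action every substep,
        # so delay, saturation, etc. happen at 200 Hz.
        data.ctrl[:] = action
        mujoco.mj_step(model, data)
```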
2. Observation timing and grouping
Joint observations are grouped to reflect the real CAN polling order. This means the policy does not receive all joint states as equally fresh data.
| Observation group | Typical freshness |
|---|---|
| group 1 | oldest |
| group 2 | intermediate |
| group 3 | freshest |
Representative delay settings are:
| Group | Delay range |
|---|---|
| group 1 | 0-2 steps |
| group 2 | 0-1 steps |
| group 3 | 0 steps |
This grouped delay structure is one of the main sim2real features of the stack.
In implementation terms, the oldest joint group reflects the earliest CAN reads, the middle group reflects intermediate bus timing, and the freshest group reflects the last motor-state packets available in the loop.
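A minimal sketch of this grouping, assuming the delay is drawn per policy step in whole steps; the buffer API and group-to-joint mapping are illustrative, not the stack's actual implementation.

```python
from collections import deque

import numpy as np

# Delay ranges per group, in policy steps, following the table above.
GROUP_DELAY_RANGES = {1: (0, 2), 2: (0, 1), 3: (0, 0)}

class GroupedDelayBuffer:
    """History buffer that serves a randomly stale copy of one joint group."""

    def __init__(self, max_delay: int):
        self.history = deque(maxlen=max_delay + 1)

    def push(self, joint_state: np.ndarray) -> None:
        self.history.appendleft(joint_state.copy())

    def read(self, delay_range: tuple, rng: np.random.Generator) -> np.ndarray:
        delay = rng.integers(delay_range[0], delay_range[1] + 1)
        delay = min(delay, len(self.history) - 1)  # clamp right after reset
        return self.history[delay]
```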
3. Observation noise
The observation model includes moderate noise terms that reflect sensor and estimation uncertainty.
Representative values include:
| Quantity | Noise |
|---|---|
| IMU angular velocity | +/-0.01 rad/s |
| projected gravity | +/-0.05 (unitless) |
| joint position | +/-0.01 rad |
| joint velocity | +/-0.1 rad/s |
The goal is not to flood the policy with noise. The goal is to expose it to realistic sensing quality.
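One plausible realization applies the bounds above as additive uniform noise; the distribution itself is not specified here, so treat it as an assumption of this sketch.

```python
import numpy as np

# Representative noise bounds from the table above.
NOISE_BOUNDS = {
    "ang_vel": 0.01,     # rad/s, IMU angular velocity
    "gravity": 0.05,     # unitless, projected gravity direction
    "joint_pos": 0.01,   # rad
    "joint_vel": 0.1,    # rad/s
}

def corrupt(obs: np.ndarray, bound: float, rng: np.random.Generator) -> np.ndarray:
    """Add bounded noise, e.g. corrupt(q, NOISE_BOUNDS["joint_pos"], rng)."""
    return obs + rng.uniform(-bound, bound, size=obs.shape)
```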
4. Built-in versus IO-rate actuator computation
One practical lesson from early experiments was that a pristine built-in actuator path can expose the policy to cleaner data than the real system will ever produce.
In contrast, the final control path intentionally allowed the policy to experience:
- IO-rate computation at 200 Hz
- slower policy output at 50 Hz
- numerical roughness from the actual action and observation update path
This was important because the policy needed to tolerate the same class of stale, imperfect signals it would see during deployment.
5. Actuator interface
The action interface is a joint-space command interface over the 12 actuated leg joints.
Important characteristics:
- DC motor actuator model with per-joint parameters
- explicit actuator delay
- torque-speed saturation
- friction model
- policy actions applied through a slower policy loop than the physics loop
Actuator delay settings and the full motor model are documented in Deep Dive: System Identification.
The action scaling rule preserved in the stack is:
action_scale = 0.30 * effort / stiffness
Another important implementation detail is that actuator damping comes from the controller path rather than from fixed XML damping on actuated joints. Default XML damping and friction-loss terms are removed for those joints to avoid double-counting dissipation.
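Both rules translate directly into code; a sketch, where `effort`, `stiffness`, and `leg_dof_ids` are placeholders for the per-joint motor parameters and the actuated DoF indices.

```python
import numpy as np

def action_scale(effort: np.ndarray, stiffness: np.ndarray) -> np.ndarray:
    """Per-joint scaling: action_scale = 0.30 * effort / stiffness."""
    return 0.30 * effort / stiffness

def strip_xml_dissipation(model, leg_dof_ids) -> None:
    """Zero XML damping and friction loss on the actuated DoFs.

    Damping is supplied by the controller path instead, so leaving the
    XML terms in place would double-count dissipation.
    """
    model.dof_damping[leg_dof_ids] = 0.0
    model.dof_frictionloss[leg_dof_ids] = 0.0
```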
6. Contact and toe observables
The environment tracks foot contact state, contact forces, foot air time, and toe position/velocity. These signals are used by the critic and reward system even when not exposed to the deployable actor. The toe model and its role in training are described in Deep Dive: System Identification.
Representative contact thresholds:
| Quantity | Threshold |
|---|---|
| foot contact observation | 5 N vertical-force threshold |
| reward-side contact helper | 10 N threshold |
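A sketch of the two thresholded helpers, assuming `foot_force_z` holds per-foot vertical contact force in newtons; tying air-time tracking to the reward-side threshold is an assumption of this sketch.

```python
import numpy as np

OBS_CONTACT_THRESHOLD_N = 5.0      # observation-side contact flag
REWARD_CONTACT_THRESHOLD_N = 10.0  # reward-side contact helper

def contact_obs(foot_force_z: np.ndarray) -> np.ndarray:
    """Boolean contact flag used on the observation side."""
    return foot_force_z > OBS_CONTACT_THRESHOLD_N

def update_air_time(air_time: np.ndarray, foot_force_z: np.ndarray,
                    dt: float) -> np.ndarray:
    """Accumulate per-foot time since last contact; reset on touchdown."""
    in_contact = foot_force_z > REWARD_CONTACT_THRESHOLD_N
    return np.where(in_contact, 0.0, air_time + dt)
```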
7. Commands and operating envelope
The nominal command envelope for the legs locomotion stack is conservative:
| Command | Range |
|---|---|
| lin_vel_x | (-0.8, 0.8) |
| lin_vel_y | (-0.6, 0.6) |
| ang_vel_z | (-0.6, 0.6) |
This operating envelope is appropriate for early sim2real transfer and hardware bring-up.
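A minimal command sampler over this envelope; uniform resampling and the units (m/s, rad/s) are assumptions consistent with typical velocity-command stacks.

```python
import numpy as np

COMMAND_RANGES = {
    "lin_vel_x": (-0.8, 0.8),  # m/s (assumed)
    "lin_vel_y": (-0.6, 0.6),  # m/s (assumed)
    "ang_vel_z": (-0.6, 0.6),  # rad/s (assumed)
}

def sample_command(rng: np.random.Generator) -> dict:
    """Draw one velocity command uniformly from the envelope."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in COMMAND_RANGES.items()}
```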
8. Disturbances, resets, and play mode
The training environment also includes controlled perturbations and reset variation.
Representative settings include:
| Item | Representative setting |
|---|---|
| push disturbance timing | every 1-3 s |
| push magnitude | approximately +/-0.5 m/s in planar velocity, +/-1.5 rad/s in pitch rate |
| reset yaw perturbation | +/-pi rad |
| reset pitch perturbation | +/-0.15 rad |
| reset roll perturbation | +/-0.1 rad |
| bad-orientation termination | approximately 45 deg tilt |
Play mode disables policy corruption and push disturbance while preserving the rest of the deployment-relevant control path.
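These settings translate into straightforward sampling rules; a sketch, with all names illustrative and the pitch kick treated as an angular-rate perturbation.

```python
import numpy as np

def next_push_time(t: float, rng: np.random.Generator) -> float:
    """Schedule the next push 1-3 s ahead."""
    return t + rng.uniform(1.0, 3.0)

def apply_push(base_lin_vel, base_ang_vel, rng: np.random.Generator) -> None:
    base_lin_vel[:2] += rng.uniform(-0.5, 0.5, size=2)  # planar kick, m/s
    base_ang_vel[1] += rng.uniform(-1.5, 1.5)           # pitch-rate kick, rad/s

def sample_reset_orientation(rng: np.random.Generator):
    """Roll/pitch/yaw perturbations applied at episode reset (radians)."""
    roll = rng.uniform(-0.1, 0.1)
    pitch = rng.uniform(-0.15, 0.15)
    yaw = rng.uniform(-np.pi, np.pi)
    return roll, pitch, yaw
```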
9. PPO configuration
The training setup uses PPO with a conventional configuration.
| Parameter | Value |
|---|---|
| learning rate | 1e-3 |
| gamma | 0.99 |
| lambda | 0.95 |
| clip parameter | 0.2 |
| entropy coefficient | 0.01 |
| learning epochs | 5 |
| mini-batches | 4 |
| desired KL | 0.01 |
| max grad norm | 1.0 |
| rollout length | 24 |
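Collected as a plain config object for reference (field names are illustrative, not tied to a particular PPO library):

```python
from dataclasses import dataclass

@dataclass
class PPOConfig:
    learning_rate: float = 1e-3
    gamma: float = 0.99          # discount factor
    lam: float = 0.95            # GAE lambda
    clip_param: float = 0.2
    entropy_coef: float = 0.01
    num_learning_epochs: int = 5
    num_mini_batches: int = 4
    desired_kl: float = 0.01
    max_grad_norm: float = 1.0
    rollout_length: int = 24     # steps per environment per update
```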
The optimizer configuration is not the main differentiator of the stack. The environment fidelity and observation design are more important to transfer quality.