
Simulation Training Environment

This chapter documents the training environment used for Asimov locomotion, including policy rate, observation timing, and actuator delay.

1. Training environment structure

The training stack is organized around a MuJoCo-based environment with:

  • 200 Hz physics integration
  • 200 Hz IO-side state handling
  • 50 Hz policy execution
  • delayed actuator and observation paths
  • asymmetric actor-critic observations
  • passive toe dynamics in the plant model

The environment is intentionally designed to avoid an idealized control path.

In this chapter, IO-side state handling means the observation- and actuator-side update loop: it handles raw motor-state timing, observation delay, and related control-path computations before the policy runs.

Representative physics settings inherited by the legs stack include:

Setting | Value
physics timestep | 5 ms
policy decimation | 4
policy rate | 50 Hz
solver iterations | 10
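
To make the timing concrete, a minimal sketch of the decimated loop is shown below. The function and method names (`physics_step`, `get_observations`) and the loop structure are illustrative assumptions, not the actual training code.

    # Minimal sketch of the decimated control loop (names are hypothetical).
    PHYSICS_DT = 0.005                    # 5 ms physics timestep -> 200 Hz
    DECIMATION = 4                        # physics/IO steps per policy step
    POLICY_DT = PHYSICS_DT * DECIMATION   # 0.02 s -> 50 Hz policy rate

    def run_episode(env, policy, num_policy_steps):
        obs = env.reset()
        for _ in range(num_policy_steps):
            action = policy(obs)             # policy runs at 50 Hz
            for _ in range(DECIMATION):      # physics + IO-side handling at 200 Hz
                env.physics_step(action)     # action is held between policy steps
            obs = env.get_observations()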

2. Observation timing and grouping

Joint observations are grouped to reflect the real CAN polling order. This means the policy does not receive all joint states as equally fresh data.

Observation group | Typical freshness
group 1 | oldest
group 2 | intermediate
group 3 | freshest

Representative delay settings are:

Group | Delay range
group 1 | 0-2 steps
group 2 | 0-1 steps
group 3 | 0 steps

This grouped delay structure is one of the main sim2real features of the stack.

In implementation terms, the oldest joint group reflects the earliest CAN reads, the middle group reflects intermediate bus timing, and the freshest group reflects the last motor-state packets available in the loop.
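
A minimal sketch of how such grouped delays can be buffered is shown below; the class and the sampling scheme are assumptions for illustration, with the delay ranges taken from the table above.

    # Sketch of per-group joint-state delay (hypothetical structure).
    import random
    from collections import deque

    GROUP_DELAY_STEPS = {1: (0, 2), 2: (0, 1), 3: (0, 0)}  # delay ranges in steps

    class DelayedJointObs:
        def __init__(self, max_delay=2):
            self.history = deque(maxlen=max_delay + 1)  # newest state first

        def push(self, joint_state):
            self.history.appendleft(joint_state)

        def read(self, group):
            lo, hi = GROUP_DELAY_STEPS[group]
            delay = min(random.randint(lo, hi), len(self.history) - 1)
            return self.history[delay]  # group 3 always reads the freshest state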

3. Observation noise

The observation model includes moderate noise terms that reflect sensor and estimation uncertainty.

Representative values include:

Quantity | Noise
IMU angular velocity | +/-0.01
projected gravity | +/-0.05
joint position | +/-0.01 rad
joint velocity | +/-0.1 rad/s

The goal is not to flood the policy with noise. The goal is to expose it to realistic sensing quality.
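
A sketch of how these terms might be applied is shown below, assuming additive noise sampled uniformly within the quoted ranges; the key names are illustrative.

    # Sketch of additive observation noise (uniform within +/- scale; assumed).
    import numpy as np

    NOISE_SCALES = {
        "imu_ang_vel": 0.01,    # IMU angular velocity
        "proj_gravity": 0.05,   # projected gravity
        "joint_pos": 0.01,      # rad
        "joint_vel": 0.1,       # rad/s
    }

    def corrupt(obs):
        """Add uniform noise to each observation term by its scale."""
        return {
            key: val + np.random.uniform(-NOISE_SCALES[key], NOISE_SCALES[key], val.shape)
            for key, val in obs.items()
        }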

4. Built-in versus IO-rate actuator computation

One practical lesson from early experiments was that a pristine built-in actuator path can expose the policy to cleaner data than the real system will ever produce.

In contrast, the final control path intentionally allowed the policy to experience:

  • IO-rate computation at 200 Hz
  • slower policy output at 50 Hz
  • numerical roughness from the actual action and observation update path

This was important because the policy needed to tolerate the same class of stale, imperfect signals it would see during deployment.
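
A sketch of this arrangement is shown below. It assumes a PD-style torque computation at the IO rate purely as a stand-in; the actual motor model is documented in Deep Dive: System Identification, and every name here is hypothetical.

    # Sketch: actuator-side computation at 200 Hz from a 50 Hz held action.
    def io_rate_substeps(env, held_action, decimation=4):
        for _ in range(decimation):
            q, qd = env.read_joint_state()  # possibly stale, grouped-delay state
            target = env.default_joint_pos + env.action_scale * held_action
            torque = env.kp * (target - q) - env.kd * qd  # PD stand-in only
            env.apply_torque(torque)
            env.step_physics()              # one 5 ms physics step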

5. Actuator interface

The action interface is a joint-space command interface over the 12 actuated leg joints.

Important characteristics:

  • DC motor actuator model with per-joint parameters
  • explicit actuator delay
  • torque-speed saturation
  • friction model
  • policy actions applied through a slower policy loop than the physics loop

Actuator delay settings and the full motor model are documented in Deep Dive: System Identification.

The action scaling rule preserved in the stack is:

action_scale = 0.30 * effort / stiffness
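
For illustration, the rule might be applied per joint as below; the numeric values and the mapping of actions to position offsets around a default pose are assumptions, not values from the stack.

    # Sketch of per-joint action scaling (illustrative values only).
    import numpy as np

    effort = np.array([23.0, 23.0, 45.0])     # hypothetical effort limits (N*m)
    stiffness = np.array([30.0, 30.0, 60.0])  # hypothetical stiffness (N*m/rad)

    action_scale = 0.30 * effort / stiffness  # the preserved scaling rule

    def joint_targets(default_pos, action):
        # Assumed convention: actions in [-1, 1] offset the default pose.
        return default_pos + action_scale * action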

Another important implementation detail is that actuator damping comes from the controller path rather than from fixed XML damping on actuated joints. Default XML damping and friction-loss terms are removed for those joints to avoid double-counting dissipation.

6. Contact and toe observables

The environment tracks foot contact state, contact forces, foot air time, and toe position/velocity. These signals are used by the critic and reward system even when not exposed to the deployable actor. The toe model and its role in training are described in Deep Dive: System Identification.

Representative contact thresholds:

Quantity | Threshold
foot contact observation | 5 N vertical-force threshold
reward-side contact helper | 10 N threshold
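
In code form, these thresholds amount to simple vertical-force comparisons; the names below are illustrative.

    # Sketch of the two contact thresholds (names hypothetical).
    CONTACT_OBS_THRESHOLD_N = 5.0      # contact observable: vertical force
    CONTACT_REWARD_THRESHOLD_N = 10.0  # reward-side contact helper

    def contact_observable(vertical_force_n):
        return vertical_force_n > CONTACT_OBS_THRESHOLD_N

    def contact_for_reward(vertical_force_n):
        return vertical_force_n > CONTACT_REWARD_THRESHOLD_N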

7. Commands and operating envelope

The nominal command envelope for the legs locomotion stack is conservative:

Command | Range
lin_vel_x | (-0.8, 0.8)
lin_vel_y | (-0.6, 0.6)
ang_vel_z | (-0.6, 0.6)

This operating envelope is appropriate for early sim2real transfer and hardware bring-up.
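
A minimal command-sampling sketch over this envelope is shown below; uniform resampling is an assumption.

    # Sketch of command sampling from the nominal envelope.
    import random

    COMMAND_RANGES = {
        "lin_vel_x": (-0.8, 0.8),
        "lin_vel_y": (-0.6, 0.6),
        "ang_vel_z": (-0.6, 0.6),
    }

    def sample_command():
        return {name: random.uniform(lo, hi)
                for name, (lo, hi) in COMMAND_RANGES.items()}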

8. Disturbances, resets, and play mode

The training environment also includes controlled perturbations and reset variation.

Representative settings include:

Item | Representative setting
push disturbance timing | every 1-3 s
push magnitude | approximately +/-0.5 m/s in planar velocity, +/-1.5 rad/s in pitch rate
reset yaw perturbation | +/-pi
reset pitch perturbation | +/-0.15 rad
reset roll perturbation | +/-0.1 rad
bad-orientation termination | approximately 45 deg tilt

Play mode disables policy corruption and push disturbance while preserving the rest of the deployment-relevant control path.
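
A sketch of the push schedule with a play-mode guard is shown below; the attribute names are hypothetical, and the pitch push is assumed to be an angular-velocity perturbation.

    # Sketch of the push-disturbance schedule (hypothetical names).
    import numpy as np

    def maybe_push(env, play_mode):
        if play_mode:
            return  # play mode disables push disturbances
        env.time_since_push += env.policy_dt
        if env.time_since_push >= env.next_push_in:
            env.base_lin_vel[:2] += np.random.uniform(-0.5, 0.5, size=2)   # m/s
            env.base_ang_vel[1] += np.random.uniform(-1.5, 1.5)            # pitch rate, assumed rad/s
            env.next_push_in = np.random.uniform(1.0, 3.0)                 # next push in 1-3 s
            env.time_since_push = 0.0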

9. PPO configuration

The training setup uses PPO with a conventional configuration.

Parameter | Value
learning rate | 1e-3
gamma | 0.99
lambda | 0.95
clip parameter | 0.2
entropy coefficient | 0.01
learning epochs | 5
mini-batches | 4
desired KL | 0.01
max grad norm | 1.0
rollout length | 24
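
Expressed as a plain configuration dictionary (key names are illustrative, not tied to a specific RL library), the table reads:

    PPO_CONFIG = {
        "learning_rate": 1e-3,
        "gamma": 0.99,
        "lam": 0.95,              # GAE lambda
        "clip_param": 0.2,
        "entropy_coef": 0.01,
        "num_learning_epochs": 5,
        "num_mini_batches": 4,
        "desired_kl": 0.01,
        "max_grad_norm": 1.0,
        "rollout_length": 24,
    }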

The optimizer configuration is not the main differentiator of the stack. The environment fidelity and observation design are more important to transfer quality.
