Deep Dive: System Identification

This chapter documents the hardware-to-simulation quantities that were identified or constrained for stable locomotion transfer.

1. Hardware mapping comes first

Before tuning rewards or training parameters, the joint-level hardware mapping must be correct.

For Asimov legs, the following items were especially important:

  • the ankle is not directly driven
  • ankle pitch and roll are produced through a parallel mechanism
  • toe behavior is passive and spring-driven
  • leg joints use different actuator families with different reflected inertia and torque-speed limits

Errors in this mapping produced unstable or seemingly random policy behavior.
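One way to catch mapping mistakes early is a small consistency check over a joint-to-motor table. The sketch below is illustrative only: the joint names, motor names, and the `JointDrive` structure are hypothetical and not taken from the Asimov stack. It encodes only the three facts listed above (direct-drive joints, the two-motor parallel ankle, the passive toe).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JointDrive:
    joint: str      # joint name (hypothetical)
    drive: str      # "direct", "parallel", or "passive"
    motors: tuple   # motors that contribute torque to this joint


# Hypothetical leg mapping; names are placeholders for illustration.
LEG_MAP = [
    JointDrive("hip_pitch", "direct", ("hip_pitch_motor",)),
    JointDrive("knee", "direct", ("knee_motor",)),
    JointDrive("ankle_pitch", "parallel", ("ankle_motor_a", "ankle_motor_b")),
    JointDrive("ankle_roll", "parallel", ("ankle_motor_a", "ankle_motor_b")),
    JointDrive("toe", "passive", ()),
]


def check_mapping(mapping):
    """Return a list of human-readable mapping errors (empty if consistent)."""
    errors = []
    for j in mapping:
        if j.drive == "direct" and len(j.motors) != 1:
            errors.append(f"{j.joint}: direct joints need exactly one motor")
        elif j.drive == "parallel" and len(j.motors) < 2:
            errors.append(f"{j.joint}: parallel joints need at least two motors")
        elif j.drive == "passive" and j.motors:
            errors.append(f"{j.joint}: passive joints must not list motors")
        elif j.drive not in ("direct", "parallel", "passive"):
            errors.append(f"{j.joint}: unknown drive type {j.drive!r}")
    return errors


print(check_mapping(LEG_MAP))
```

Running a check like this before training makes a wrongly actuated toe or a single-motor ankle fail loudly instead of surfacing later as erratic policy behavior.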

2. Armature as reflected rotor inertia

In this stack, joint armature should not be interpreted as the literal motor armature. It is used as a reflected inertia term that captures how the motor and gearbox appear at the joint.

This distinction matters because armature strongly affects closed-loop behavior and stability.

Representative identified values are:

| Joint family | Example value | Notes |
| --- | --- | --- |
| hip pitch | 0.095625 | From motor datasheet and transmission mapping |
| knee | 0.0339552 | From motor datasheet and transmission mapping |
| ankle | 0.0565056 | Doubled to reflect two motors driving the parallel ankle |

The ankle value required special treatment because pitch and roll are driven by two motors through the RSU ankle mechanism.
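As a rough illustration of how such values can be derived, the standard relation for inertia reflected through a gearbox is rotor inertia times the gear ratio squared. The sketch below applies that relation and the doubling described for the ankle. The rotor inertia and gear ratio in the example are placeholders, not the datasheet values behind the table, and the `n_motors` argument is a hypothetical convenience.

```python
def reflected_inertia(rotor_inertia, gear_ratio, n_motors=1):
    """Rotor inertia as seen at the joint: n * J_rotor * N^2 (kg*m^2)."""
    return n_motors * rotor_inertia * gear_ratio ** 2


# Placeholder numbers for illustration only (not from the motor datasheets).
single = reflected_inertia(rotor_inertia=1.0e-4, gear_ratio=10.0)
ankle = reflected_inertia(rotor_inertia=1.0e-4, gear_ratio=10.0, n_motors=2)
print(single, ankle)

# The tabulated ankle armature is described as doubled, so the
# per-motor contribution implied by the table is half of it.
print(0.0565056 / 2)
```

Because the gear ratio enters squared, a small error in the transmission mapping produces a large error in armature, which is one reason the mapping has to be settled first.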

3. KP/KD consistency between sim and hardware

Even with calculated KP/KD values, the robot vibrated on startup. Analysis showed that the root cause was not a hardware limitation: the real motor controllers worked fine. The problem was that the simulation KP/KD values produced an underdamped system, and the policy learned to behave accordingly.

The policy is simply an MLP. It approximates whatever non-linear function its training data and weight updates push it toward. When trained on data from an underdamped simulated system, the policy learned to behave like an underdamped controller. And how does an underdamped control system respond to an impulse? It oscillates.

This was verified mathematically: the policy's output behavior matched the impulse response of an underdamped second-order system.

The failure chain is:

  1. simulation KP/KD values produce underdamped dynamics
  2. the policy trains on this data and learns to behave like an underdamped controller
  3. on real hardware, the gains are fine, but the policy's learned behavior is already underdamped
  4. the policy's corrections overshoot, and each overshoot triggers a larger correction on the next cycle, exciting sustained oscillation

This realization was critical because it reframed the locomotion problem:

How can the sim and real data domains be made to match as closely as possible?

The practical lesson is not just about constraining gains to hardware limits. It is about ensuring that the simulated dynamics produce training data that matches the real system's response characteristics. If the sim data domain diverges from the real data domain, the policy will learn behavior that does not transfer, regardless of whether the individual parameter values are physically plausible.
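A quick way to sanity-check whether a KP/KD pair is underdamped is to treat a single joint as a second-order system and compute its damping ratio, zeta = KD / (2 * sqrt(KP * I)). The sketch below makes a strong simplifying assumption: the joint inertia I is taken to be the armature alone, ignoring link inertia, gravity, friction, and delay. Added link inertia lowers the damping ratio, so the real margin is smaller than this estimate. The gain and armature values are the representative ones quoted in this chapter. The KD of 1.0 is a made-up comparison point, not a value from the original simulation.

```python
import math


def damping_ratio(kp, kd, inertia):
    """Damping ratio of I*q'' + kd*q' + kp*q = 0."""
    return kd / (2.0 * math.sqrt(kp * inertia))


def natural_frequency(kp, inertia):
    """Undamped natural frequency in rad/s."""
    return math.sqrt(kp / inertia)


def is_underdamped(kp, kd, inertia):
    return damping_ratio(kp, kd, inertia) < 1.0


HIP_ARMATURE = 0.095625  # representative hip pitch armature from this chapter

# Representative stiffness and damping from the actuator parameters.
zeta = damping_ratio(kp=65.0, kd=5.0, inertia=HIP_ARMATURE)
print(f"zeta = {zeta:.3f}")  # close to critical damping under these assumptions

# Hypothetical low-damping setting for comparison only.
print(is_underdamped(kp=65.0, kd=1.0, inertia=HIP_ARMATURE))
```

A damping ratio well below 1 in simulation is a warning sign that the training data will carry oscillatory responses, which the policy can then reproduce on hardware.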

4. Motor model details that mattered

The actuator model includes more than simple PD control. The simulation stack models:

  • per-joint stiffness and damping
  • effort limits
  • speed-torque saturation
  • reflected inertia through armature
  • static and dynamic friction
  • explicit action delay

This richer actuator model was a significant part of the sim2real improvement.

Representative actuator parameters for the legs stack include:

| Parameter | Example value | Note |
| --- | --- | --- |
| stiffness | 65.0 | chosen as a safe deployable value |
| damping | 5.0 | tuned to match real system response |
| effort limit | 39.40 | peak torque for the modeled joint family |
| saturation effort | 120.0 | speed-torque saturation behavior |
| velocity limit | 12.57 rad/s | from motor specification |
| friction static | 1.30 | static friction term |
| friction dynamic | 0.100 | Coulomb-like dynamic friction term |

The simulated actuator path is then wrapped in an explicit delay model with delay_min_lag=0 and delay_max_lag=1.
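Ahead of the delay stage, the PD and saturation part of such an actuator can be sketched as follows. This is a minimal illustration, not the stack's implementation: it assumes a linear torque-speed curve in which the available torque falls from the saturation effort at zero speed to zero at the velocity limit, and it leaves out the friction terms. The function name and signature are hypothetical. The default values are the representative ones listed in this section.

```python
def motor_torque(q_des, q, qd,
                 kp=65.0, kd=5.0,
                 effort_limit=39.40,
                 saturation_effort=120.0,
                 velocity_limit=12.57):
    """PD torque clipped by an assumed linear speed-torque envelope."""
    # Ideal PD command with zero desired velocity.
    tau = kp * (q_des - q) - kd * qd

    # Assumed linear envelope: less torque is available in the
    # direction of motion as speed approaches the velocity limit.
    ratio = qd / velocity_limit
    tau_max = min(saturation_effort * (1.0 - ratio), effort_limit)
    tau_min = max(saturation_effort * (-1.0 - ratio), -effort_limit)

    return min(max(tau, tau_min), tau_max)


# At rest, a large position error is clipped to the effort limit.
print(motor_torque(q_des=1.0, q=0.0, qd=0.0))

# At the velocity limit, no further torque is available in the
# direction of motion.
print(motor_torque(q_des=2.0, q=0.0, qd=12.57))
```

The point of the envelope is that a policy trained against an ideal PD joint can request torque at speeds where the real motor cannot deliver it.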

5. Delay is part of identification

Actuator delay was not treated as a generic nuisance term. It was modeled from the observed timing behavior of the real firmware and communication path.

The training model therefore includes:

  • action delay on the actuator path
  • grouped observation delay on the sensing path
  • real CAN timing structure rather than perfectly synchronized joint state

These delays are part of the identified system, not just regularization noise.
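The action-delay part can be illustrated with a small buffer that returns an action from a few control steps ago. The sketch assumes that the lag is counted in control steps and sampled once per episode at reset between the minimum and maximum lag. Both are assumptions, and the class itself is hypothetical. The default bounds mirror the delay_min_lag=0 and delay_max_lag=1 setting mentioned earlier.

```python
import random
from collections import deque


class ActionDelay:
    """Delays actions by an integer number of control steps."""

    def __init__(self, min_lag=0, max_lag=1, seed=None):
        self.min_lag = min_lag
        self.max_lag = max_lag
        self._rng = random.Random(seed)
        # Keep only as much history as the largest lag needs.
        self._history = deque(maxlen=max_lag + 1)
        self.lag = min_lag
        self.reset()

    def reset(self, lag=None):
        """Clear history and pick a lag (sampled unless given explicitly)."""
        self._history.clear()
        if lag is None:
            lag = self._rng.randint(self.min_lag, self.max_lag)
        self.lag = lag

    def step(self, action):
        """Store the new action and return the delayed one."""
        self._history.append(action)
        # Early in an episode there may be less history than the lag.
        back = min(self.lag, len(self._history) - 1)
        return self._history[-1 - back]


delay = ActionDelay()
delay.reset(lag=1)
print([delay.step(a) for a in (1.0, 2.0, 3.0)])
```

An observation delay can be handled with the same kind of buffer, applied per sensor group rather than to the whole observation vector.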

This same reasoning also motivated moving away from an idealized built-in actuator model toward a control path that better reflects what the policy actually sees at the IO rate.

6. Toe model identification

The toe joint is passive, but it still affects whole-body stability through contact and push-off.

The simulator therefore needs:

  • toe stiffness
  • toe damping
  • toe limits
  • toe collision geometry
  • toe-ground contact behavior

In practice, toe resistance had to be increased relative to early assumptions because insufficient toe support caused the policy to ignore the toe during learning.

Toe state was exposed to the critic, not the actor. This allowed training to capture the stabilizing effect of the toe without introducing a deploy-time dependency on unmeasured joint state.
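The critic-only exposure of toe state is an asymmetric actor-critic arrangement, which can be sketched as two observation vectors built from the same simulator state. The function and argument names below are hypothetical, and real observation sets contain many more terms. The sketch only shows where the toe state goes.

```python
def build_observations(joint_pos, joint_vel, toe_pos, toe_vel):
    """Return (actor_obs, critic_obs) with toe state only in the critic."""
    # The actor sees only quantities that are measured on the robot.
    actor_obs = list(joint_pos) + list(joint_vel)
    # The critic additionally sees the unmeasured passive toe state,
    # which is available in simulation during training.
    critic_obs = actor_obs + list(toe_pos) + list(toe_vel)
    return actor_obs, critic_obs


actor_obs, critic_obs = build_observations(
    joint_pos=[0.1, -0.2], joint_vel=[0.0, 0.5],
    toe_pos=[0.05], toe_vel=[-0.3],
)
print(len(actor_obs), len(critic_obs))
```

Since only the actor is deployed, the critic's extra inputs never need to exist on hardware.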

7. Collision geometry is also system identification

Contact behavior is highly sensitive to geometry. The locomotion environment therefore replaced detailed mesh collision with simpler capsule-based foot and toe geometry.

This choice improved determinism and reduced the risk of learning artifacts from unstable mesh contact.

The identified contact model includes:

  • multiple foot and toe capsules
  • explicit foot-ground contact sensing
  • toe contact sensing
  • tuned friction and contact dimensions on foot and toe geoms

The final contact configuration emphasized repeatability:

| Contact setting | Value / choice |
| --- | --- |
| contact primitive | capsules instead of mesh collision |
| foot / toe friction | 0.6 |
| contact dimension | condim=3 on foot and toe geometry |
| capsule radius | approximately 12 mm |

In practice, multiple heel, midfoot, and toe capsules were used so the support polygon was more stable than a single coarse collision shape.
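To make the contact settings concrete, the sketch below generates capsule collision geoms with the tabulated radius, friction, and condim. It assumes a MuJoCo-style MJCF model, which the condim and geom terminology suggests but the chapter does not state outright. The capsule names and endpoint coordinates are invented for illustration, and the torsional and rolling friction entries are simply MuJoCo's defaults.

```python
import xml.etree.ElementTree as ET

CAPSULE_RADIUS = 0.012           # approximately 12 mm
FRICTION = "0.6 0.005 0.0001"    # sliding 0.6; other two are MuJoCo defaults

# Hypothetical capsule endpoints in the foot frame, in metres.
FOOT_CAPSULES = {
    "heel": ((-0.06, -0.03, 0.0), (-0.06, 0.03, 0.0)),
    "midfoot": ((0.00, -0.03, 0.0), (0.00, 0.03, 0.0)),
    "toe": ((0.08, -0.03, 0.0), (0.08, 0.03, 0.0)),
}


def capsule_geoms(capsules):
    """Build one MJCF capsule <geom> element per named segment."""
    geoms = []
    for name, (p0, p1) in capsules.items():
        geoms.append(ET.Element("geom", {
            "name": f"{name}_collision",
            "type": "capsule",
            "size": f"{CAPSULE_RADIUS:g}",
            "fromto": " ".join(f"{v:g}" for v in (*p0, *p1)),
            "friction": FRICTION,
            "condim": "3",
        }))
    return geoms


for geom in capsule_geoms(FOOT_CAPSULES):
    print(ET.tostring(geom, encoding="unicode"))
```

With condim=3 only the sliding coefficient affects the contact, so the remaining two friction entries are placeholders here.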

8. Soft limits and deployable ranges

The policy is not trained to use the full hard-stop hardware range. Instead, training uses soft joint limits, typically at 0.9 of the hardware range.

This reduces:

  • hard-stop impacts
  • unrealistic exploitation of boundary states
  • deployment-time shock loads near limit boundaries
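A soft limit of this kind can be computed by shrinking each joint range about its midpoint. Scaling about the midpoint is an assumption of this sketch (other conventions exist), and the joint range in the example is a placeholder rather than a real hardware limit.

```python
def soft_limits(lower, upper, factor=0.9):
    """Shrink [lower, upper] about its midpoint by the given factor."""
    mid = 0.5 * (lower + upper)
    half = 0.5 * (upper - lower) * factor
    return mid - half, mid + half


# Placeholder hardware range in radians, for illustration only.
soft_lo, soft_hi = soft_limits(-1.0, 1.0)
print(soft_lo, soft_hi)
```

Penalizing or terminating on excursions beyond the soft range keeps the policy away from the hard stops without changing the hardware limits themselves.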

9. Geometry errors can invalidate learning

System identification also includes checking the geometry itself. One important example was toe alignment: when the toes were accidentally tilted relative to the intended flat contact pose, the policy stopped learning effective forward-balance recovery.

This is a useful reminder that a locomotion policy can fail even when gains and rewards are reasonable, simply because the physical model is not internally consistent.
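A geometry error like the tilted toe can be caught with a simple pre-training check on the collision shapes in the nominal standing pose. The sketch below is one hypothetical way to do it: it assumes capsule endpoints are available in a world frame with z pointing up, and the one-degree tolerance is an arbitrary illustrative choice.

```python
import math


def capsule_tilt_deg(p0, p1):
    """Angle between a capsule axis and the horizontal plane, in degrees."""
    dx, dy, dz = (b - a for a, b in zip(p0, p1))
    horizontal = math.hypot(dx, dy)
    return math.degrees(math.atan2(abs(dz), horizontal))


def tilted_capsules(capsules, tol_deg=1.0):
    """Return names of capsules whose axis is not flat within tolerance."""
    return [name for name, (p0, p1) in capsules.items()
            if capsule_tilt_deg(p0, p1) > tol_deg]


# Hypothetical endpoints in the nominal pose (metres, z up).
capsules = {
    "toe_flat": ((0.08, -0.03, 0.012), (0.08, 0.03, 0.012)),
    "toe_tilted": ((0.10, -0.03, 0.012), (0.10, 0.03, 0.030)),
}
print(tilted_capsules(capsules))
```

A check like this is cheap to run on every model change and turns a silent learning failure into an explicit report.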
