Reward Design
This chapter documents the reward design used for Asimov locomotion and the main differences from common open-source baselines.
1. Reward design was not the main bottleneck
The locomotion policy did not become deployable through reward shaping alone. Stable transfer depended more strongly on:
- actuator modeling
- observation timing
- deployable observation design
- real controller constraints
Reward design still matters, but it should be understood as one component of the stack rather than the sole driver of performance.
The practical lesson from the legs stack is that reward changes alone did not solve transfer. The walking policy became deployable only after the actuator model, timing model, and observation interface were brought closer to hardware.
2. Core rewards kept from existing baselines
The Asimov reward set was heavily influenced by open-source humanoid locomotion work, especially Booster-style reward structure.
Representative retained terms include:
| Reward | Weight | Purpose |
|---|---|---|
| tracking_lin_vel | +1.0 (base, curriculum-scaled) | Follow commanded linear velocity |
| tracking_ang_vel | +0.5 (base, curriculum-scaled) | Follow commanded yaw rate |
| orientation | -5.0 | Penalize deviation from upright orientation |
| upright | curriculum | Maintain stable torso posture |
| action_rate | -1.0 | Smooth action changes |
| torques | -2e-4 | Encourage efficient actuation |
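The velocity-tracking terms in this table are typically implemented with an exponential kernel on the tracking error, as in most open-source locomotion stacks. A minimal sketch, assuming that form (the function name and the sigma value are illustrative, not taken from the Asimov code):

```python
import numpy as np

def tracking_lin_vel_reward(cmd_vel_xy, base_vel_xy, sigma=0.25):
    """Exponential kernel on the linear-velocity tracking error.

    cmd_vel_xy, base_vel_xy: (2,) arrays in the base frame (m/s).
    sigma is a hypothetical temperature: smaller values make the
    reward fall off faster as the error grows.
    """
    err_sq = np.sum(np.square(cmd_vel_xy - base_vel_xy))
    return np.exp(-err_sq / sigma**2)
```

Perfect tracking yields the maximum unweighted value of 1.0, which is then scaled by the +1.0 weight and the curriculum factor; the yaw-rate term follows the same pattern on a scalar error.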
3. No gait clock
Some locomotion baselines provide an explicit gait phase clock to the policy. Asimov does not.
This choice was made because:
- Asimov kinematics are not identical to baseline robots
- the ankle range is limited by the parallel mechanism
- the policy should discover a gait that fits this hardware rather than follow a hand-imposed gait phase
This makes the policy less prescriptive and more hardware-specific.
4. Asymmetric pose tolerances
Uniform pose tolerances across all joints are not appropriate for Asimov. The legs use different tolerances depending on the joint and the hardware structure.
Representative walking tolerances are:
| Joint | Typical tolerance (rad) |
|---|---|
| hip pitch | 0.5 |
| hip roll | 0.25 |
| hip yaw | 0.2 |
| knee | 0.5 |
| ankle pitch | 0.2 |
| ankle roll | 0.12 |
The ankle tolerances are tight because the real ankle range is limited.
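One common way to encode asymmetric tolerances is a dead-band penalty: joint error inside the tolerance band is free, and only the excess is penalized. A sketch under that assumption (the joint ordering, function name, and quadratic form are illustrative):

```python
import numpy as np

# Hypothetical joint order: hip pitch, hip roll, hip yaw,
# knee, ankle pitch, ankle roll (values from the table above).
POSE_TOL = np.array([0.5, 0.25, 0.2, 0.5, 0.2, 0.12])

def pose_penalty(q, q_default, tol=POSE_TOL):
    """Dead-band pose penalty with per-joint tolerances.

    Error inside each joint's tolerance band costs nothing, so the
    loose joints (hip pitch, knee) can swing freely while the tight
    joints (ankles) are held near their defaults.
    """
    excess = np.clip(np.abs(q - q_default) - tol, 0.0, None)
    return -np.sum(np.square(excess))
```

With this form, widening a joint's tolerance directly widens the region of free motion for that joint without touching the others.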
5. Narrow-stance stability penalties
Asimov has a narrower stance than many humanoid baselines. This increases lateral balance sensitivity and motivates stronger stability penalties.
Representative terms include:
| Reward | Weight |
|---|---|
| body_ang_vel | -0.08 |
| angular_momentum | -0.03 |
These terms help reduce large pelvis rotation and unstable whole-body motion.
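A minimal sketch of how the two penalties could be combined, assuming quadratic norms on the pelvis angular velocity and the whole-body angular momentum (the exact norms and frames are assumptions, only the weights come from the table):

```python
import numpy as np

def stability_penalty(base_ang_vel, ang_momentum,
                      w_ang_vel=0.08, w_momentum=0.03):
    """Narrow-stance stability penalty (sketch).

    base_ang_vel: (3,) pelvis angular velocity (rad/s).
    ang_momentum: (3,) whole-body angular momentum about the CoM.
    Both terms are quadratic, so small wobbles are cheap but large
    pelvis rotations and whole-body spin grow expensive quickly.
    """
    return (-w_ang_vel * np.sum(np.square(base_ang_vel))
            - w_momentum * np.sum(np.square(ang_momentum)))
```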
6. Contact-force limits
The reward set penalizes excessive ground reaction forces.
This serves two purposes:
- it discourages aggressive stomping behavior
- it protects the real robot from unnecessary impact loading
Representative terms include:
| Reward | Weight | Note |
|---|---|---|
| feet_contact_force_limit | -5e-4 | penalizes forces above approximately 350 N |
| feet_stumble | -1.25 | penalizes large horizontal-to-vertical contact ratios |
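The force-limit term can be sketched as a one-sided penalty on the excess above the threshold; forces below the limit cost nothing, so normal support loads are unaffected. The linear excess form and function name are assumptions:

```python
import numpy as np

def contact_force_penalty(foot_forces_z, limit=350.0, weight=5e-4):
    """Penalize vertical ground-reaction force above the limit.

    foot_forces_z: (num_feet,) vertical contact forces in newtons.
    Only the excess above ~350 N is penalized, which discourages
    stomping without punishing ordinary weight-bearing contact.
    """
    excess = np.clip(foot_forces_z - limit, 0.0, None)
    return -weight * np.sum(excess)
```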
7. Air-time reward
Asimov legs are light enough to support dynamic walking with noticeable swing and brief unloaded phases. An air-time reward is therefore used to discourage shuffling behavior.
Representative term:
| Reward | Weight |
|---|---|
| air_time | +0.5 |
This reward encourages dynamic gait emergence rather than static stepping.
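In the style of open-source legged-locomotion rewards, air time is usually credited at touchdown: each foot accumulates swing time while unloaded, and on first contact the policy is rewarded for swing phases longer than a threshold and penalized for shorter ones. A sketch under those assumptions (the 0.3 s threshold is illustrative):

```python
import numpy as np

def air_time_reward(air_time, first_contact, threshold=0.3, weight=0.5):
    """Air-time reward credited at touchdown (sketch).

    air_time: (num_feet,) seconds each foot has been unloaded.
    first_contact: (num_feet,) bool, True on the touchdown step.
    Long swings earn positive reward; quick shuffling steps with
    air_time below the threshold earn negative reward, which is
    what pushes the gait away from shuffling.
    """
    return weight * np.sum((air_time - threshold) * first_contact)
```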
8. Consolidated reward table
The legs policy used a compact reward set rather than a large collection of highly specialized terms.
| Reward | Weight | Role |
|---|---|---|
| tracking_lin_vel | +1.0 (curriculum-scaled) | commanded linear velocity tracking |
| tracking_ang_vel | +0.5 (curriculum-scaled) | commanded yaw tracking |
| orientation | -5.0 | penalize orientation deviation |
| air_time | +0.5 | dynamic stepping |
| action_rate | -1.0 | smooth action changes |
| torques | -2e-4 | efficient actuation |
| pose | curriculum | posture shaping |
| upright | curriculum | torso stability |
| body_ang_vel | -0.08 | pelvis rotation penalty |
| angular_momentum | -0.03 | global stability penalty |
| self_collisions | -1.0 | reject self-contact |
| feet_stumble | -1.25 | discourage unstable foot strikes |
| feet_contact_force_limit | -5e-4 | discourage excessive ground impact |
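One benefit of a compact set like this is that the weights can live in a single table in code, with each term computed unweighted. A sketch of that aggregation pattern (the dictionary layout is an assumption; the fixed weights come from the table, and the curriculum-scaled terms are noted as applied separately):

```python
# Fixed weights from the consolidated table; pose and upright are
# curriculum-scaled, so their weights would be computed per step
# from the current curriculum factor rather than listed here.
REWARD_WEIGHTS = {
    "tracking_lin_vel": 1.0,
    "tracking_ang_vel": 0.5,
    "orientation": -5.0,
    "air_time": 0.5,
    "action_rate": -1.0,
    "torques": -2e-4,
    "body_ang_vel": -0.08,
    "angular_momentum": -0.03,
    "self_collisions": -1.0,
    "feet_stumble": -1.25,
    "feet_contact_force_limit": -5e-4,
}

def total_reward(terms, weights=REWARD_WEIGHTS):
    """Weighted sum of unweighted reward terms for one step.

    terms: dict mapping term name -> unweighted value, so tuning a
    weight never requires touching the term implementations.
    """
    return sum(weights[name] * value for name, value in terms.items())
```

Keeping the weights in one place makes reward tuning a single-file change and makes the effective trade-offs between terms easy to audit.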
9. Practical lesson
The most important lesson from this stack is that reward design should remain consistent with the hardware interface.
It is counterproductive to reward behaviors that require:
- unavailable sensors
- unrealistic joint range
- unrealistically fast force response
- contact conditions that the deployed robot cannot reproduce
For Asimov, reward design works best when it reflects the real limitations and affordances of the leg hardware.