Skill

deploy-policy

Run a trained policy OUTSIDE its RL env — in a control loop or a multi-robot demo, not just the RL harness. Use after training a policy with stage-isaac-rl-env. It reconstructs the exact observation and action map (obs term order, action = raw*scale + default, joint order) so the exported actor runs standalone, and lifts a walk/locomotion ONNX into a torch MLP on the GPU because there is no onnxruntime CUDA provider on the DGX Spark (aarch64/GB10). The reach deploy is ambidextrous with no observation change. Driven by the robot descriptor; reference impl proven on the Unitree G1.

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/robotics-connect:deploy-policy

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

A policy trained in Isaac Lab is only useful if it runs **outside** the training harness — in a control

Supporting Files

scripts/deploy.py

SKILL.md

174 lines · ~3.1k tokens

Stats

LanguagePython

Stars0

MaintenanceExcellent

Last CommitJun 18, 2026

Actions

View Source View Plugin View on GitHub View README

Stats

Actions

Deploy the policy out of the RL env

A policy trained in Isaac Lab is only useful if it runs outside the training harness — in a control loop, a multi-robot demo, or eventually on hardware. That means reproducing the env's observation and action exactly, by hand. scripts/deploy.py does it, lifted from the eye-verified armwaheed/robots#2 locomotion.py and generalized to read the joint set / EE links / action scale from the robot descriptor.

1. Reconstruct the obs/action map exactly

The exported actor is a stateless MLP — feed it the wrong observation and it silently misbehaves. The ReachPolicy rebuilds the training observation term-for-term:

base_lin_vel(3) base_ang_vel(3) projected_gravity(3) hand_target(7 = base-frame pos+quat)
joint_pos_rel(ALL joints) joint_vel_rel(ALL joints) last_action(len action_joints)

and the action map: target = raw * action_scale + default_joint_pos, applied to the descriptor's action joints (the locked, absent-on-hardware joints are excluded and stay at default — exactly as in training). The hand target is given in the base frame (the policy is yaw-invariant) and clamped into the trained command box, so a receding base can't drive an out-of-distribution command (which the policy would chase by leaning harder → stepping back → a runaway topple).

import deploy
desc = deploy.load_descriptor(".../descriptors/unitree_g1_29dof.json")
reacher = deploy.ReachPolicy(robot, "cuda", ".../exported/policy.pt", desc, action_scale=0.5)
reacher.set_world_target(headward_corner_xyz)   # aim it; ambidextrous — picks the same-side hand
# per control step (~50 Hz):
reacher.act()

Ambidexterity is free at deploy — ReachPolicy reads the active hand from the (clamped) target's y-sign, so two robots flanking opposite sides of a bed each lead with their natural same-side hand, with no change to the observation. active_wrist_link() tells you which wrist the grasp attaches to.

2. Lift the walk ONNX into torch (no onnxruntime GPU on the Spark)

The DGX Spark (aarch64 / GB10, sm_121) has no onnxruntime CUDA execution provider, and the project rule is no CPU inference. load_onnx_mlp_to_torch lifts an exported MLP-actor ONNX (a Gemm/ELU stack) straight into a torch.nn.Linear stack on cuda — ONNX Gemm transB=1 stores weight as (out, in), exactly torch.nn.Linear layout, so the weights copy 1:1 (parity ~3e-6).

VelocityWalker uses it to run Unitree's velocity-walk policy (480-D obs, 5-step term-major history, raw*0.25 + default action) so the robot walks to the work site before the reach policy takes over. The walk's joint order + default pose come from the descriptor (sim_asset.joint_order, effectors.default_pose).

walker = deploy.VelocityWalker(robot, "cuda", ".../g1_velocity_walk.onnx",
                               joint_names=desc["sim_asset"]["joint_order"],
                               default_pos=[...])   # the policy's deploy default
walker.act([vx, vy, wz])   # per control step

The behaviour layer owns the handoff

Walking to a position, deciding to release / retry / ask a peer, and avoiding another robot are behaviour-layer jobs, not terms in the low-level RL policy (training them in would make it intractable). The deploy pattern is: behaviour picks the target and the walk→reach handoff; VelocityWalker gets the robot there; ReachPolicy does the planted, balanced, ambidextrous reach.

On hardware — the de-risk ladder + low-level takeover

Running the policy in a control loop (above) is not the same as putting it on the real motors. A whole-body policy drives the legs, so deploying it means taking the legs off the vendor balance controller and letting the RL net balance — a first transfer falls if anything is off, fast (sub-second). Stage it as a ladder; each rung gates the next (full rationale + the stop hierarchy in SAFETY.md). The robot-agnostic implementation is lib/policy_deploy.py (contract + obs builder + the three rungs, with SafeStop baked into every motion stage); a robot supplies a RobotIO:

Rung	Runs	Fall risk	Mechanism
0 offline	obs+policy, printed, no commands	none	read `rt/lowstate`; build the obs; print actions
1 partial (preferred)	policy's arms only, legs on vendor balance	none	`rt/arm_sdk` weight-blend overlay (`motor_cmd[29].q = weight`)
2 whole-body (last resort)	all joints, operator-driven low-level mode (NOT a software release)	HIGH	`rt/lowcmd` (explicit VOLATILE QoS), robot feet-on-ground + slack gantry

Prefer rung 1 for any useful task the arms can do — fall-safe, no gantry. Take rung 2 only if the task demonstrably needs whole-body balance the vendor controller can't provide.

Verify the joint order from the ACTUAL env, not just the descriptor. IsaacLab orders the articulation DOFs interleaved (left/right pairs by tree depth) — e.g. action index 9 is left_shoulder_pitch, sitting between the knees (7,8) and ankles (11,12). That is not the Unitree SDK's 0..28 index order. Dump the ground-truth contract by booting the exact training env and reading robot.joint_names / the action term's resolved order + scale + default offset (armwaheed/robots#3 rl/dump_deploy_contract.py → deploy_contract.json), then map sim↔SDK by joint name. Self-check at rung 0: the predicted standing-pose offsets must land on the named joints (G1 BalanceStand crouch → left_knee joint_pos_rel ≈ +0.2, hip_pitch ≈ −0.17, arms ≈ 0).

G1 EDU motor table (verified live): indexed like the 29-DOF G1 — legs 0–11, waist_yaw 12, L-arm 15–19, R-arm 22–26; the 6 joints the EDU lacks (13,14,20,21,27,28) are present-but-zero and never in the 23-joint action set.

First, the bigger call: PREFER rung 1 (arm overlay) on the G1 EDU. Whole-body rt/lowcmd is only valid when the high-level controller is fully off (Develop mode); against a live controller it jitters/oscillates (unitree_sdk2_python#43/#108). The rt/arm_sdk overlay blends with the running balancer instead of fighting it, so the bedside reach is fall-safe and gantry-free. Use the vendor LocoClient.Move for locomotion. Treat rung 2 as gantry-only research.

Rung-2 takeover (the parts that bite):

Hand off to low-level control BY HAND, then VERIFY by SERVICE LIVENESS — never a software release, and NOT via CheckMode. The operator drives the G1 into Develop mode (suspend → L2+B → L2+R2). CheckMode is useless as a gate — on the EDU it returns name='ai' in both AI-Sport and Develop, and rt/sportmodestate is silent on this variant. The reliable signal is a read-only loco GET-FSM RPC (ROBOT_API_ID_LOCO_GET_FSM_ID): it answers (code 0) when the balancer is running and errors/times out when it's off (Develop → code 3102, verified live). verify_whole_body_ready gates on that (answers → refuse; down → proceed; probe unavailable → proceed only on an operator Develop assertion). There is no ReleaseMode() call — that software path took the legs in a real incident (SAFETY.md §0).
Prove the abort latches in this mode first (confirm_abort_live: operator presses+releases, code sees the latch). In Regular/AI-Sport mode the buttons fire vendor gestures, not aborts.
rt/lowcmd writer = explicit VOLATILE + keep-last-1 QoS so a torn-down writer leaves no latched command. The plain ChannelPublisher can't set this — build the Channel from the factory and call SetWriter(qos) (verified against the robot's own unitree_sdk2_python, cyclonedds 0.10.2).
Match the sim PD gains (G1 bed-reach: hip 100/2, knee 150/4, ankle 40/2, waist 200/5, arms 40/10) — read from the deploy contract so they can't drift. Set mode_pr=PR, echo mode_machine from rt/lowstate, motor_cmd[i].mode=1.
Kill the activation transient: move to default BEFORE the policy. A policy started from an off-default pose commands a huge first action. Scripted gain-ramped move to the EXACT default pose (damping-first) → start the policy on a neutral command (≈0 first action) and ramp the command → first test = STAND, add a reach only after a stable stand. (Rung 1 uses the same idea: ramp the arm_sdk weight 0→1 holding the live pose.)
Wrap the loop in SafeStop so it damps on return/exception/SIGINT/ SIGTERM, and never kill -9 it (a hard kill latches the last high-gain command → runaway; this broke a window once — SAFETY.md §0). Abort/end → low-level damp (kp=0, kd≈3).

Hard-won on hardware (IRL bed-making, 2026-06-17)

Blend EVERY transition — including the exit/abort release, not just entry. An un-blended overlay release (weight→0 in one tick) makes the vendor controller yank the limb across to its neutral pose — a jarring, unsafe-looking jerk. The damp is mode-aware: a compliant overlay RAMPS weight→0 (re-publishing the last pose); a rigid/whole-body authority damps fast. See the RobotIO.damp_once contract + the G1 binding. For a planned exit after reaching into clutter, retrace the recorded approach path in reverse before releasing — a direct return can catch on the furniture the careful approach avoided (hardware-observed: the hand snagged the mattress on a direct return).
The loco-liveness mode gate flakes under DDS contention. A LocoClient created AFTER the 500 Hz rt/lowstate subscriber (and the policy load) can't match its reply reader → GET_FSM_ID times out (code 3104) even when the balancer is up (a clean-context probe returns code 0). Prime the loco client in connect() BEFORE the subscribers and POLL (first code-0 = up); a single in-context call false-negatives. The robust fix is a clean-SUBPROCESS liveness probe; until then gate runs on an independent clean GET_FSM_ID 0 (--assume-balancer-up), which is false-negative-only + fall-safe.
In the arm-overlay path, the VENDOR balancer carries the manipulation/draw load — not the policy. The RL policy's whole-body load-robustness (grip-slip / pull-load training) applies only to the whole-body rung; in overlay mode the policy merely generates ARM targets while the factory balancer holds the legs. (Confirmed live: torso twist stayed within balance limits, the factory balancer didn't even compensate during a draw.) So a deterministic arm trajectory is just as load-robust there.
The RL reach policy can't do a high, collision-free approach to a surface. Its trained command box caps the hand below shoulder height, and a single hand-xyz target can't hold a "shoulders rolled wide" configuration — so it comes in low and drags. For a clean top-down approach, do a deterministic joint-space lift (stay shoulder-wide while extending so the arm clears the surface edge) and hand off to the RL policy only for the final reach + draw — a deterministic-approach / RL-manipulation hybrid.

Generalization

ReachPolicy and VelocityWalker take the joint set, EE links, and scales from the descriptor, so the same deploy code runs a 23-DOF or 29-DOF G1, or a new humanoid, once its descriptor and exported policy exist. The only robot-specific knowledge is in the descriptor — which is the whole point. The hardware ladder above is likewise robot-agnostic: only the damp/release/gain bindings change per robot, and the safe-stop contract (lib/safe_stop.py) is mandatory for every robot.

deploy-policy

Invocation

Context Preview

Supporting Files

SKILL.md

deploy-policy

Invocation

Context Preview

Supporting Files

SKILL.md

Deploy the policy out of the RL env

1. Reconstruct the obs/action map exactly

2. Lift the walk ONNX into torch (no onnxruntime GPU on the Spark)

The behaviour layer owns the handoff

On hardware — the de-risk ladder + low-level takeover

Hard-won on hardware (IRL bed-making, 2026-06-17)

Generalization

Similar Skills

Deploy the policy out of the RL env

1. Reconstruct the obs/action map exactly

2. Lift the walk ONNX into torch (no onnxruntime GPU on the Spark)

The behaviour layer owns the handoff

On hardware — the de-risk ladder + low-level takeover

Hard-won on hardware (IRL bed-making, 2026-06-17)

Generalization

Similar Skills