WorkBenchMark: Benchmarking Robotic Assembly with LEGO
A shared yardstick for robotic assembly, built on LEGO Duplo: 400 tasks across four difficulty tiers, an open-vocabulary Assembly-by-Disassembly baseline, and a finding — structured planning beats a modern vision-language-action model at every tier, and the gap widens with complexity.
Our paper WorkBenchMark was accepted to the RoboCup Symposium 2026 (oral presentation). It’s a benchmark for robotic assembly, led by Wenbo Ma, with Matteo Tschesche (MASCOR, FH Aachen) and Till Hofmann — and it grew out of a simple frustration: assembly is a perfect storm of low-level manipulation and task-level reasoning under physical constraints, yet there’s no shared yardstick for comparing whole systems on it. Every paper rolls its own tasks and success criteria, so progress is hard to measure.
The task
We build the benchmark on LEGO Duplo. The bricks are cheap, reusable, and mechanically forgiving, but small placement errors still matter — bricks interlock and block one another, so the order you assemble them in is genuinely non-trivial. That makes Duplo a tidy stand-in for the harder industrial assembly the RoboCup Smart Manufacturing League is built around.
Each task hands the robot a pick area full of scattered bricks and a target product. The robot has to perceive the scene, work out a feasible build order, and assemble the thing — fully autonomously, no resets, no privileged state. A task counts as solved when the finished product sits in the assembly area and is no longer touched by the robot.


There are 400 tasks across four difficulty tiers (40 of them reproduced for real-robot experiments), scaling both the planning challenge and the manipulation challenge:
- Tier 1 — two-brick vertical stacking; basic pick-and-place.
- Tier 2 — multi-brick stacking (3–5), where small placement errors start to accumulate.
- Tier 3 — 3D shape layouts (3–12), needing both horizontal and vertical positioning.
- Tier 4 — complex interlocking structures with overhangs and stability dependencies that rule out most build orders.

A baseline, so it isn’t an empty arena
We also ship a baseline solution, so the benchmark comes with a real reference point rather than just a set of problems. It runs in three stages — plan → perceive → execute — with clean interfaces between them.
Plan — Assembly-by-Disassembly. The neat trick from automated manufacturing is to reason backwards: instead of searching forward over ways to build the product, ask how the finished product could be taken apart, removing only a part that’s actually free at each step, and then reverse that sequence into a feasible build order. We make every removal answer to geometry. A part is only a candidate if there’s a collision-free grasp for it (grasp reachability) and if taking it away leaves the rest stably supported (press-stability, approximated with a support polygon). The search runs over a voxel grid aligned to the LEGO stud lattice.
Perceive — open-vocabulary pose estimation. Parts are specified only at the task level (“the red 2×2 brick”), so fixed-class detectors don’t apply. We ground detection in free-form text with GroundingDINO, refine the boxes to masks with SAM, and lift those to 6D poses with FoundationPose — no task-specific fine-tuning.
Execute — collision-aware state machine. Approach → grasp → lift → move → place → release, with transit motions planned collision-free in MoveIt 2 / OMPL and the insertion done as a constrained vertical press to engage the studs. A failed grasp retries the next candidate before escalating to a replan.
What we found
We compared this planning-based pipeline against a modern vision-language-action setup: a VLM (Gemini 2.5 Flash) generating step-wise assembly instructions for a pretrained VLA policy (π₀.₅). The structured pipeline wins at every tier, and the gap widens as the structures get harder:
| Tier | Ours — success | VLA — success |
|---|---|---|
| 1 — two-brick stack | 96.7% | 73.3% |
| 2 — multi-brick stack | 90.0% | 63.3% |
| 3 — shape assembly | 70.0% | 23.3% |
| 4 — complex assembly | 66.7% | 6.7% |
| Overall | 80.8% | 41.7% |
The VLA is genuinely capable on the simplest stacks, but it falls apart as complexity rises: it tends not to reason about whether a grasp is reachable or whether the half-built structure is stable, so it proposes physically infeasible sequences — its stability-violation rate climbs past 80% on the hard tiers, against roughly 25% for ours at the hardest tier. The honest caveat is that our pipeline isn’t free: most of its wall-clock time goes to the perception stack, and the VLA actually plans faster. So this isn’t “learning loses.” It’s that, today, the structural and physical guarantees you need for interlocking assembly are far easier to get from explicit planning than to learn end-to-end.
Which is, conveniently, the same bet I made a couple of posts ago: keep the symbolic, physically-grounded scaffold legible, and let learning fill the gaps it can’t specify by hand. WorkBenchMark is partly an attempt to give that bet — and its opposite — a fair, shared test, rather than another one-off result.
The benchmark, the simulator, and the baseline will be released openly. The point isn’t this particular number; it’s to give the assembly community a continuous measure of progress toward integrated, real-world autonomy.
Reference — Wenbo Ma, Daniel Swoboda, Matteo Tschesche, Till Hofmann. “WorkBenchMark: A LEGO-Based Assembly Benchmark with an Assembly-by-Disassembly Baseline for the Smart Manufacturing League.” RoboCup Symposium, 2026. Project page · PDF