Keep the Learning Inside the Plan
An opinion I formed the hard way — for long-horizon robot tasks, the durable scaffold is symbolic and legible. Learning is best poured into that structure, not wrapped around it.
A couple of years ago, for my master’s thesis, I tried to make one neural network do two jobs at once: read a cluttered scene into the symbolic facts a planner needs — the red cube is left of the blue one — and then carry out the planner’s actions, like pick(red_cube), from that same learned representation. It half-worked. The grounding was promising; the execution errors piled up over long sequences. The tidy conclusion would have been “scale it up.” The honest one was that I’d been quietly hoping the network would absorb structure I should have kept explicit.
So I’ve come around to a narrower question. For long-horizon manipulation, the interesting thing isn’t how much of the robot we can learn — it’s how little we should. The symbolic part — what a task decomposes into, what has to be true before an action makes sense, what counts as progress — is the part that stays stable across kitchens and warehouses, and the part a person can actually inspect when something goes wrong. That’s worth protecting, not dissolving into weights.
Learning, then, earns its keep filling the gaps that structure can’t specify by hand: which of many feasible actions is worth trying here, what a symbol looks like in this particular pile of objects, how an abstract step turns into motion. That’s roughly the shape of what I work on now — guiding a planner with a sense of what’s physically feasible so it generalises instead of re-deriving everything per task, and learning representations that move between the abstract and the concrete. I don’t have clean answers, and these aren’t my private territory; they’re what my whole group is chipping at.
The field keeps swinging toward bigger end-to-end policies, and they’re genuinely impressive on short, well-demonstrated skills. But “stack a hundred of these into a coherent half-hour task” is precisely where I watched my own end-to-end ambitions fall over. My bet — unfashionable, and I think right — is that the scaffold should stay symbolic and legible, and the learning should go inside it rather than around it.
That’s a bet, not a result. Ask me again in a few years.