Scaling Up Natural Language Understanding for Multi-Robots Through the Lens of Hierarchy

*Equal Contribution, 1Tsinghua University, 2Carnegie Mellon University

Abstract

To allow non-experts to easily specify long-horizon, multi-robot collaborative tasks, researchers are increasingly using language models to translate human natural language commands into formal specifications. However, because the translation can be done in multiple ways, the resulting specifications may be inaccurate or lead to inefficient downstream multi-robot planning. Our core insight is that specifications should be concise representations, making it easier for downstream planners to find optimal solutions while remaining straightforward to derive from human instructions. Given the superior performance of multi-robot planners with hierarchical specifications, we represent tasks using hierarchical structures and introduce a full pipeline that translates natural language commands into hierarchical Linear Temporal Logic (LTL) and then solves the corresponding planning problem. The translation happens in two steps with the help of Large Language Models (LLMs). Initially, an LLM transforms the instructions into a hierarchical representation defined as a Hierarchical Task Tree, capturing the logical and temporal relations among tasks. Following this, a fine-tuned LLM translates the sub-tasks of each task into flat LTL formulas and aggregates them to form hierarchical LTL specifications. These specifications are then leveraged for planning using off-the-shelf planners. Our framework showcases the potential of LLMs in harnessing hierarchical reasoning to automate multi-robot task planning. Through evaluations in both simulation and real-world experiments involving human participants, we demonstrate that our method can handle more complex instructions than existing methods. Moreover, the results indicate that our approach achieves higher success rates and lower costs in multi-robot task allocation and plan generation.



Overview

overview image

Overview of the framework, using the dishwasher loading problem as a case study. Note that the non-leaf nodes in the Hierarchical Task Tree (HTT), the language descriptions of sibling tasks, and the flat specifications are color-coded to indicate one-to-one correspondence.

The HTT unfolds level by level, where each child task is a decomposition of its parent task. Notably, the tasks at the bottom level are not necessarily indecomposable. This flexibility allows for varying numbers of levels and tasks per level, accommodating differences in task understanding and the range of primitive actions available to robot agents.
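
The tree structure described above can be sketched as a small data type. This is a minimal illustration, not the paper's implementation; the class and example task names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class HTTNode:
    """One task node in a Hierarchical Task Tree (illustrative sketch)."""
    name: str                                        # e.g. "place plates into the lower rack"
    children: list = field(default_factory=list)     # decompositions of this task

    @property
    def is_leaf(self) -> bool:
        # Leaves are simple tasks; they may still be expandable into API calls.
        return not self.children

    def add_child(self, child: "HTTNode") -> "HTTNode":
        self.children.append(child)
        return child

# A tiny tree for the dishwasher case study.
root = HTTNode("load the dishwasher")
lower = root.add_child(HTTNode("fill the lower rack"))
lower.add_child(HTTNode("place plates into the lower rack"))
```

Because any node may hold children, the same type covers trees of arbitrary depth and branching, matching the flexibility noted above.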

Additionally, it is important to highlight that the relation $R$ captures only the temporal relationships between sibling tasks that share the same parent. The temporal relationship between any two tasks in the tree can be inferred by tracing their lineage back to their common ancestor, thereby simplifying the overall complexity of the structure.

When a task instruction is received, we use LLMs to construct the HTT through a two-step process, as outlined in step 1.

  1. Logical search: For every non-leaf node $v$, we gather its child tasks $V'$ and the temporal relations among them, defined by $R' \subseteq V' \times V'$. We then use LLMs to rephrase these child tasks and their temporal relations into syntactically correct sentences aligned with the semantics of LTL specifications (as illustrated in step 2.1). These reformulated sentences are input into a fine-tuned LLM that produces a single LTL formula (as depicted in step 2.2).
  2. Action Completion: Given an HTT, each leaf node should represent a simple task on a certain object, such as "task 1.1.1 place plates into the lower rack" in the example. By viewing such a simple task as a sequence of action steps, we prompt the LLM to generate a sequence of pre-defined API calls that expand the simple task. For instance, the symbol $\pi_{\text{plates}}^l$ representing task 1.1.1 can be replaced with an LTL specification composed of sequential APIs (step 2.2): $$\pi_{\text{plates}}^l = \Diamond ( \texttt{Pickup(plate)} \wedge \Diamond\, \texttt{Move(plate, lower\_rack)})$$ After this step, a complete hierarchical LTL specification is generated (example in step 2.1).
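
The nested "eventually" pattern used in the Action Completion step can be generated mechanically once the LLM has produced the API call sequence. Below is a minimal sketch (the function name is ours, and `<>`/`&` stand in for $\Diamond$/$\wedge$ in ASCII notation); the LLM's role of choosing the calls is assumed, not shown.

```python
def expand_leaf_to_ltl(api_calls):
    """Nest a sequence of API calls into a sequential eventually-chain:
    <>(a1 & <>(a2 & ... <> an)), so each step happens after the previous one."""
    formula = f"<> {api_calls[-1]}"
    for call in reversed(api_calls[:-1]):
        formula = f"<> ({call} & {formula})"
    return formula

expand_leaf_to_ltl(["Pickup(plate)", "Move(plate, lower_rack)"])
# -> "<> (Pickup(plate) & <> Move(plate, lower_rack))"
```

Nesting (rather than conjoining independent $\Diamond$ terms) is what encodes the ordering: the inner call can only be satisfied after the outer one holds.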


AI2-THOR experiments

To evaluate our method on tasks with more complex temporal requirements, we combine several base tasks in the ALFRED dataset to generate derivative tasks (each derivative task is composed of up to 4 base tasks).

We compare our method with SMART-LLM. SMART-LLM uses LLMs to generate Python scripts that invoke predefined APIs for the purposes of task decomposition and task allocation.

The metrics are as follows:

  1. Success rate, which measures whether the target goal states of objects are achieved and if the order in which these states occur satisfies the specified temporal requirements.
  2. Travel cost, measured in meters, is defined as the total distance traveled by all robots, excluding any costs related to manipulation.
  3. Completion time, quantified as the number of discrete time steps required to complete the tasks.
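
The two quantitative metrics above can be computed directly from per-robot trajectories. The sketch below is illustrative (trajectory format and units are our assumption: lists of (x, y) waypoints in meters, one move per time step), not the evaluation harness used in the paper.

```python
import math

def travel_cost(trajectories):
    """Total Euclidean distance traveled by all robots, in meters
    (manipulation costs excluded, as in the metric definition)."""
    total = 0.0
    for traj in trajectories:
        for (x0, y0), (x1, y1) in zip(traj, traj[1:]):
            total += math.hypot(x1 - x0, y1 - y0)
    return total

def completion_time(trajectories):
    """Discrete time steps until the last robot finishes (makespan)."""
    return max(len(traj) - 1 for traj in trajectories)

trajs = [[(0, 0), (3, 4)], [(1, 1), (1, 2), (1, 3)]]
travel_cost(trajs)      # 5.0 + 2.0 = 7.0 m
completion_time(trajs)  # 2 steps
```

Note that travel cost sums over all robots while completion time takes the maximum, so adding robots can lower the makespan while raising total distance.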


1 base task with 4 robots: Place a computer on the ottoman

1 base task with 4 robots: Pick up a green candle and place it on the countertop

For derivative tasks with 1 base task, our method can effectively choose the shortest path while avoiding repeated actions when executing sub-tasks, compared with SMART-LLM, which primarily seeks to identify a feasible solution without optimizing costs.

4 base tasks with 1 robot in dining room

4 base tasks with 4 robots in dining room

4 base tasks with 4 robots in bedroom

For derivative tasks with 4 base tasks, SMART-LLM failed to generate a feasible plan, as its prompt scripts exceed the maximum context length the LLM accepts once the task content becomes longer and more complex. In contrast, because the HTT structure decomposes long tasks into tree-like dependencies, our method can effectively handle long tasks composed of multiple base tasks, extract their temporal logic relationships, and produce plans. Here we demonstrate the plans of 4 robots on derivative tasks with 4 base tasks in two different environments.

Real-world experiments

Our real-world experiments are conducted in a tabletop setting, where the task involves a robotic arm placing fruits and vegetables onto colored plates.

Given the primarily 2D nature of the task, we convert the tabletop environment into a discrete grid world. Using only one robotic arm simplifies the task compared to the multi-robot scenarios in the simulator, as it eliminates the need for task allocation.
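
A tabletop-to-grid conversion like the one described can be a simple quantization of continuous positions. The cell size and origin below are illustrative placeholders, not the actual values used in the experiments.

```python
def to_grid(x, y, origin=(0.0, 0.0), cell=0.05):
    """Map a tabletop position (meters) to a discrete grid cell.
    `origin` is the table corner; `cell` is the cell edge length (assumed 5 cm)."""
    ox, oy = origin
    return (int((x - ox) // cell), int((y - oy) // cell))

to_grid(0.12, 0.26)  # -> (2, 5)
```

Once objects and target plates are snapped to cells, the planner can reason over grid coordinates instead of continuous poses.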

Our evaluation focuses on two main aspects:
  1. the adaptability to different verbal tones and styles from various users;
  2. the comparative effectiveness of our plan solution against existing methods.


same task with different object arrangements

A sequence of images, arranged from left to right and top to bottom, depicts the task "First, put a set of keychains on the armchair. Retrieve a pencil and put it on the side table. Move the phone and the bat to the bed in any order." Objects and their trajectories are color-coded as follows: keychains (red), bat (blue), pencil (purple), and phone (green). $t$ denotes the discrete time step in simulation.

Snapshots depict four arms performing real-world tasks of picking and placing objects via handover. The instruction given is, "Please move the blue, green, and multi-colored blocks to the two opposite boxes, place the colored ones after the green ones." Target areas are colored in magenta. The block being selected is emphasized with an ellipse, and the remaining blocks are contained within rectangles.
The figure above is a comparative overview of snapshots between our approach and LLMs for the task "Place the green apple in the pink plate, the orange in the yellow plate and the red apple in the blue plate. The order of placement is not specified and can be chosen freely."


Our method gives an optimal plan based on the positions of the fruits, whereas LLMs largely follow the sequence in which the fruits are mentioned in the instructions.
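
The gap between instruction-order execution and position-aware planning can be illustrated with a toy pick-and-place cost model. Everything below is hypothetical (grid positions, Manhattan distance, exhaustive search over orders); it is not the paper's planner, only a demonstration of why reordering by position lowers cost.

```python
from itertools import permutations

def plan_cost(start, pick_place_pairs, order):
    """Total gripper travel (Manhattan distance on a grid) when the
    (pick, place) pairs are visited in the given order."""
    def d(a, b):
        return abs(a[0] - b[0]) + abs(a[1] - b[1])
    pos, cost = start, 0
    for i in order:
        pick, place = pick_place_pairs[i]
        cost += d(pos, pick) + d(pick, place)
        pos = place
    return cost

# Three fruit -> plate moves at illustrative grid positions.
pairs = [((5, 0), (5, 5)), ((1, 0), (1, 5)), ((3, 0), (3, 5))]
instruction_order_cost = plan_cost((0, 0), pairs, [0, 1, 2])
best_cost = min(plan_cost((0, 0), pairs, list(p)) for p in permutations(range(3)))
# best_cost < instruction_order_cost: visiting nearby fruits first saves travel.
```

Since the placement order is unconstrained in such tasks, any of the $3! = 6$ orders is feasible, and a cost-aware planner is free to pick the cheapest one.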

As observed, both our approach and the LLM achieve a high success rate, which aligns with expectations given the task complexities. This confirms that our method is capable of adapting to various user instructions. Regarding cost, for the first four tasks, which require sequential actions, the costs are identical. For the latter four tasks, which allow multiple feasible solutions, our method consistently produces lower-cost paths, with the exception of task 5: in this simple task, the LLM also manages to create an optimal plan based on the placement of the fruits.