Eval Long Horizon Execution Issue 1056 Github

Leo Migdal

This project contains the code accompanying the paper "The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs". If you like our work, consider citing us!

To use the OpenRouter API, set your API key (a hedged sketch is given after this paragraph). The project uses a hierarchical configuration system; a comprehensive example covering all configuration options is provided in the repository. Toolathlon is a benchmark for assessing language agents' general tool use in realistic environments. It features 600+ diverse tools based on real-world software environments, and each task requires long-horizon tool calls to complete.
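Here is that sketch, a minimal illustration in which the environment-variable name, the configuration field names, and the dictionary-merge override behavior are all assumptions rather than the project's documented interface:

```python
import os

# Assumption: the key is read from the OPENROUTER_API_KEY environment
# variable; check the project's configs for the exact name it expects.
os.environ.setdefault("OPENROUTER_API_KEY", "sk-or-...")

# Hypothetical illustration of a hierarchical configuration: task-level
# settings override global defaults. The field names are illustrative only.
global_config = {"model": "openai/gpt-4o", "temperature": 0.0, "max_turns": 50}
task_config = {"max_turns": 200}                 # per-task override
effective_config = {**global_config, **task_config}
print(effective_config)
# {'model': 'openai/gpt-4o', 'temperature': 0.0, 'max_turns': 200}
```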

Below we show a demo task where the agent needs to automatically check assignments in the email inbox and grade them on Canvas. If you are unable or unwilling to install docker/podman but still want to try our benchmark, please refer to README_nodocker.md. Make sure you have uv installed; otherwise, please install it first. We provide a single command to install everything, and we maintain the environment with uv. Each task is executed in its own separate container.

We assume you have docker or podman installed and correctly configured; please specify which of the two you use in configs/global_configs.py. This project contains the dataset accompanying the paper "The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs". Does continued scaling of large language models (LLMs) yield diminishing returns? Real-world value often stems from the length of task an agent can complete. We start this work by observing the simple but counterintuitive fact that marginal gains in single-step accuracy can compound into exponential improvements in the length of a task a model can successfully complete.
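To see why, consider a hedged back-of-the-envelope model (an illustration of the argument, not a result from the paper): if a model succeeds at each step independently with probability p, the longest task it completes at a 50% success rate is H = ln(0.5) / ln(p), which grows roughly like 1/(1 - p).

```python
import math

def task_length_at(p, success_rate=0.5):
    """Task length H at which cumulative success p**H falls to `success_rate`,
    assuming independent per-step accuracy p."""
    return math.log(success_rate) / math.log(p)

for p in (0.90, 0.99, 0.999):
    print(f"per-step accuracy {p:.3f} -> ~{task_length_at(p):.0f} steps at 50% success")
# per-step accuracy 0.900 -> ~7 steps at 50% success
# per-step accuracy 0.990 -> ~69 steps at 50% success
# per-step accuracy 0.999 -> ~693 steps at 50% success
```

Under this simple independence assumption, moving from 99% to 99.9% per-step accuracy is a marginal single-step gain, yet it stretches the 50%-success horizon from roughly 69 steps to roughly 693.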

Then, we argue that failures of LLMs when simple tasks are made longer arise from mistakes in execution, rather than an inability to reason. We propose isolating execution capability by explicitly providing the knowledge and plan needed to solve a long-horizon task. We find that larger models can correctly execute significantly more turns even when small models have 100% single-turn accuracy. We observe that the per-step accuracy of models degrades as the number of steps increases. This is not just due to long-context limitations -- curiously, we observe a self-conditioning effect: models become more likely to make mistakes when the context contains their errors from prior turns. Self-conditioning is not mitigated by simply scaling up model size.

In contrast, recent thinking models do not self-condition, and can also execute much longer tasks in a single turn. We conclude by benchmarking frontier thinking models on the length of task they can execute in a single turn. Overall, by focusing on the ability to execute, we hope to reconcile debates on how LLMs can solve complex reasoning problems yet fail at simple tasks when made longer, and highlight the massive benefits...

GitHub: https://github.com/long-horizon-execution/measuring-execution/

This dataset is a synthetic benchmark designed to measure the pure execution capability of LLMs over long horizons. The core task is key-value dictionary addition.

A fixed, in-context dictionary mapping five-letter English words (keys) to integer values is provided in dictionary.json. The model's goal is to maintain a running sum: in each turn, it receives one or more keys (the number is set by the turn complexity, K), retrieves their corresponding values from the dictionary, adds them to the running sum, and outputs the new sum. The primary evaluation metric is task length: the number of steps a model can execute before its accuracy drops below a given threshold. The dataset is designed to be programmatically generated and is thus contamination-free. We only provide 100 samples here for ease of access, but more can be generated with the generation script in the accompanying repository.
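A minimal sketch of the task format, in which the dictionary contents, the sample schema, and the simplified scoring function are assumptions; see dictionary.json and the generation script in the repository for the real versions:

```python
import random

# Hypothetical stand-in for dictionary.json: five-letter words -> integers.
DICTIONARY = {"apple": 3, "bread": -7, "cloud": 12, "drift": 5, "ember": -2}

def generate_turns(num_turns, K, seed=0):
    """Yield (keys, expected_running_sum) pairs: each turn presents K keys
    whose values must be added to the running sum."""
    rng = random.Random(seed)
    running_sum = 0
    for _ in range(num_turns):
        keys = rng.choices(sorted(DICTIONARY), k=K)
        running_sum += sum(DICTIONARY[k] for k in keys)
        yield keys, running_sum

def executed_steps(model_sums, expected_sums):
    """Simplified task-length measure: consecutive correct turns before the
    first error (the paper's metric instead thresholds per-step accuracy)."""
    steps = 0
    for got, want in zip(model_sums, expected_sums):
        if got != want:
            break
        steps += 1
    return steps

for keys, expected in generate_turns(num_turns=3, K=2):
    print(keys, "->", expected)
```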


This work introduces a framework harnessing the capabilities of Large Language Models (LLMs) to generate primitive task conditions for generalizable long-horizon manipulation with novel objects and unseen tasks. These task conditions serve as guides for the generation and adjustment of Dynamic Movement Primitive (DMP) trajectories for long-horizon task execution. We further create a challenging robotic manipulation task suite based on Pybullet for long-horizon task evaluation. Extensive experiments in both simulated and real-world environments demonstrate the effectiveness of our framework on both familiar tasks involving new objects and novel but related tasks, highlighting the potential of LLMs in enhancing robotic...

We leverage LLMs to generate and generalize primitive task conditions for both familiar tasks with novel objects and novel but related tasks. The high-level task conditions then guide the generation and adjustment of low-level trajectories, originally learned from demonstrations, for long-horizon task execution. We evaluate the ability of our framework to generate and generalize task conditions on all 10 primitive tasks. The LLM (GPT-3.5) is provided with condition examples, and we compare against task conditions generated from the environments (methods explained in the paper). A successfully generated task condition should contain accurate and sufficient information to guide the execution of the primitive task.

To evaluate our framework's ability to execute long-horizon tasks using DMP trajectories generated and adjusted by task conditions, we design a challenging Robotic Manipulation Task Suite in Pybullet. The environment consists of two 7-DoF robots (Franka and Kinova) in a kitchen scene with various interactive objects. It contains 10 diverse primitive tasks (37 if counting different objects) and 4 long-horizon tasks in simulation.

Current mobile manipulation methods struggle to generalize skills across various objects and environments and to execute long-horizon tasks reliably. Many approaches either rely on simplified pick-and-place actions or lack the ability to handle complex real-world scenarios.

Existing imitation learning methods often suffer from compounding errors during long sequences, which limits their applicability to more involved real-world problems. WildLMa addresses these issues with VR teleoperation for data collection, a language-conditioned imitation learning method that enhances skill generalizability, and a skill library that an LLM planner composes for complex task execution. This approach leads to significantly improved success rates on various manipulation tasks, demonstrating the ability to handle long-horizon tasks robustly and generalize to unseen situations. This methodology represents a significant step towards creating more versatile and capable mobile robots. This paper is important because it tackles the challenging problem of long-horizon, generalizable mobile manipulation in real-world environments.

It presents a novel framework that significantly advances the state of the art by combining high-quality training data, a language-conditioned imitation learning approach, and an LLM planner for complex task execution. This work opens avenues for broader applications of robots in unstructured environments and inspires future research on improving robot adaptability and autonomy.

Figure: WildLMa uses a quadruped robot with a whole-body controller and imitation learning to perform in-the-wild manipulation. (a) The robot performing long-horizon loco-manipulation tasks in various indoor and outdoor environments. (b) Collecting training data for imitation learning via teleoperation. (c) A library of learned skills that can be composed by a Large Language Model (LLM) planner to perform more complex tasks.

Recent advances in vision-language models (VLMs) have enabled instruction-conditioned robotic systems with improved generalization. However, most existing work focuses on reactive System 1 policies, underutilizing VLMs' strengths in semantic reasoning and long-horizon planning. These System 2 capabilities, characterized by deliberative, goal-directed thinking, remain underexplored due to the limited temporal scale and structural complexity of current benchmarks. To address this gap, we introduce RoboCerebra, a benchmark for evaluating high-level reasoning in long-horizon robotic manipulation.

RoboCerebra includes: (1) a large-scale simulation dataset with extended task horizons and diverse subtask sequences in household environments; (2) a hierarchical framework combining a high-level VLM planner with a low-level vision-language-action (VLA) controller; and... The dataset is constructed via a top-down pipeline, where GPT generates task instructions and decomposes them into subtask sequences. Human operators execute the subtasks in simulation, yielding high-quality trajectories with dynamic object variations. Compared to prior benchmarks, RoboCerebra features significantly longer action sequences and denser subtask annotations, forming a more rigorous testbed for evaluating System 2 reasoning. RoboCerebra contains 1,000 human-annotated trajectories across 100 task variants, each spanning up to 3,000 simulation steps. Tasks cover everyday household activities (e.g., preparing drinks, tidying groceries) and are annotated with fine-grained subtask boundaries, temporal segments, and dynamic scene variations.

At inference time, the VLM parses a high-level task instruction into a sequence of step-level subgoals, which are stored in a memory bank. The VLA continuously queries the active subgoal and executes corresponding low-level actions based on high-frequency visual observations. Concurrently, the VLM periodically attends to recent observations to monitor execution progress. Upon detecting subgoal completion or deviation, it updates the memory with the next subgoal or issues a refined instruction. This closed-loop coordination preserves temporal abstraction while ensuring reactive control, enabling robust and interpretable performance in long-horizon tasks. We evaluate each method over 600 rollouts (60 tasks × 10 trials).
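A hedged sketch of this closed loop in plain Python; every class, method, and status string here is a placeholder rather than RoboCerebra's actual API:

```python
from collections import deque

def run_episode(vlm_planner, vla_controller, env, instruction, check_every=50):
    """Hierarchical closed loop: the VLM decomposes the instruction into
    step-level subgoals held in a memory bank; the VLA executes the active
    subgoal at high frequency; the VLM periodically monitors progress and
    updates the memory bank."""
    memory = deque(vlm_planner.plan(instruction))     # step-level subgoals
    obs = env.reset()
    step = 0
    while memory and not env.done():
        subgoal = memory[0]                           # active subgoal
        action = vla_controller.act(obs, subgoal)     # low-level, high-frequency control
        obs = env.step(action)
        step += 1
        if step % check_every == 0:                   # periodic System 2 progress check
            status = vlm_planner.monitor(obs, subgoal)
            if status == "completed":
                memory.popleft()                      # advance to the next subgoal
            elif status == "deviated":
                memory[0] = vlm_planner.refine(obs, subgoal)
    return env.success()
```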

For fair comparison across planning models, we define a set of anchor points that determine when System 1 transitions between subgoals. These anchor-aligned transitions decouple step-switching from the model, allowing consistent temporal granularity across models.
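As a hedged illustration of what anchor-aligned switching means in practice (the subgoals, anchor timesteps, and helper function below are invented for this sketch):

```python
# Hypothetical illustration: subgoal switching is pinned to fixed anchor
# timesteps, so every planner is evaluated at the same temporal granularity.
subgoals = ["open fridge", "grasp bottle", "place bottle on table"]
anchors = [0, 120, 260]   # simulation step at which each subgoal becomes active

def active_subgoal(step):
    """Return the subgoal whose anchor is the latest one not after `step`."""
    idx = max(i for i, a in enumerate(anchors) if a <= step)
    return subgoals[idx]

for step in (0, 119, 120, 300):
    print(step, "->", active_subgoal(step))
# 0 -> open fridge, 119 -> open fridge, 120 -> grasp bottle, 300 -> place bottle on table
```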
