
How to train LLM agents that can reliably plan, call tools, and recover from their own mistakes.
In the evolution of AI agents, we are witnessing a distinct phase shift. We are moving past the era of single-turn query-response interactions, as implemented in simple assistants, into a domain defined by sequential agency. We are no longer just asking models to answer a question; we are asking them to interact with an environment over many steps to complete tasks that are multi-step, open-ended, and tool-heavy, for example:
These are multi-turn, sequential decision problems: the agent must decide which tool to call, when, and with what arguments, potentially dozens of times before the task is complete. The environment state evolves with every tool call.
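To make this concrete, here is a minimal sketch of such a loop in Python; `llm.act`, `env.step`, and the message format are hypothetical placeholders rather than any specific framework's API:

```python
# Minimal sketch of one multi-turn, tool-using episode.
# `llm` and `env` are hypothetical placeholders, not a specific framework.

def run_episode(llm, env, task_prompt, max_turns=30):
    messages = [{"role": "user", "content": task_prompt}]
    for _ in range(max_turns):
        # The policy decides which tool to call, with what arguments,
        # or whether to stop and give a final answer.
        action = llm.act(messages)  # e.g. {"tool": "search", "args": {"query": "..."}}
        messages.append({"role": "assistant", "content": action})
        if action["tool"] == "final_answer":
            break
        # Every tool call changes the environment state the agent observes next.
        observation = env.step(action["tool"], action["args"])
        messages.append({"role": "tool", "content": observation})
    return messages  # the trajectory later used for reward and training
```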
In this regime, Supervised Fine-Tuning (SFT) on “golden traces” is often not enough. We need agents that learn from interaction – from success, failure, and everything in between. That is exactly what Reinforcement Learning (RL) provides.
This post lays out how to think about multi-turn RL for sequential tool use, how to design reward functions that don’t explode in your face, and a set of practical recipes that we have found to work in real systems.
Before we talk about training, it’s worth contrasting the single-turn and multi-turn settings.
| Feature | Single Turn | Multi Turn |
|---|---|---|
| Interaction space | Small, fairly predictable (e.g., “call one API once”) | Combinatorial explosion of tool sequences and arguments |
| Credit assignment | Trivial: the one call was right or wrong | Hard: final failure may come from any earlier decision |
| Training method | SFT usually works quite well | SFT struggles; RL is typically needed |
| Example | “Use search once and answer” | “Perform deep research with many calls, cross-checks, and revisions” |
A common workaround is to approximate multi-turn tasks by decomposing them into single-turn subproblems – for example, treat each retrieval call as an independent “episode” with its own label (“did we retrieve a relevant document?”). This is viable only if you can define good partial rewards on each step.
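As a purely illustrative example of such a step-level label, each retrieval call can be scored on its own:

```python
# Illustrative only: treat each retrieval call as its own single-turn "episode"
# and reward it based on whether any retrieved document is labeled relevant.
def retrieval_step_reward(retrieved_ids, relevant_ids):
    return 1.0 if any(doc_id in relevant_ids for doc_id in retrieved_ids) else 0.0
```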
As tasks become longer-horizon and more open-ended, decomposition breaks down: what matters is whether the final outcome is good, not whether each individual step looks locally reasonable.
That’s where full multi-turn RL comes in.
Supervised Fine-Tuning takes a dataset of expert trajectories – “golden paths” where a human or stronger model has already solved the task – and trains the model to imitate those traces token-by-token.
This has clear benefits:
But SFT has equally clear limitations in the multi-turn setting:
In other words, SFT learns a map of known roads, but not how to navigate when you miss a turn or a road is blocked.
RL, in contrast, optimizes the policy to maximize expected cumulative reward in the actual environment:

$$
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{T} r(s_t, a_t)\right]
$$
Crucially, RL is not purely an alternative to SFT; it is usually stacked on top of an SFT-ed model. SFT gives you a warm start; RL teaches the model to adapt, generalize, and self-correct.
In multi-turn, tool-heavy settings, RL is usually the way to go. SFT alone quickly hits a ceiling.
A realistic multi-turn RL training loop for LLM agents looks more like a distributed systems diagram than a simple “environment + agent” cartoon.
Schematically, prompts flow into a trajectory generator, which talks to an inference service (the LLM), an environment (the tools), and a trainer, forming a closed loop.
Let’s unpack the key components.

The trajectory generator:
The output is a trajectory (a rollout): a sequence of user-assistant-tool messages capturing the interactions between different roles in the conversation.
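Concretely, a rollout can be stored as a plain list of role-tagged messages; the exact fields below are illustrative:

```python
# Illustrative shape of a single trajectory (rollout).
trajectory = [
    {"role": "user", "content": "Compare approaches A, B, and C for task X."},
    {"role": "assistant", "content": "I'll start by searching for approach A.",
     "tool_call": {"name": "search", "args": {"query": "approach A overview"}}},
    {"role": "tool", "name": "search", "content": "...search results..."},
    # ... more assistant / tool turns ...
    {"role": "assistant", "content": "Final answer: ..."},
]
```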
The environment is everything outside the LLM:
In production-adjacent settings, environment issues are a major pain point:
Part of making multi-turn RL work in practice is stabilizing the environment: pinning versions, caching tool calls where possible, and standardizing error handling.
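One way to do this, sketched below with a hypothetical wrapper (not a specific library), is to put every tool behind a thin layer that caches calls and normalizes errors:

```python
import hashlib
import json

class StableToolWrapper:
    """Caches tool calls and converts raw exceptions into a standard error payload."""

    def __init__(self, tool_fn, cache=None):
        self.tool_fn = tool_fn
        self.cache = cache if cache is not None else {}

    def __call__(self, **kwargs):
        key = hashlib.sha256(json.dumps(kwargs, sort_keys=True).encode()).hexdigest()
        if key in self.cache:                      # replay identical calls deterministically
            return self.cache[key]
        try:
            result = {"ok": True, "data": self.tool_fn(**kwargs)}
        except Exception as exc:                   # standardized error handling
            result = {"ok": False, "error": type(exc).__name__, "message": str(exc)}
        self.cache[key] = result
        return result
```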
Rewards are applied to trajectories to score their quality. They can be specified in several ways, including:
Given batches of trajectories and associated rewards, the trainer performs policy gradient updates with some form of KL-regularization to avoid catastrophic drift from the base model.
In practice, this often looks like a PPO-style or GRPO-style clipped objective with a KL penalty, of the form:

$$
J(\theta) = \mathbb{E}_t\!\left[\min\!\Big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t\Big)\right] - \beta\, D_{\mathrm{KL}}\!\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big)
$$

where

- $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$ is the likelihood ratio between the current policy and the policy that generated the rollout,
- $\hat{A}_t$ is the advantage estimate (in GRPO, obtained by normalizing rewards within a group of rollouts for the same prompt),
- $\epsilon$ is the clipping range, and
- $\beta$ controls the KL penalty against the reference (SFT) model $\pi_{\mathrm{ref}}$.
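A minimal PyTorch-style sketch of that update; the tensor shapes, the group-normalized advantage, and the simple KL estimate are assumptions rather than any particular trainer's implementation:

```python
import torch

def policy_loss(logp_new, logp_old, logp_ref, rewards, clip_eps=0.2, kl_coef=0.05):
    """KL-regularized, clipped policy-gradient loss over a group of rollouts.

    logp_new / logp_old / logp_ref: per-trajectory sums of token log-probs under
    the current, rollout, and reference policies. rewards: one scalar per trajectory.
    """
    # GRPO-style advantage: normalize rewards within the group of rollouts.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    pg_loss = -torch.min(unclipped, clipped).mean()

    # Crude per-sample KL estimate against the reference (base/SFT) policy.
    kl = (logp_new - logp_ref).mean()
    return pg_loss + kl_coef * kl
```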
Putting this all together, we now have quite a few components to juggle. Yet once you have this loop working, you gain a very powerful capability: end-to-end optimization of agent behavior in realistic environments.
Reward design is where most agentic RL projects succeed or fail. A core axis is partial (step-level) vs. trajectory-level (episodic) rewards.
Here we assign rewards to intermediate steps, such as:
This yields dense signal, which is good for exploration and sample efficiency: the agent does not need to stumble upon a full successful trajectory to start receiving gradient signals.
However, partial rewards come with serious downsides:
At the other extreme, we assign reward only at the end:
This directly optimizes what we care about: did the agent solve the task? It is robust to many forms of proxy gaming, since there are no intermediate targets to overfit.
The downside is that the signal is sparse and high-variance:
In real-world deployments, trajectory-level rewards tend to be more practical, precisely because defining robust partial rewards across a complex multi-turn workflow is extremely difficult. It is already a non-trivial challenge to define “success” at the end; doing so at every intermediate step can quickly become intractable.
A useful pattern is:
Given the above, a pragmatic reward function for a multi-turn agent often looks like a weighted sum of a few components:
A critical design principle is: The final outcome reward must dominate. If an intermediate metric – like “successful search call” – is rewarded too heavily, the agent may converge to a degenerate policy: repeatedly call search in ways that look locally good to the heuristic, without ever finishing the task.
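As a sketch, with illustrative weights chosen so the outcome term dominates (the inputs would come from task-specific checkers):

```python
def trajectory_reward(solved: bool, well_formed_calls: bool, redundant_calls: int) -> float:
    """Illustrative weighted reward for one trajectory."""
    reward = 1.0 * float(solved)                # dominant term: final task outcome
    reward += 0.05 * float(well_formed_calls)   # small bonus for valid tool-call formatting
    reward -= 0.01 * redundant_calls            # mild penalty for wasted tool calls
    return reward
```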
The safe recipe is “start simple”:
Training LLM agents with multi-turn RL is fragile. Over time, a few heuristics have emerged as consistently useful.
You need a base model that has non-trivial zero-shot success on the task – on the order of ~20% out-of-the-box.
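A cheap way to check this before committing to RL is to estimate the base model's success rate over a sample of tasks; `run_episode` and `is_success` here are hypothetical helpers:

```python
def zero_shot_success_rate(llm, env, tasks, is_success, samples_per_task=4):
    """Estimate the base model's task success rate before any RL training."""
    wins = 0
    total = 0
    for task in tasks:
        for _ in range(samples_per_task):
            trajectory = run_episode(llm, env, task)  # same rollout loop as above
            wins += int(is_success(trajectory, task))
            total += 1
    return wins / max(total, 1)
```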
If your current model is below that threshold, you have two options:
When running RL for many thousands of updates, the final checkpoint is often not the best. RL can over-optimize for quirks of the training distribution or reward function.
A better practice is to:
Many mysterious RL failures are actually environment failures:
To mitigate this:
Exploration is crucial: the agent must try sufficiently diverse trajectories to find successful ones.
Practical knobs include:
The goal is to expand the support of behavior while keeping it within a manageable KL radius of the base model.
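For instance, the rollout and trainer settings might expose knobs like these (the names are generic placeholders, not a specific trainer's configuration):

```python
# Illustrative exploration settings for rollout generation.
rollout_config = {
    "temperature": 1.0,         # higher values produce more diverse trajectories
    "top_p": 0.95,
    "rollouts_per_prompt": 8,   # several samples per task so at least some succeed
}

# The KL penalty in the trainer bounds how far exploration can drift
# from the base model (the "KL radius").
trainer_config = {
    "kl_coef": 0.05,
}
```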
If tasks are too easy, you don’t need RL; if they are too hard, RL won’t help because the agent never sees reward.
A simple curriculum strategy:
Large discrepancies between training and production environments are a recipe for disappointment. Whenever feasible:
It is tempting to obsess over algorithmic variants (PPO vs. GRPO vs. RLOO, etc.). While these matter, in practice the factors discussed above – reward design, environment stability, exploration, and task difficulty – dominate overall performance. Algorithm choice is second-order.
To make this concrete, consider training a deep research agent: given a complex query (“Summarize the current state of X and compare approaches A, B, C”), the agent:

Training such an agent with multi-turn RL yields a learning curve like the one shown in the plot above: starting from a baseline of around 0.5 reward, steadily climbing, and eventually surpassing a frontier-model baseline (shown as a dashed red line). The improvement is roughly monotonic but noisy: there are plateaus and small regressions as the policy explores and the reward landscape shifts.
This illustrates the main value proposition of multi-turn RL for tool use: you can turn a weaker LLM into a specialized agent that outperforms frontier models on specific complex workflows, purely by optimizing how it interacts with tools and structures its reasoning over multiple turns.
If you’re thinking about building multi-turn, tool-using agents, here are the main lessons:
We are still early in understanding the full design space of long-horizon LLM agents. But the pattern is already clear: as we move from “models that answer questions” toward agents that act, multi-turn RL for sequential tool use will be one of the central levers for pushing capabilities forward.