TRL Integration Examples

This directory contains examples showing how to use reward-kit reward functions with Hugging Face’s Transformer Reinforcement Learning (TRL) library.

Overview

TRL is a library designed for fine-tuning language models using reinforcement learning techniques. It includes implementations of various RL algorithms including:

GRPO (Group Relative Policy Optimization) - used in DeepSeek-R1
PPO (Proximal Policy Optimization)
DPO (Direct Preference Optimization)
ORPO (Optimal RLHF Policy Optimization)

These examples demonstrate how to integrate reward-kit’s reward functions with TRL for model fine-tuning. This document serves as the primary guide for the scripts found in the examples/trl_integration/ directory.

Example Scripts in `examples/trl_integration/`

The examples/trl_integration/ directory contains several Python scripts:

grpo_example.py: Demonstrates using reward functions with the Group Relative Policy Optimization (GRPO) trainer from TRL. This is a key example showing:
- Creating reward functions compatible with reward-kit.
- Converting them to the TRL-compatible format.
- Combining multiple reward functions.
- Dataset preparation for TRL.
- Setting up the GRPO trainer.
ppo_example.py: Likely demonstrates integration with TRL’s Proximal Policy Optimization (PPO) trainer.
minimal_deepcoder_grpo_example.py: A more focused GRPO example, possibly related to the DeepCoder dataset or a simplified setup.
working_grpo_example.py: Another GRPO variant, perhaps a more tested or stable version.
convert_dataset_to_jsonl.py: A utility script for dataset preparation.
trl_adapter.py: Contains adapter logic, likely used by the example scripts.
test_trl_integration.py: Pytest file for testing the integration.

Running the Examples

Most examples are run directly as Python scripts. Ensure your virtual environment is active (source .venv/bin/activate). 1. GRPO Example (grpo_example.py) This example demonstrates using reward functions with the Group Relative Policy Optimization (GRPO) trainer from TRL. It shows:

Creating format and accuracy reward functions compatible with reward-kit
Converting these reward functions to TRL-compatible format
Combining multiple reward functions with weights
Preparing a dataset in the format expected by TRL
Setting up the GRPO trainer with these reward functions

# Run the GRPO example
python examples/trl_integration/grpo_example.py

(Check the script itself or accompanying comments for any specific dataset or model requirements.) 2. Other Examples (e.g., ppo_example.py, minimal_deepcoder_grpo_example.py)

# Example for PPO
python examples/trl_integration/ppo_example.py

# Example for Minimal DeepCoder GRPO
python examples/trl_integration/minimal_deepcoder_grpo_example.py

(Always refer to the comments within each script for specific instructions or dependencies, as they might vary.)

Prerequisites

To run these examples you’ll need the optional TRL dependencies:

pip install "reward-kit[trl]"

For the GRPO example, you might also need:

pip install math_verify  # For math verification in the reward function

Key Concepts

Reward Function Format

TRL expects reward functions that:

Accept batch inputs (either string lists or message arrays)
Return a list of float reward values
Handle both completion-only inputs and full conversation histories

Dataset Preparation

The examples show how to:

Load datasets from Hugging Face
Format them for TRL (adding system prompts, mapping fields)
Structure datasets properly for reward functions

Reward Combining

For fine-tuning models like DeepSeek-R1, multiple reward functions are typically combined:

Format rewards (encouraging proper tag usage)
Content rewards (accuracy, helpfulness, etc.)

The examples demonstrate proper weighting and normalization techniques.

Integration with reward-kit’s RewardFunction Class

reward-kit’s RewardFunction class includes a get_trl_adapter() method that converts any reward function into the format expected by TRL. This makes it easy to use existing reward functions from reward-kit with TRL trainers.

Get Started

Deployments

Models & Inference

Fine Tuning

Administration

Security & Compliance

Integrations

Trl integration overview

TRL Integration Examples

Overview

Example Scripts in `examples/trl_integration/`

Running the Examples

Prerequisites

Key Concepts

Reward Function Format

Dataset Preparation

Reward Combining

Integration with reward-kit’s RewardFunction Class

References

Get Started

Deployments

Models & Inference

Fine Tuning

Administration

Security & Compliance

Integrations

​TRL Integration Examples

​Overview

​Example Scripts in examples/trl_integration/

​Running the Examples

​Prerequisites

​Key Concepts

​Reward Function Format

​Dataset Preparation

​Reward Combining

​Integration with reward-kit’s RewardFunction Class

​References

TRL Integration Examples

Overview

Example Scripts in `examples/trl_integration/`

Running the Examples

Prerequisites

Key Concepts

Reward Function Format

Dataset Preparation

Reward Combining

Integration with reward-kit’s RewardFunction Class

References