Evaluation Workflows
This guide explains the lifecycle of developing, testing, and deploying reward functions and evaluation setups within the Reward Kit.
Development Workflow Overview
The typical workflow involves:
- Dataset Configuration: Defining how your data is loaded and prepared (see Dataset Configuration Guide).
- Reward Function Implementation: Writing the logic to evaluate model responses.
- Local Evaluation (using `reward-kit run`): Running evaluations locally with Hydra-based configurations to generate responses and score them.
- Previewing Results (using `reward-kit preview`): Inspecting or re-evaluating generated outputs.
- Deployment: Making the reward function or evaluator available as a service.
- Integration: Using the deployed evaluator in RLHF training or other workflows.
1. Dataset Configuration
Before evaluation, you need to configure your dataset. This involves setting up YAML files (typically in `conf/dataset/` or an example’s `conf/dataset/` directory) to define how raw data is sourced, processed, and formatted (e.g., adding system prompts).
Refer to the Dataset Configuration Guide for detailed instructions.
2. Reward Function Implementation
Create your reward function using the `@reward_function` decorator, or by structuring your evaluation logic within a script that can be called by an evaluation configuration.
Example: Basic Reward Function
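The sketch below is a minimal, illustrative reward function. The decorator comes from the guide; the result types and their fields (`EvaluateResult`, `MetricResult`, and the `messages` parameter shape) are assumptions based on common reward-kit usage, so check them against your installed version.

```python
from reward_kit import reward_function, EvaluateResult, MetricResult


@reward_function
def word_count_reward(messages, **kwargs):
    """Score the assistant's last message with a simple length heuristic."""
    response = messages[-1].get("content", "") if messages else ""
    word_count = len(response.split())
    # Reward longer responses up to 100 words; cap the score at 1.0.
    score = min(word_count / 100.0, 1.0)
    return EvaluateResult(
        score=score,
        reason=f"Response contains {word_count} words.",
        metrics={
            # Field names on MetricResult are assumed; adjust to your version.
            "word_count": MetricResult(
                score=score,
                success=word_count > 0,
                reason=f"{word_count} words",
            )
        },
    )
```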
This function can then be referenced in your evaluation configuration.
3. Local Evaluation with reward-kit run
The primary method for running local evaluations is the `reward-kit run` CLI command, which uses Hydra for configuration. This command handles generating model responses (if needed) and evaluating them according to your specified dataset and reward logic.
Setting up the Configuration
You’ll need a main evaluation configuration YAML file (e.g., `run_my_eval.yaml`) that specifies:
- The dataset to use (referencing configurations from `conf/dataset/`).
- Model generation parameters (model name, API keys, etc.).
- The reward function or evaluation script to use.
- Other evaluation parameters (e.g., sample limits).
Refer to the Hydra Configuration for Examples guide and to specific examples such as `examples/math_example/conf/run_math_eval.yaml`.
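As a rough sketch only, a run configuration might look like the following. The key names here are assumptions modeled on the shipped examples, not a canonical schema; use the example configs as the source of truth.

```yaml
# run_my_eval.yaml -- illustrative only; key names may differ in your version.
defaults:
  - dataset: my_dataset        # references conf/dataset/my_dataset.yaml
  - _self_

generation:
  model_name: my-org/my-model  # model used to generate responses
  temperature: 0.0

evaluation:
  reward_function: my_project.rewards.word_count_reward

limit_samples: 50              # cap the number of evaluated samples
```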
Running the Evaluation
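A typical invocation points Hydra at your config directory and config name. The flag names below follow Hydra's conventions and are assumptions; confirm them with `reward-kit run --help`.

```bash
# Run the evaluation described by ./conf/run_my_eval.yaml (paths illustrative).
reward-kit run --config-path ./conf --config-name run_my_eval
```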
This command will:
- Load the dataset as per your configuration.
- Generate responses from the specified model.
- Apply the configured reward function(s).
- Save detailed results (e.g., `run_my_eval_results.jsonl`) and prompt/response pairs (e.g., `preview_input_output_pairs.jsonl`) to a timestamped output directory (usually under `outputs/`).
4. Previewing and Analyzing Results
After a `reward-kit run`, you can use `reward-kit preview` to inspect the generated `preview_input_output_pairs.jsonl` or re-evaluate them with different or updated metrics.
Using the CLI for Preview
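For example (the flag names and the `name=path` form are assumptions based on common usage; check `reward-kit preview --help` for the exact interface):

```bash
# Re-score previously generated prompt/response pairs with a local metric.
reward-kit preview \
  --samples ./outputs/<run_dir>/preview_input_output_pairs.jsonl \
  --metrics-folders "word_count=./metrics/word_count"
```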
This is useful for iterating on reward functions or comparing different evaluation approaches on the same set of generated responses.
Programmatic Analysis
You can also load the `*.jsonl` result files programmatically (e.g., with Pandas) for custom analysis, plotting, or reporting.
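A minimal sketch: the path below is illustrative, and the `score` column name is an assumption that depends on your reward function's output.

```python
import pandas as pd

# Load the JSONL results produced by `reward-kit run` (path is illustrative).
results = pd.read_json("outputs/my_run/run_my_eval_results.jsonl", lines=True)

# Summarize and inspect the lowest-scoring samples ("score" column assumed).
print(results["score"].describe())
print(results.sort_values("score").head())
```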
5. Deployment
Once your reward function is developed and tested locally, you can deploy it as an evaluator. The primary methods are the `deploy()` method on a reward function object and the `reward-kit deploy` CLI command.
Using the deploy() Method (Programmatic)
If you have a reward function object (created with `@reward_function`), you can deploy it directly:
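A minimal sketch, assuming the decorator attaches a `.deploy()` method that takes an evaluator name; the argument name is an assumption, so consult the API reference for the exact signature.

```python
from my_project.rewards import word_count_reward  # the @reward_function from above

# Deploy the reward function as a hosted evaluator (name is illustrative).
evaluator = word_count_reward.deploy(name="word-count-evaluator")
print(evaluator)
```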
You can also specify providers if needed:
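For example (the `providers` argument and its fields are assumptions for illustration; verify them against your reward-kit version):

```python
# Illustrative only: provider configuration passed at deploy time.
evaluator = word_count_reward.deploy(
    name="word-count-evaluator",
    providers=[{"providerType": "fireworks_ai", "modelId": "my-base-model"}],
)
```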
Using the CLI (reward-kit deploy)
The `reward-kit deploy` command is suitable for deploying reward functions defined in script files. The `--metrics-folders` argument should point to the directory containing your reward function script (e.g., a `main.py` with the `@reward_function` decorator).
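For example (the `--id` and `--force` flags and the `name=path` form of `--metrics-folders` are assumptions based on common usage; run `reward-kit deploy --help` to confirm):

```bash
# Deploy the metric in ./metrics/word_count as an evaluator (values illustrative).
reward-kit deploy \
  --id word-count-evaluator \
  --metrics-folders "word_count=./metrics/word_count" \
  --force
```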
For more details on `reward-kit deploy`, see the CLI Reference.
Lower-level create_evaluation Function
For more direct control, or if you are not using the `@reward_function` decorator’s `deploy()` method, you can use the `create_evaluation` function from `reward_kit.evaluation`. This is generally for more advanced use cases or internal tooling.
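A sketch of direct usage follows. The keyword arguments shown are assumptions for illustration only, since this guide does not document the exact signature of `create_evaluation`.

```python
from reward_kit.evaluation import create_evaluation

# Argument names below are illustrative, not the confirmed API.
evaluator = create_evaluation(
    evaluator_id="word-count-evaluator",
    metric_folders=["word_count=./metrics/word_count"],
    display_name="Word Count Evaluator",
    description="Scores responses by word count.",
)
```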
6. Integration with Training
Using in an RL Training Job
Once deployed, use the evaluator in an RL training job:
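The details depend on your training platform; as a rough, hypothetical illustration, the deployed evaluator is typically referenced by its ID in the job configuration:

```yaml
# Hypothetical RL training job config -- field names are illustrative,
# not a specific platform's schema.
reward:
  evaluator_id: word-count-evaluator
```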
Programmatic Integration with TRL
For programmatic integration with the Transformer Reinforcement Learning (TRL) library:
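A minimal sketch of the glue code, assuming an older TRL API in which `PPOTrainer.step` accepts a list of scalar reward tensors, and assuming the reward function takes an OpenAI-style messages list and returns a result with a `.score` attribute (both assumptions; adapt to your TRL and reward-kit versions):

```python
import torch

from my_project.rewards import word_count_reward  # the @reward_function from above


def compute_rewards(prompts, completions):
    """Convert reward-kit scores into the list of scalar tensors TRL expects."""
    rewards = []
    for prompt, completion in zip(prompts, completions):
        messages = [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": completion},
        ]
        result = word_count_reward(messages=messages)
        # `.score` is assumed to be a float in [0, 1].
        rewards.append(torch.tensor(result.score))
    return rewards


# In an older-style TRL PPO loop, this would typically be used as:
#   stats = ppo_trainer.step(query_tensors, response_tensors,
#                            compute_rewards(prompts, completions))
```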
7. Best Practices
- Iterative Development: Start simple, test thoroughly, and refine your reward function. Use `reward-kit run` and `reward-kit preview` extensively.
- Version Control: Use version control for your reward functions, configurations, and datasets.
- Sample Diversity: Test with a diverse set of samples to ensure robustness.
- Documentation: Document the behavior and assumptions of your reward function.
- Error Handling: Include robust error handling in your reward logic to prevent evaluation failures.
- Logging: Add detailed logging within your reward functions for easier debugging.
Next Steps
Now that you understand the complete workflow:
- Try creating a Basic Reward Function
- Explore Advanced Reward Functions with multiple metrics