# Math with Formatting Example
This guide explains how to evaluate models on math word problems using the `reward-kit run` command, focusing on both the accuracy of the numerical answer and the adherence to a specific response format (e.g., `<think>...</think><answer>...</answer>`). This example uses the GSM8K dataset.
## Overview
The “Math with Formatting” example demonstrates a multi-metric evaluation:

- Accuracy Reward: Assesses if the extracted numerical answer is correct.
- Format Reward: Checks if the model’s response follows the prescribed XML-like structure for thoughts and the final answer. The final score reported is typically an average of these two rewards.
- Dataset: Uses the `gsm8k` dataset, configured via `gsm8k_math_with_formatting_prompts.yaml`, which adds specific system prompts to guide the model’s output format.
- Reward Logic: The core evaluation logic is in `examples/math_with_formatting/main.py`, referenced in the run configuration as `examples.math_with_formatting.main.evaluate`.
- System Prompt Example (from `gsm8k_math_with_formatting_prompts.yaml`): see the sketch below.
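The actual prompt text lives in the YAML file in the repository. The snippet below is only an illustrative sketch of what such an entry might look like; the key name and wording are assumptions, not the verbatim file contents:

```yaml
# Illustrative sketch only -- see gsm8k_math_with_formatting_prompts.yaml
# for the real prompt; the key name and wording here are assumptions.
system_prompt: >-
  Solve the problem step by step. Put your reasoning inside
  <think>...</think> tags and your final numerical answer inside
  <answer>...</answer> tags.
```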
## Setup
- Environment: Ensure your Python environment has `reward-kit` and its development dependencies installed (see the install snippet below).
- API Key: The default configuration (`run_math_with_formatting_eval.yaml`) uses a Fireworks AI model (e.g., `accounts/fireworks/models/qwen3-235b-a22b`). Ensure your `FIREWORKS_API_KEY` is set in your environment or a `.env` file.
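A typical editable install from the repository root might look like this; the `[dev]` extra name is an assumption about how the project labels its development dependencies:

```bash
# Install reward-kit in editable mode with development dependencies
# (the "[dev]" extra name is an assumption).
pip install -e ".[dev]"
```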
## Running the Evaluation
The primary configuration for this example is `examples/math_with_formatting/conf/run_math_with_formatting_eval.yaml`.
- Activate your virtual environment.
- Execute the `reward-kit run` command from the root of the repository.
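Assuming a virtual environment at `.venv` (the path is an assumption; adjust to your setup), the two steps look like this. The `--config-path`/`--config-name` flags follow Hydra’s CLI conventions and are an assumption about how this example is invoked:

```bash
# Step 1: activate the virtual environment (path is an assumption).
source .venv/bin/activate

# Step 2: run the evaluation with this example's configuration.
# Flag names follow Hydra conventions and are an assumption.
reward-kit run \
  --config-path examples/math_with_formatting/conf \
  --config-name run_math_with_formatting_eval
```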
### Overriding Parameters
You can modify parameters via the command line. For instance:

- Limit samples (the default in the example config is `limit_samples: 2`).
- Change the generation model.
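Both overrides might look like the following; the exact config keys (`evaluation_params.limit_samples`, `generation.model_name`) are assumptions based on common reward-kit run configs, so check the example YAML for the real key paths:

```bash
# Limit the number of evaluated samples (key path is an assumption).
reward-kit run \
  --config-path examples/math_with_formatting/conf \
  --config-name run_math_with_formatting_eval \
  evaluation_params.limit_samples=5

# Use a different Fireworks model (key path is an assumption).
reward-kit run \
  --config-path examples/math_with_formatting/conf \
  --config-name run_math_with_formatting_eval \
  generation.model_name=accounts/fireworks/models/llama-v3p1-8b-instruct
```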
## Expected Output
The command will:

- Load the GSM8K dataset as configured by `gsm8k_math_with_formatting_prompts.yaml`.
- Generate model responses using the specified model (default: `qwen3-235b-a22b`).
- Evaluate responses using the logic in `examples.math_with_formatting.main.evaluate`, which combines accuracy and format checks.
- Print a summary to the console.
- Save detailed results to a JSONL file (e.g., `math_with_formatting_example_results.jsonl`) in a timestamped directory under `outputs/` (the exact path is determined by Hydra, typically based on the current date/time).
- Save prompt/response pairs to `preview_input_output_pairs.jsonl` in the same output directory.

Each result record includes an `evaluation_score` (the average of accuracy and format) and a breakdown in `evaluation_metrics` for `accuracy_reward` and `format_reward`.
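For orientation, a single line of the results file might contain fields shaped roughly like this; only the field names come from the description above, and the exact record layout is determined by reward-kit:

```json
{
  "evaluation_score": 1.0,
  "evaluation_metrics": {
    "accuracy_reward": 1.0,
    "format_reward": 1.0
  }
}
```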
## Key Components
- `examples/math_with_formatting/main.py`: Contains the `evaluate()` function with the core reward logic, including:
  - `accuracy_reward_fn`: Extracts and compares numerical answers.
  - `format_reward_fn`: Checks for the `<think>...</think><answer>...</answer>` structure.
- Dataset Configuration: Uses a derived dataset (`gsm8k_math_with_formatting_prompts.yaml`) to add specific system prompts to the base `gsm8k` dataset.
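To make the division of labor concrete, here is a simplified, self-contained sketch of how such reward functions and their averaging could look. This is not the code in `main.py`; the signatures, regex, and number-extraction logic are all assumptions:

```python
import re

# Hypothetical, simplified stand-ins for the functions named above;
# the real implementations live in examples/math_with_formatting/main.py.
THINK_ANSWER = re.compile(
    r"^\s*<think>.*?</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)

def format_reward_fn(response: str) -> float:
    """1.0 if the response matches <think>...</think><answer>...</answer>, else 0.0."""
    return 1.0 if THINK_ANSWER.match(response) else 0.0

def accuracy_reward_fn(response: str, ground_truth: str) -> float:
    """1.0 if the last number in the <answer> block equals the ground truth."""
    match = THINK_ANSWER.match(response)
    if not match:
        return 0.0
    # Take the last number-like token in the <answer> block, ignoring commas.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", match.group(1).replace(",", ""))
    if not numbers:
        return 0.0
    try:
        return 1.0 if float(numbers[-1]) == float(ground_truth) else 0.0
    except ValueError:
        return 0.0

def evaluate(response: str, ground_truth: str) -> float:
    """Average the two rewards, matching the combined score described above."""
    return (accuracy_reward_fn(response, ground_truth) + format_reward_fn(response)) / 2.0

# Example: a well-formatted, correct response scores 1.0.
resp = "<think>6 * 7 = 42</think><answer>42</answer>"
print(evaluate(resp, "42"))  # 1.0
```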