In this quickstart, you’ll train a small language model to solve mathematical reasoning problems from the GSM8K (Grade School Math 8K) dataset.

What you’ll learn

  • How to set up and run a math evaluation using the Eval Protocol SDK
  • How to launch an RFT job from the command line
  • How to monitor training progress and evaluate accuracy improvements
Prefer a notebook experience? You can also run this tutorial in Google Colab. Note that Colab requires billing enabled on your Google account.

Prerequisites

  • Python 3.10+
  • A Fireworks API key with permissions to launch RFT jobs (stored in your shell or .env)
  • Command-line access (terminal or shell)

1. Install dependencies

Install the latest eval-protocol SDK directly from the main branch and make sure pytest is on your PATH.
python -m pip install --upgrade pip
python -m pip install pytest git+https://github.com/eval-protocol/python-sdk.git
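If you want to confirm the install before moving on, you can print the SDK version. This is an optional sanity check; the distribution name eval-protocol is assumed here, so adjust it if your environment reports the package differently.
import eval_protocol  # fails loudly if the install did not succeed
from importlib.metadata import version

print("eval-protocol version:", version("eval-protocol"))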

2. Download the evaluator and dataset

Run this Python script to download two files from the Eval Protocol repository into a local gsm8k_artifacts/ folder:
  • Test script (test_pytest_math_example.py): Defines how to evaluate math answers
  • Sample dataset (gsm8k_sample.jsonl): Contains example math problems to test on
tutorial/download_gsm8k_assets.py
from pathlib import Path
import requests

ARTIFACT_ROOT = Path("gsm8k_artifacts")
TEST_PATH = ARTIFACT_ROOT / "tests" / "pytest" / "gsm8k" / "test_pytest_math_example.py"
DATASET_PATH = ARTIFACT_ROOT / "development" / "gsm8k_sample.jsonl"

files_to_download = {
    TEST_PATH: "https://raw.githubusercontent.com/eval-protocol/python-sdk/main/tests/pytest/gsm8k/test_pytest_math_example.py",
    DATASET_PATH: "https://raw.githubusercontent.com/eval-protocol/python-sdk/main/development/gsm8k_sample.jsonl",
}

for local_path, url in files_to_download.items():
    local_path.parent.mkdir(parents=True, exist_ok=True)
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    local_path.write_bytes(response.content)
    print(f"Saved {url} -> {local_path}")
Expected output:
Saved https://raw.githubusercontent.com/.../test_pytest_math_example.py -> gsm8k_artifacts/tests/pytest/gsm8k/test_pytest_math_example.py
Saved https://raw.githubusercontent.com/.../gsm8k_sample.jsonl -> gsm8k_artifacts/development/gsm8k_sample.jsonl
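To see what a record looks like before running the evaluation, you can print the first line of the downloaded dataset. This snippet only inspects the file, so it works regardless of the exact field names:
import json
from pathlib import Path

# Print the first record of the sample dataset to inspect its schema.
dataset_path = Path("gsm8k_artifacts/development/gsm8k_sample.jsonl")
with dataset_path.open() as f:
    first_record = json.loads(f.readline())
print(json.dumps(first_record, indent=2))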

3. Run the evaluation

First, start the local UI server to view evaluation results. Open a terminal and run:
ep logs
This will start a local server and should automatically open a browser window at http://localhost:8000. Keep this terminal running. Then, in a new terminal, run the test script to evaluate the model on sample math problems:
cd gsm8k_artifacts
pytest -q tests/pytest/gsm8k/test_pytest_math_example.py::test_math_dataset -s
As the test runs, you’ll see evaluation results appear in the browser showing detailed logs for each problem the model attempts. The pytest script will also register your evaluator and dataset with Fireworks automatically, so you can use them in the next step for RFT.
[Screenshot: GSM8K evaluation UI showing model scores and trajectories]
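Under the hood, the test checks whether the model's final numeric answer matches the GSM8K ground truth. The actual scoring logic lives in test_pytest_math_example.py; the following is a simplified stand-in that illustrates the idea, not the SDK's implementation:
import re

# Simplified GSM8K-style check: pull the final number out of a completion
# and compare it to the reference answer. GSM8K ground truths end with
# "#### <number>"; for model output we fall back to the last number found.
def extract_final_number(text: str) -> str | None:
    match = re.search(r"####\s*(-?[\d,]+(?:\.\d+)?)", text)
    if match:
        raw = match.group(1)
    else:
        numbers = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
        raw = numbers[-1] if numbers else None
    return raw.replace(",", "") if raw else None

def score(model_output: str, ground_truth: str) -> float:
    # 1.0 for an exact numeric match, 0.0 otherwise.
    return 1.0 if extract_final_number(model_output) == extract_final_number(ground_truth) else 0.0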

4. Start training

First, set your Fireworks API key so the Fireworks CLI can authenticate you:
export FIREWORKS_API_KEY="<your-fireworks-key>"
Next, launch the RFT job with a small base model (qwen3-0p6b) to keep training fast and inexpensive. Because your evaluator and dataset were already registered with Fireworks in the previous step, you don’t need to specify them again here.
cd ..
eval-protocol create rft \
    --base-model accounts/fireworks/models/qwen3-0p6b
The CLI will output dashboard links where you can monitor your training job in real-time.
[Screenshot: GSM8K evaluation score showing upward trajectory]
You can also store your API key in a .env file instead of exporting it each session.
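For example, with the python-dotenv package installed (an assumption; any .env loader works), a small check like this confirms the key is picked up:
# Load FIREWORKS_API_KEY from a .env file in the working directory.
# Assumes python-dotenv is installed: python -m pip install python-dotenv
import os
from dotenv import load_dotenv

load_dotenv()
if not os.environ.get("FIREWORKS_API_KEY"):
    raise SystemExit("FIREWORKS_API_KEY is not set; add it to .env or export it.")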

Monitor your training progress

Your RFT job is now running. You can monitor progress in the dashboard links provided by the CLI output.
Re-run the pytest evaluation command to measure your model’s performance on new checkpoints:
cd gsm8k_artifacts
pytest -q tests/pytest/gsm8k/test_pytest_math_example.py::test_math_dataset -s
This helps you see how your model’s accuracy improves over time and decide when to stop training.
You can adjust the evaluation logic to better fit your needs:
  • Modify reward shaping: Edit the scoring logic in test_pytest_math_example.py to match your answer format expectations (see the sketch after this list)
  • Use your own data: Replace the sample dataset by either editing the JSONL file locally or passing --dataset-jsonl when creating the RFT job
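As an example of reward shaping, the exact-match check could be relaxed to give full credit for answers within a small numeric tolerance. This is an illustrative sketch only; the function and variable names in the real test file will differ:
import math

# Hypothetical tolerant reward: full credit when the predicted number is
# within a small relative tolerance of the expected one, zero otherwise.
def tolerant_score(predicted: float, expected: float, rel_tol: float = 1e-4) -> float:
    return 1.0 if math.isclose(predicted, expected, rel_tol=rel_tol) else 0.0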

What’s happening behind the scenes

Understanding the training workflow:
  1. Evaluation registration: The pytest script evaluates a small GSM8K subset using numeric answer checking, then automatically registers both your evaluator and dataset with Fireworks
  2. RFT job creation: The create rft command connects your registered evaluator and dataset to a Reinforcement Fine-Tuning job for your chosen base model
  3. Continuous improvement: As training progresses, evaluation scores on the held-out set reflect improved accuracy, allowing you to iterate quickly before scaling to larger experiments

Next steps