Quick reference for all reinforcement fine-tuning parameters. Most experiments converge with the defaults below. Change them only when you have a clear hypothesis.
For guidance on when to change these parameters, see the Parameter Tuning guide.

Training parameters

| Flag | Default | Valid range | When to change |
| --- | --- | --- | --- |
| `--epochs` | 1 | 1 – 10 (whole numbers only) | Add 1–2 more passes if the reward is still climbing steadily near the end of training. Too many epochs risk over-fitting. |
| `--batch-size` | 32k tokens | Hardware-bounded | Lower it if you hit OOM; raise it only when GPUs have > 30 % headroom. |
| `--learning-rate` | 1e-4 | 1e-5 – 5e-4 | Decrease it when the reward spikes then collapses; increase it when the curve plateaus too early. |
| `--lora-rank` | 8 | 4 – 128 (powers of 2) | Higher ranks give more capacity but need more GPU memory; stay ≤ 64 unless you have high-end GPUs. |
| `--max-context-length` | 8192 tokens | Up to the model limit | Raise it only when your prompts truncate; attention compute grows quadratically with sequence length. |

Example usage

eval-protocol create rft \
  --base-model accounts/fireworks/models/llama-v3p1-8b-instruct \
  --output-model my-rft-model \
  --epochs 3 \
  --learning-rate 1e-4 \
  --lora-rank 16 \
  --max-context-length 16384
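
The `--lora-rank` row trades capacity against GPU memory. As a rough illustration only (the layer shapes below are assumptions for an 8B-class Llama model, not values reported by the platform), the adapter's trainable-parameter count grows linearly with rank:

```python
# Back-of-the-envelope estimate of LoRA adapter size vs. rank.
# Assumed, illustrative shapes for an 8B-class Llama model; these are
# not queried from the API and may differ from the deployed checkpoint.
HIDDEN = 4096                      # model hidden size (assumed)
N_LAYERS = 32                      # transformer blocks (assumed)
ADAPTED_MATRICES_PER_LAYER = 4     # assume LoRA on the four attention projections

def lora_params(rank: int) -> int:
    """Each adapted d x d matrix gains two low-rank factors: d x r and r x d."""
    per_matrix = 2 * HIDDEN * rank
    return N_LAYERS * ADAPTED_MATRICES_PER_LAYER * per_matrix

for rank in (4, 8, 16, 64, 128):
    params = lora_params(rank)
    # ~2 bytes per parameter in bf16; optimizer state adds more on top.
    print(f"rank {rank:>3}: ~{params / 1e6:.1f}M trainable params, "
          f"~{params * 2 / 1e6:.0f} MB in bf16 (before optimizer state)")
```

The point is only that adapter size scales linearly with rank, so doubling `--lora-rank` roughly doubles the extra memory the adapter and its optimizer state consume.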

Rollout (sampling) parameters

During each training step, the model generates multiple responses with stochastic decoding. These parameters control that generation process.
| Field | CLI flag | Default | Recommended range | Why it matters |
| --- | --- | --- | --- | --- |
| Maximum tokens | `--inference-max-tokens` | 2048 | 16 – 16 384 | Longer responses improve reward on summarisation / story tasks but add cost. |
| Temperature | `--inference-temperature` | 0.7 | 0.1 – 2.0 (> 0 only) | Values below 0.1 converge towards greedy decoding and kill exploration; 0.5 – 1.0 is a sweet spot for RLHF. |
| Top-p | `--inference-top-p` | 1.0 | 0 – 1 | Lower it to 0.2 – 0.5 to clamp long-tail tokens when the reward penalises hallucinations. |
| Top-k | `--inference-top-k` | 40 | 0 – 100 (0 = off) | Combine with temperature for more creative exploration; keep it ≤ 50 for latency. |
| n (choices) | `--inference-n` | 4 | 2 – 8 | Policy optimization needs multiple candidates to compute a meaningful KL term; ≥ 2 is mandatory. |
| Extra body JSON | `--inference-extra-body` | empty | valid JSON | Passes extra OpenAI-style params (e.g. `stop`, `logit_bias`). Invalid JSON is rejected. |

Example usage

eval-protocol create rft \
  --base-model accounts/fireworks/models/llama-v3p1-8b-instruct \
  --output-model my-model \
  --inference-max-tokens 1024 \
  --inference-temperature 0.8 \
  --inference-top-p 0.9 \
  --inference-top-k 40 \
  --inference-n 6 \
  --inference-extra-body '{"stop":["\n\n"]}'
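
The rollout flags map onto standard decoding transforms. The sketch below is a toy NumPy illustration of how temperature, top-k, and top-p reshape a next-token distribution; it mirrors the textbook definitions of these parameters, not the service's actual sampler.

```python
import numpy as np

def rollout_distribution(logits, temperature=0.7, top_k=40, top_p=1.0):
    """Toy illustration of how the rollout flags reshape next-token probabilities."""
    assert temperature > 0, "temperature must be > 0 for rollouts"
    logits = np.asarray(logits, dtype=np.float64) / temperature

    # Top-k: keep only the k highest-scoring tokens (0 disables the filter).
    if 0 < top_k < logits.size:
        cutoff = np.sort(logits)[-top_k]
        logits = np.where(logits < cutoff, -np.inf, logits)

    # Softmax over the surviving tokens.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Top-p (nucleus): keep the smallest set of tokens whose mass reaches top_p.
    if top_p < 1.0:
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        keep = order[: np.searchsorted(cumulative, top_p) + 1]
        mask = np.zeros_like(probs)
        mask[keep] = probs[keep]
        probs = mask / mask.sum()

    return probs

toy_logits = [4.0, 3.5, 2.0, 0.5, -1.0]
print(rollout_distribution(toy_logits, temperature=0.8, top_k=40, top_p=0.9))
print(rollout_distribution(toy_logits, temperature=0.2))  # sharper, near-greedy
```

Pushing temperature towards 0 collapses the distribution onto a single token, which is exactly why the temperature constraint further down requires a strictly positive value.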

Quick reference by goal

| Goal | Parameters to adjust |
| --- | --- |
| Faster convergence | `epochs`; nudge `learning-rate` up, staying under 2× the default |
| Safer / less toxic | `temperature`, `top_p`, `top_k` |
| More creative | `temperature` ≈ 1 – 1.2, `top_p` 0.9 |
| Cheaper rollouts | `n`, `max_tokens`, batch size |
| Higher capacity | `lora-rank`, but monitor memory usage |
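
For the cheaper-rollouts row, the dominant cost driver is simply how many tokens you sample per prompt. A back-of-the-envelope estimate (the prompt count below is a placeholder, not a platform figure or price):

```python
# Rough generation-volume estimate for one pass over a dataset.
# All inputs are illustrative placeholders, not Fireworks pricing or limits.
prompts = 5_000          # prompts in the training set (assumed)
n_choices = 4            # --inference-n
max_tokens = 2_048       # --inference-max-tokens (worst case per response)

tokens_per_epoch = prompts * n_choices * max_tokens
print(f"Up to {tokens_per_epoch / 1e6:.0f}M generated tokens per epoch")

# Halving n or max_tokens halves the rollout volume; the two multiply together.
print(f"With n=2 and max_tokens=1024: {prompts * 2 * 1024 / 1e6:.0f}M tokens per epoch")
```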

Important constraints

Temperature must be > 0

Greedy sampling (temperature 0) is deterministic and collapses exploration, often leading to mode-dropping and repetitive text.

At least 2 rollouts required

Policy optimization needs multiple candidates per prompt to compute a meaningful KL divergence term. Setting --inference-n 1 will fail.
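
As one illustration of why a single candidate gives no signal, many policy-optimization recipes score each response against its siblings from the same prompt. The sketch below uses a GRPO-style group baseline purely as an example; it is not necessarily the exact objective the service uses. With one sample the comparison degenerates:

```python
import statistics

def group_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: each response is scored against its siblings.

    Generic GRPO-style illustration only, not the service's exact objective.
    """
    baseline = statistics.mean(rewards)
    spread = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - baseline) / spread for r in rewards]

print(group_advantages([0.2, 0.9, 0.4, 0.7]))  # n = 4: informative signal
print(group_advantages([0.9]))                 # n = 1: always [0.0], no gradient
```

Whatever the exact objective, comparing candidates for the same prompt is what turns raw rewards into a training signal, hence the hard requirement of at least two rollouts.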

Range enforcement

The UI and CLI enforce the ranges shown above. Out-of-bound values throw an Invalid rollout parameters error immediately, so no GPU hours are wasted on a misconfigured job.
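
If you assemble job configurations programmatically, a cheap client-side pre-flight check against the documented ranges can catch mistakes before submission. This is a minimal sketch based only on the ranges in the tables above; the authoritative validation still happens server-side.

```python
# Pre-flight sanity check mirroring the documented rollout ranges.
# The authoritative validation happens in the UI/CLI and server-side;
# this only catches obvious mistakes before a job is submitted.
ROLLOUT_RANGES = {
    "inference_max_tokens": (16, 16_384),
    "inference_temperature": (0.1, 2.0),
    "inference_top_p": (0.0, 1.0),
    "inference_top_k": (0, 100),
    "inference_n": (2, 8),
}

def check_rollout_params(params: dict) -> list[str]:
    """Return a list of violations; an empty list means the values look in range."""
    errors = []
    for name, (low, high) in ROLLOUT_RANGES.items():
        if name in params and not (low <= params[name] <= high):
            errors.append(f"{name}={params[name]} outside [{low}, {high}]")
    return errors

print(check_rollout_params({"inference_n": 1, "inference_temperature": 0.7}))
# -> ['inference_n=1 outside [2, 8]']
```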

Next steps