Overview
Reinforcement fine-tuning uses two categories of parameters to control model training: training parameters that govern how the model learns, and rollout (sampling) parameters that control how the model generates responses during training. Most experiments converge well with the default values. Adjust parameters only when you have a clear hypothesis based on your training metrics and reward curves.
Training Parameters
Core parameters that control how your model learns during the training process.
Learning Rate
What it does: Controls how aggressively the model updates its weights during each training step. Think of it as the “step size” when descending the loss landscape.
Default: 1e-4 (0.0001)
Valid range: 1e-5 to 5e-4
How it affects outcome:
- Too high → Unstable training where reward spikes briefly then collapses as the model overshoots optimal weights.
- Too low → Painfully slow convergence. The reward curve plateaus too early before reaching optimal performance.
- Just right → Steady, consistent reward improvement throughout training.
- Decrease when you see reward spikes followed by crashes in your training metrics
- Increase when the reward curve plateaus too early and stops improving
- Keep changes within 2× of the default value
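For instance, a minimal sketch of what “within 2× of the default” looks like in practice (the values are illustrative, not a prescribed sweep):
```python
# Illustrative only: candidate learning rates within 2x of the default,
# kept inside the documented valid range of 1e-5 to 5e-4.
DEFAULT_LR = 1e-4
LR_MIN, LR_MAX = 1e-5, 5e-4

candidates = [DEFAULT_LR / 2, DEFAULT_LR, DEFAULT_LR * 2]
candidates = [lr for lr in candidates if LR_MIN <= lr <= LR_MAX]
print(candidates)  # [5e-05, 0.0001, 0.0002]
```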
Epochs
What it does: The number of complete passes through your training dataset. Each epoch processes every example once.
Default: 1
Valid range: 1 to 10 (whole numbers only)
How it affects outcome:
- Too few → The model hasn’t had enough exposure to learn patterns from your data
- Too many → Overfitting risk where the model memorizes the training set instead of generalizing
- Just right → Reward curve shows steady improvement and plateaus near the end of training
- Add 1-2 more epochs if the reward is still climbing steadily at the end of training
- Keep at 1 for most tasks—the default works well
- Watch your reward curves to detect when adding more epochs stops helping
LoRA Rank
What it does: Controls the number of trainable parameters in your LoRA adapter. LoRA (Low-Rank Adaptation) adds small adapter layers to the base model rather than training all weights. Higher rank means more capacity to learn new behaviors.
Default: 8
Valid range: 4 to 128 (must be powers of 2: 4, 8, 16, 32, 64, 128)
How it affects outcome:
- Lower rank (4-8) → Faster training, less GPU usage, but may lack capacity for complex tasks
- Higher rank (32-128) → More learning capacity, but requires significantly more GPUs and risks overfitting
- Just right (8-16) → Balances capacity and efficiency for most tasks
- Increase for complex reasoning tasks or when the model struggles to learn desired behaviors
- Keep ≤64 unless you have high-end GPUs
- Consider task complexity: simple style changes need lower rank, complex reasoning needs higher
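For a rough intuition of how rank translates into trainable parameters, here is a back-of-the-envelope sketch. It assumes adapters on four hidden_dim × hidden_dim projection matrices per layer; real adapter placement varies by model, so treat the numbers as estimates only.
```python
# Rough sketch of how LoRA rank drives trainable-parameter count. Assumes
# adapters on four (hidden_dim x hidden_dim) projection matrices per layer;
# real adapter placement varies by model, so these are estimates.
def lora_trainable_params(rank: int, hidden_dim: int, num_layers: int,
                          matrices_per_layer: int = 4) -> int:
    params_per_matrix = 2 * rank * hidden_dim  # A (d x r) + B (r x d)
    return params_per_matrix * matrices_per_layer * num_layers

# e.g. a 7B-class model with hidden_dim=4096 and 32 layers:
print(lora_trainable_params(rank=8, hidden_dim=4096, num_layers=32))   # ~8.4M
print(lora_trainable_params(rank=64, hidden_dim=4096, num_layers=32))  # ~67M
```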
Batch Size
What it does: The amount of data (measured in tokens) processed in each training step before updating model weights.
Default: 32k tokens
Valid range: Hardware-dependent
How it affects outcome:
Unlike traditional batch sizes that count sequences (e.g., 32 or 64 sequences), Fireworks RFT uses token-based batch sizing. For example, with an 8k max sequence length, a 64k batch size allows up to 8 sequences per batch (64k tokens ÷ 8k tokens/sequence = 8 sequences).
- Smaller batches → Noisier gradient updates that may help exploration, but slower training
- Larger batches → Smoother, more stable updates and faster training throughput, but requires more GPU memory
- Just right → Maximizes GPU utilization without running out of memory
- Decrease when you hit out-of-memory (OOM) errors
- Increase only when GPUs show >30% memory headroom and you want faster training
- Most users should stick with the default
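A minimal sketch of the token-based batch arithmetic described above:
```python
# How many full-length sequences fit in one batch, given a token budget
# and a max sequence length. Illustrative only.
def max_sequences_per_batch(batch_tokens: int, max_seq_len: int) -> int:
    return batch_tokens // max_seq_len

print(max_sequences_per_batch(64_000, 8_000))  # 8, as in the example above
print(max_sequences_per_batch(32_000, 8_000))  # 4 with the 32k default
```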
Rollout (Sampling) Parameters
Parameters that control how the model generates responses during training rollouts.
Temperature
What it does: Controls the randomness of the model’s token selection during generation. Higher temperature = more random/creative, lower = more deterministic/focused.
Default: 0.7
Valid range: 0.1 to 2.0 (must be >0)
How it affects outcome:
- 0.0-0.1 (near-greedy) → Deterministic outputs with no exploration. Leads to mode collapse and repetitive text. Avoid in RFT.
- 0.5-1.0 (sweet spot) → Good balance of exploration and coherence. Ideal for most RLHF applications.
- >1.2 (high randomness) → Very creative but potentially incoherent outputs
- Lower (0.3-0.5) for tasks requiring precision, factual accuracy, or safety (less toxic outputs)
- Raise (1.0-1.2) for creative tasks like story generation or when you need more diverse rollout exploration
- Never use 0.0—greedy sampling breaks RFT by eliminating exploration
Top-p (Nucleus Sampling)
What it does: Dynamically limits token sampling to the smallest set of tokens whose cumulative probability exceeds threshold p. Only the most probable tokens that together account for probability mass p are considered.
Default: 1.0 (considers all tokens)
Valid range: 0 to 1
How it affects outcome:
- Lower values (0.2-0.5) filter out long-tail, low-probability tokens that often cause hallucinations
- Higher values (0.9-1.0) allow more diversity in outputs
- Prevents the model from selecting very unlikely tokens that may be nonsensical
- Lower to 0.2-0.5 when your reward function penalizes hallucinations or factual errors
- Keep at 0.9-1.0 for creative tasks that benefit from diverse vocabulary
- Works well in combination with temperature for fine-grained control
Top-k
What it does: Limits sampling to only the K most probable tokens at each step. A fixed-size cutoff (unlike top-p, which is dynamic).
Default: 40
Valid range: 0 to 100 (0 = disabled)
How it affects outcome:
- Similar to top-p but uses a fixed number of candidates instead of a probability threshold
- Lower k = more focused, less diverse outputs
- Higher k = more exploration and creativity
- Combine with temperature (e.g., temp 0.8 + top-k 40) for balanced creative exploration
- Keep ≤50 to maintain reasonable inference latency
- Consider using top-p instead for most use cases—it adapts better to varying probability distributions
Number of Rollouts (n)
What it does: How many different responses the model generates for each prompt during training. The policy optimization algorithm compares these candidates against each other to estimate their relative advantages and learn which responses are better.
Default: 4
Valid range: 2 to 8 (minimum 2 required)
How it affects outcome:
- n=1 → Not allowed. Policy optimization requires multiple candidates to learn from comparisons
- n=2-4 → Minimal viable exploration. Faster and cheaper but less signal for learning
- n=4-8 → Good balance of learning signal and cost for most tasks
- n>8 → Diminishing returns. Significantly slower and more expensive with marginal quality gains
- Increase to 6-8 when you need higher quality and cost isn’t a concern
- Keep at 4 for most experiments—it’s the sweet spot
- Never set to 1—this will cause training to fail
- Consider the tradeoff: more rollouts = better signal but linearly higher cost
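To make the linear cost tradeoff concrete, here is a back-of-the-envelope sketch (the prompt count and token totals are illustrative):
```python
# Back-of-the-envelope rollout cost: generated tokens scale linearly with n.
# Illustrative only; real cost also depends on prompt length and hardware.
def rollout_tokens(num_prompts: int, n_rollouts: int, max_tokens: int) -> int:
    return num_prompts * n_rollouts * max_tokens

baseline = rollout_tokens(num_prompts=1_000, n_rollouts=4, max_tokens=2048)
doubled = rollout_tokens(num_prompts=1_000, n_rollouts=8, max_tokens=2048)
print(doubled / baseline)  # 2.0 -> doubling n doubles generation cost
```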
Max Tokens
What it does: The maximum number of tokens the model can generate in a single response during rollouts.
Default: 2048
Valid range: 16 to 16384
How it affects outcome:
- Directly affects task completion: too short and the model can’t finish complex tasks
- Longer responses improve reward on summarization, story generation, and reasoning tasks
- Linearly increases training cost—every token generated costs compute
- Increase when your tasks require longer reasoning chains, detailed summaries, or complex multi-step solutions
- Decrease to reduce costs for tasks with naturally short outputs (classification, short-form Q&A)
- Monitor your reward curves: if the model is cutting off mid-response, increase max tokens
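If you want to check for mid-response cutoffs programmatically, a sketch like the following works, assuming your rollout logs expose a finish reason (the field name and values here are hypothetical, not a documented schema):
```python
# Hypothetical rollout records; the "finish_reason" field and its values are
# assumptions for illustration, not a documented Fireworks schema.
rollouts = [
    {"finish_reason": "stop"},
    {"finish_reason": "length"},  # hit the max-token cap
    {"finish_reason": "stop"},
]
truncated = sum(r["finish_reason"] == "length" for r in rollouts)
print(f"{truncated / len(rollouts):.0%} of rollouts truncated")  # 33%
# If this fraction is high, consider raising max tokens.
```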
Parameter Interactions
Parameters don’t work in isolation—they interact in important ways.
Temperature + Top-p/Top-k
These three work together to control sampling behavior. Using all three gives you fine-grained control:
- Temperature sets the overall randomness
- Top-p dynamically filters by probability mass
- Top-k sets a hard limit on candidate tokens
For example, temperature=0.8, top_p=0.9, top_k=40 gives creative but controlled outputs.
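For intuition, here is a minimal sketch of how the three filters combine at a single decoding step. It is purely illustrative; the inference engine applies these internally and you only set the parameters.
```python
import numpy as np

# Minimal sketch of temperature, top-k, and top-p at one decoding step.
def sample_token(logits: np.ndarray, temperature: float = 0.8,
                 top_k: int = 40, top_p: float = 0.9) -> int:
    logits = logits / temperature                      # temperature: overall randomness
    if top_k > 0:                                      # top-k: hard cap on candidates
        k = min(top_k, logits.size)
        kth_best = np.sort(logits)[-k]
        logits = np.where(logits < kth_best, -np.inf, logits)
    probs = np.exp(logits - logits.max())              # softmax over surviving tokens
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                    # top-p: keep the smallest set of
    cumulative = np.cumsum(probs[order])               # tokens whose cumulative prob >= p
    keep = order[: np.searchsorted(cumulative, top_p) + 1]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    filtered /= filtered.sum()
    return int(np.random.choice(len(probs), p=filtered))

# Example: a made-up 5-token vocabulary
print(sample_token(np.array([2.0, 1.5, 0.3, -1.0, -2.0])))
```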
Learning Rate + Batch Size
Larger batch sizes provide more stable gradients, which may allow for slightly higher learning rates. However, the default learning rate is tuned for the default batch size—only adjust if you have evidence from your training curves.
LoRA Rank + Model Size
Larger base models (70B+) may need higher LoRA ranks to capture complex behaviors, but they also require more resources. For smaller models (<13B), rank 8-16 is usually sufficient.
Tuning Strategies
Best practices for adjusting parameters to achieve your training goals.
Start with Defaults
The default parameters are carefully tuned to work well for most RFT tasks. Don’t change them unless you have a clear hypothesis based on your training metrics.
Run at least one baseline experiment with defaults before making any adjustments. This gives you:
- A performance benchmark to compare against
- Understanding of whether parameter tuning is actually needed
- Evidence about which metrics need improvement
One Change at a Time
When you do adjust parameters, change only one at a time and measure the impact on your reward curves and evaluation metrics.
Good workflow:
- Run baseline with defaults
- Identify specific issue (e.g., reward crashes, slow convergence)
- Change ONE parameter that should address that issue
- Compare results
- Repeat
Track Everything
Use Weights & Biases integration to:
- Compare training curves across experiments
- Track reward progression over time
- Monitor hardware utilization (memory, GPU usage)
- Log all hyperparameters automatically
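If you also want to log custom metrics yourself, the standard Weights & Biases calls look like this (a generic sketch, not Fireworks-specific; the project name and values are placeholders):
```python
import wandb

# Generic W&B logging; the Fireworks integration records hyperparameters and
# training curves automatically, so this is only for extra custom metrics.
run = wandb.init(
    project="rft-experiments",  # placeholder project name
    config={"learning_rate": 1e-4, "epochs": 1, "lora_rank": 8, "temperature": 0.7},
)
for step, reward in enumerate([0.12, 0.18, 0.25]):  # placeholder reward values
    wandb.log({"reward/mean": reward}, step=step)
run.finish()
```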
Common Patterns
Quick reference for goal-directed parameter tuning:
- Faster convergence → ↑ epochs (add 1-2), tune learning rate (stay <2× default)
- Better quality → ↑ temperature (1.0-1.2), ↑ rollouts (6-8), ↑ max tokens
- Safer/less toxic → ↓ temperature (0.3-0.5), ↓ top-p (0.5), ↓ top-k
- More creative → ↑ temperature (1.0-1.2), top-p = 0.9
- Lower cost → ↓ rollouts, ↓ max tokens, ↓ batch size
- Higher capacity → ↑ LoRA rank (16-32), but monitor memory usage
- Prevent overfitting → Keep epochs = 1, consider lower LoRA rank
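As a single reference point, here is a hypothetical baseline configuration pulling together the defaults discussed above. Field names are illustrative and do not necessarily match the Fireworks API.
```python
# Hypothetical baseline configuration using the defaults discussed above.
# Field names are illustrative, not the exact API schema.
baseline_rft_config = {
    "training": {
        "learning_rate": 1e-4,        # default
        "epochs": 1,                  # default
        "lora_rank": 8,               # default
        "batch_size_tokens": 32_000,  # default; ceiling is hardware-dependent
    },
    "rollout": {
        "temperature": 0.7,  # default
        "top_p": 1.0,        # default
        "top_k": 40,         # default
        "n": 4,              # rollouts per prompt, default
        "max_tokens": 2048,  # default
    },
}
```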