Modern AI apps rarely run on a single model forever. Teams iterate, swap providers, and increasingly run open-source models locally for privacy, latency, and cost. This post shows how to use Fireworks Eval Protocol to do robust model picking and how to host models locally with Ollama so you can replace OpenAI usage at scale—without rewriting your app logic.
We'll walk through two real examples in this repo:

- Model picking: run the same tasks against OpenAI and Ollama-hosted models and compare them head-to-head with an LLM judge.
- Langfuse traces: convert existing Langfuse traces into evaluation rows and judge them, swapping in a local model with a small config change.
The core idea: keep your evaluation harness the same; only swap the model backend using an OpenAI-compatible endpoint (Ollama).
Model selection is driven entirely by `completion_params` (model name, provider, base_url, etc.), as in the sketch below.
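A minimal sketch of that shape, using only the key names described above (the exact schema in the repo may differ):

```python
# Sketch of completion_params: one dict per model under evaluation.
# Key names follow the description above; the repo's schema may differ.
completion_params = [
    {
        "model": "gpt-4o-mini",   # hosted OpenAI baseline
        "provider": "openai",
    },
    # Add more entries here to compare additional models head-to-head.
]
```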
In this example, the agent is created from `completion_params` and automatically supports either OpenAI or Ollama, depending on `provider` and `base_url`, as sketched below.
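A minimal sketch of that dispatch, assuming the OpenAI Python client and Ollama's OpenAI-compatible endpoint; `make_client` and `run_agent` are illustrative names, not the repo's actual functions:

```python
from openai import OpenAI

def make_client(params: dict) -> OpenAI:
    """Build a client for either backend from one completion_params entry."""
    if params.get("provider") == "ollama":
        # Ollama serves an OpenAI-compatible API, typically at this URL.
        return OpenAI(
            base_url=params.get("base_url", "http://localhost:11434/v1"),
            api_key="ollama",  # Ollama ignores the key, but the client requires one
        )
    return OpenAI()  # reads OPENAI_API_KEY from the environment

def run_agent(params: dict, messages: list[dict]) -> str:
    client = make_client(params)
    resp = client.chat.completions.create(model=params["model"], messages=messages)
    return resp.choices[0].message.content
```

Because both backends speak the same chat-completions protocol, nothing upstream of this function needs to know which one is in use.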
To add a local OSS model (via Ollama), include an entry like the one below in completion_params alongside any OpenAI models you want to compare against:
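For instance (assuming you have already pulled the model with `ollama pull qwen3:8b`):

```python
# An Ollama-backed entry; base_url points at Ollama's OpenAI-compatible API.
{
    "model": "qwen3:8b",
    "provider": "ollama",
    "base_url": "http://localhost:11434/v1",
}
```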
Run the evaluation locally:
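A plausible invocation, assuming the eval is implemented as a pytest test (the file path here is illustrative):

```bash
ollama serve &                            # start Ollama if it isn't already running
ollama pull qwen3:8b                      # fetch the local candidate model
pytest tests/test_model_comparison.py -v  # illustrative path to the eval file
```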
The harness executes multiple runs, collects assistant outputs, and scores them with an LLM judge, so you can compare models head-to-head with the same prompts and tasks.
Running the evaluation shows that qwen3:8b is actually better than gpt-4o-mini out of the box, so for this task, if I were using gpt-4o-mini, I could now seamlessly switch to qwen3:8b running locally. It is still clearly worse than gpt-5-nano and other, larger remote models. Unfortunately, qwen3:4b and granite4:micro still do badly on these tasks, and we really hope IBM puts out better small models for us to experiment with in the future.
Already logging to Langfuse? You can convert those traces into evaluation rows and judge them, then point the judge at a local model by changing only `completion_params`.
The data loader pulls traces from Langfuse and reshapes them into evaluation rows.
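A sketch using the Langfuse v2 Python SDK; `fetch_traces` is its pagination helper, and the row shape below is illustrative rather than the repo's exact format:

```python
from langfuse import Langfuse  # pip install langfuse

# Reads LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST from the environment.
langfuse = Langfuse()

# Pull recent traces and flatten them into evaluation rows.
traces = langfuse.fetch_traces(limit=50).data
rows = [
    {"trace_id": t.id, "input": t.input, "output": t.output}
    for t in traces
]
```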
Swap in a local model by parametrizing with an Ollama-backed entry:
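For example, pairing the current hosted model with a local candidate so both are judged on the same traces (key names mirror the entry shown earlier; the exact schema may differ):

```python
completion_params = [
    {"model": "gpt-4o-mini", "provider": "openai"},  # current hosted model
    {
        "model": "qwen3:8b",                         # local replacement candidate
        "provider": "ollama",
        "base_url": "http://localhost:11434/v1",
    },
]
```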
The evaluation harness remains unchanged; the judge remains unchanged; your traces remain the same. Only the model backend switches, making it safe and fast to validate replacement candidates.
Fireworks Eval Protocol lets you make data-driven model choices, test replacements quickly, and migrate from OpenAI to local open-source models on Ollama—all with minimal code changes. Keep your evaluation and logging workflows intact; just point the model at a different backend.