LLM on the edge: Model picking with Fireworks Eval Protocol + Ollama

Modern AI apps rarely run on a single model forever. Teams iterate, swap providers, and increasingly run open-source models locally for privacy, latency, and cost. This post shows how to use Fireworks Eval Protocol to do robust model picking and how to host models locally with Ollama so you can replace OpenAI usage at scale—without rewriting your app logic.

We'll walk through two real examples in this repo:

  • End-to-end agent evaluation on the Chinook dataset (PydanticAI)
  • LLM-judge over Langfuse traces you already have in production

The core idea: keep your evaluation harness the same; only swap the model backend using an OpenAI-compatible endpoint (Ollama).
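
To make that concrete, here is a minimal sketch (ours, not code from the repo or from Eval Protocol itself) using the official openai Python client: the chat-completion call stays identical, and only a small config dict changes between a hosted OpenAI model and a local Ollama model.

import os
from openai import OpenAI

# Two interchangeable backends: a hosted OpenAI model (needs a real OPENAI_API_KEY)
# and a local Ollama model (assumes you have pulled qwen3:4b, as shown below).
backends = [
    {"model": "gpt-4o-mini", "base_url": None, "api_key": os.getenv("OPENAI_API_KEY")},
    {"model": "qwen3:4b", "base_url": "http://localhost:11434/v1", "api_key": "dummy"},
]

for cfg in backends:
    client = OpenAI(base_url=cfg["base_url"], api_key=cfg["api_key"])
    resp = client.chat.completions.create(
        model=cfg["model"],
        messages=[{"role": "user", "content": "Say hello in one word."}],
    )
    print(cfg["model"], "->", resp.choices[0].message.content)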

Why this approach works

  • Standard interface: Eval Protocol treats models as swappable via completion_params (model name, provider, base_url, etc.).
  • OpenAI-compatible: Ollama exposes an OpenAI-style API locally, so clients keep working with only config changes.
  • Evidence-based model picking: Run apples-to-apples comparisons across datasets and traces. Keep the judge and scoring constant while changing only the model backend.

Prerequisites

  • This repo set up locally (Python environment ready)
  • macOS or Linux
  • Ollama installed and running

Install and run Ollama

brew install ollama
ollama serve & disown
# Pull a model to evaluate locally (you can choose another, e.g. llama3.1)
ollama pull qwen3:4b
# Point OpenAI-compatible clients at the local Ollama server
export OLLAMA_OPENAI_BASE_URL=http://localhost:11434/v1
# Many OpenAI clients require a key even if unused by Ollama
export OPENAI_API_KEY=dummy
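
Before running a full evaluation, it's worth sanity-checking the local endpoint. The short sketch below (ours, not from the repo) uses the openai Python client to list the models Ollama exposes on its OpenAI-compatible API; the freshly pulled qwen3:4b should appear.

import os
from openai import OpenAI

# Point the client at the local Ollama server using the env vars exported above.
client = OpenAI(
    base_url=os.getenv("OLLAMA_OPENAI_BASE_URL", "http://localhost:11434/v1"),
    api_key=os.getenv("OPENAI_API_KEY", "dummy"),
)

# /v1/models is part of Ollama's OpenAI-compatible surface; each entry has an id
# like "qwen3:4b".
for m in client.models.list():
    print(m.id)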

Example 1: End-to-end agent evaluation (Chinook + PydanticAI)

In this example, the agent is created from completion_params and automatically supports either OpenAI or Ollama depending on provider and base_url.

def agent_factory(config: RolloutProcessorConfig) -> Agent:
    model_name = config.completion_params["model"]
    provider_param = config.completion_params.get("provider")
    reasoning = config.completion_params.get("reasoning")
    settings = OpenAIChatModelSettings(openai_reasoning_effort=reasoning)
    base_url = config.completion_params.get("base_url")
    api_key = config.completion_params.get("api_key") or os.getenv("OPENAI_API_KEY") or "dummy"
    if base_url or provider_param == "ollama":
        # Local OSS model: route the OpenAI-compatible client to the Ollama server
        provider = OpenAIProvider(
            api_key=api_key,
            base_url=base_url or os.getenv("OLLAMA_OPENAI_BASE_URL", "http://localhost:11434/v1"),
        )
    else:
        # Hosted model: fall back to the named provider (defaults to OpenAI)
        provider = provider_param or "openai"
    model = OpenAIChatModel(model_name, provider=provider, settings=settings)
    return setup_agent(model)

To add a local OSS model (via Ollama), include an entry like the one below in completion_params alongside any OpenAI models you want to compare against:

{"model": "gpt-5-nano-2025-08-07"},
{"model": "qwen3:8b", "provider": "ollama", "base_url": os.getenv("OLLAMA_OPENAI_BASE_URL", "<http://localhost:11434/v1>")},
{"model": "qwen3:4b", "provider": "ollama", "base_url": os.getenv("OLLAMA_OPENAI_BASE_URL", "<http://localhost:11434/v1>")},

Run the evaluation locally:

export OLLAMA_OPENAI_BASE_URL=http://localhost:11434/v1
export OPENAI_API_KEY=dummy
pytest tests/chinook/pydantic/test_pydantic_complex_queries.py -q

The harness executes multiple runs, collects assistant outputs, and scores them with an LLM judge, so you can compare models head-to-head with the same prompts and tasks.
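
If you haven't written an LLM judge before, the sketch below shows the general shape of the idea. It is an illustrative toy, not the judge prompt or scoring code from this repo, and the judge model name is an assumption.

from openai import OpenAI

# Toy judge: ask a judge model to grade one assistant answer and parse a numeric
# score from its reply. Point base_url at Ollama to run the judge locally instead.
client = OpenAI()

def judge_answer(question: str, answer: str) -> float:
    prompt = (
        "You are grading an assistant's answer for correctness.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Reply with only a number between 0 and 1."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any sufficiently capable judge model works
        messages=[{"role": "user", "content": prompt}],
    )
    content = resp.choices[0].message.content or ""
    try:
        return float(content.strip())
    except ValueError:
        return 0.0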

From running the evaluation, we can see that qwen3:8b actually beats gpt-4o-mini out of the box, so if you were using gpt-4o-mini for this task you can now transition seamlessly to a local qwen3:8b. It still trails gpt-5-nano and other, larger hosted models. Unfortunately, qwen3:4b and granite4:micro still do very badly on these tasks, and we hope IBM puts out better small models for us to experiment with in the future.

Example 2: Judge Langfuse traces and swap the model to Ollama

Already logging to Langfuse? You can convert those traces into evaluation rows and judge them—then point the judge at a local model by changing only completion_params.

The data loader pulls traces:

def langfuse_data_generator():
    adapter = create_langfuse_adapter()
    return adapter.get_evaluation_rows(
        to_timestamp=datetime(2025, 9, 12, 0, 11, 18),
        limit=711,
        sample_size=50,
        sleep_between_gets=3.0,
        max_retries=5,
    )

Swap in a local model by parametrizing with an Ollama-backed entry:

@pytest.mark.parametrize(
    "completion_params",
    [
        {
            "model": "qwen3:4b",
            "provider": "ollama",
            "base_url": os.getenv("OLLAMA_OPENAI_BASE_URL", "http://localhost:11434/v1"),
        }
    ],
)

The evaluation harness remains unchanged; the judge remains unchanged; your traces remain the same. Only the model backend switches, making it safe and fast to validate replacement candidates.

Takeaway

Fireworks Eval Protocol lets you make data-driven model choices, test replacements quickly, and migrate from OpenAI to local open-source models on Ollama—all with minimal code changes. Keep your evaluation and logging workflows intact; just point the model at a different backend.