
The Benchmark Gap: What It Takes to Ship Kimi K2.5

Kimi K2.5 is live on Fireworks at roughly one-tenth the cost and 2-3x the speed of closed frontier models. As the fastest provider of this open-source model, Fireworks is seeing unprecedented adoption. Kimi K2.5 is a landmark release for open models, with benchmark results on par with top closed models and standout visual coding quality. But delivering full quality in production requires more than just hosting the model.

Here's how Fireworks ensures that developers get the best quality on our platform, and how that work plays out in specific edge cases.

Kimi K2.5 Benchmarks
[Figure: Artificial Analysis Kimi K2.5 benchmark chart]

How We Approach Quality at Fireworks

Deploying frontier open models has taught us that quality emerges or degrades in the gaps: between the model and the serving stack, and between the chat template on Hugging Face and what's actually running behind the first-party API. Our quality process is built to systematically close the gaps that benchmarks miss. We've applied this approach across multiple generations of models: we were the first to discover GPT-OSS quality bugs, and we pioneered KL divergence measurements to assess quantization quality.

What We Test

  • Prompt formatting — chat template mismatches between Hugging Face and the model provider's API
  • Tool calling — streaming, non-streaming, and grammar-constrained generation across multi-turn workflows
  • Numerical correctness — precision matching against reference implementations via generations, logprobs, and KLD (see the sketch after this list)
  • System behavior — timeouts, disconnect handling, and error codes under load
  • SDK compatibility — common client SDKs and proxy services like OpenRouter
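
As a concrete illustration of the numerical-correctness check, here is a minimal KL-divergence sketch. It assumes per-position top-k logprobs have already been collected from a reference implementation and from the serving stack under test; the truncated_kld helper and the sample values are illustrative, not our internal tooling.

import math

def truncated_kld(ref_logprobs: dict, test_logprobs: dict, floor_logprob: float = -23.0) -> float:
    """Truncated KL(ref || test) over the token ids present in the reference
    top-k distribution; tokens missing from the test side get a floor logprob."""
    kld = 0.0
    for token, ref_lp in ref_logprobs.items():
        test_lp = test_logprobs.get(token, floor_logprob)
        kld += math.exp(ref_lp) * (ref_lp - test_lp)
    return kld

# Per-position top-k logprobs from the reference run and the candidate stack (toy values).
reference = [{"Hello": -0.05, "Hi": -3.2}, {" world": -0.10, " there": -2.9}]
candidate = [{"Hello": -0.07, "Hi": -3.0}, {" world": -0.12, " there": -2.7}]

# An average near zero means the served (e.g. quantized) model closely matches the reference.
avg = sum(truncated_kld(r, c) for r, c in zip(reference, candidate)) / len(reference)
print(f"average truncated KLD per token: {avg:.6f}")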

How We Test (In Layers)

  • Deterministic unit tests for formatting, parsing, and tool handling corner cases (see the example after this list)
  • One-turn benchmarks for basic sanity checks
  • Agentic multi-turn benchmarks (where most subtle issues surface)
  • Multimodal benchmarks for models like K2.5
  • Periodic reruns of all tests to catch regressions
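
To make the first layer concrete, here is a toy deterministic test in the spirit of what we run for structured outputs and tool-argument handling. The lookup_order schema is hypothetical, and the test uses the off-the-shelf jsonschema library rather than our internal validators.

import json
import jsonschema

# Hypothetical tool-argument schema with one optional field.
LOOKUP_ORDER_SCHEMA = {
    "type": "object",
    "additionalProperties": False,
    "properties": {
        "orderId": {"type": "string"},
        "includeHistory": {"type": "boolean"},  # optional
    },
    "required": ["orderId"],
}

def test_tool_arguments_accept_any_field_order():
    # Models may emit valid JSON with fields in any order, or omit optional
    # fields entirely; enforcement must not depend on a rigid field ordering.
    emitted_variants = [
        '{"orderId": "ORD-9012", "includeHistory": true}',
        '{"includeHistory": true, "orderId": "ORD-9012"}',
        '{"orderId": "ORD-9012"}',
    ]
    for raw in emitted_variants:
        jsonschema.validate(instance=json.loads(raw), schema=LOOKUP_ORDER_SCHEMA)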

Production Readiness Lessons from Kimi K2.5

Bringing Kimi K2.5 into real-world production reinforced why comprehensive quality validation is needed beyond benchmark scores alone. The examples below highlight specific classes of issues that routinely surface when deploying frontier open models—and the engineering strategies we use to address them. Detailed technical breakdowns for each case are included in the appendix.

  1. Guardrails between reasoning and tool execution — In production settings, models can prematurely terminate during internal reasoning unless decoding enforces clear phase boundaries. For Kimi K2.5, this surfaced as occasional end-of-sequence tokens emitted mid-thought while planning tool calls. We reached out to Kimi and implemented a fix using constrained generation to prevent premature termination (similar enforcement exists in the official API).
  2. Inference settings affecting quality — Model providers prefer specific inference settings. Kimi's post-launch guidance recommends temperature and top-p values by task, and their K2VV benchmark suite treats these as immutable. Fireworks applies provider defaults whenever possible, but doesn't hard-lock these values because developers expect the flexibility to set their own.
  3. Empty reasoning blocks handling — We've seen callers explicitly pass reasoning_content: null in previous messages, which the prompt template didn't handle correctly: it inlined a literal 'none' into the prompt, confusing the model (see the example below).
{
  "model": "accounts/fireworks/models/kimi-k2p5",
  "messages": [
    { "role": "assistant", "content": "Sure.", "reasoning_content": null },
    { "role": "user", "content": "Now call the tool." }
  ],
  "tools": [ /* ... */ ]
}

  4. Launch-day overload — Under launch-day load, a gateway misconfiguration returned HTTP 200 with empty bodies instead of 429 rate-limit errors. Because the responses appeared successful, retries never triggered and users saw silent failures with no indication that anything had gone wrong.

Beyond specific errors: we see ongoing instability in model outputs when thinking and tool calling are used together. We believe this is a model-level issue (for example, the K2VV tool calling benchmark runs only in non-thinking mode) and are working with the Kimi team on potential fixes.

If You're Serving Open Models Yourself

  • Don't trust the Hugging Face chat template blindly. Compare it against the model provider's own API output before deploying (see the sketch after this list).
  • Test tool calling in multi-turn, not just single-turn. Most failures only surface after several rounds of interleaved thinking and tool use.
  • Watch for silent 200s under load. Your system can mask rate-limiting failures in ways that look like model problems.
  • Validate structured output enforcement with optional fields. Rigid schema ordering will break models that emit valid JSON in a different field order.
  • Match the provider's inference settings, or know why you're diverging. Temperature and top-p defaults aren't arbitrary; they affect output quality in ways benchmarks may not catch.
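
On the first point, a quick way to catch template drift is to render the Hugging Face chat template locally and compare the result against the provider's documented prompt format or first-party API behavior. A minimal sketch, assuming the transformers library; the repo id is a placeholder and the checks are illustrative, not exhaustive.

from transformers import AutoTokenizer

# Placeholder repo id; substitute the actual model repository you are serving.
tokenizer = AutoTokenizer.from_pretrained("moonshotai/Kimi-K2.5", trust_remote_code=True)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "assistant", "content": "Sure.", "reasoning_content": None},  # edge case from above
    {"role": "user", "content": "Now call the tool."},
]

# Render the prompt exactly as the Hugging Face template would produce it.
rendered = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(rendered)

# Look for surprises: literal "None"/"null" strings, missing role delimiters,
# or tool sections that differ from the provider's documented format.
if "None" in rendered or "null" in rendered:
    print("WARNING: a null reasoning_content leaked into the prompt as text")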

Taken together, these examples highlight a broader lesson: production quality is an end-to-end property. It emerges from the interaction between model behavior, decoding constraints, serving infrastructure, and client expectations—not from the model alone.

Kimi K2.5: Evaluation Results

Benchmarks don't tell the full story, but they remain a useful sanity check for validating production readiness. We matched the official Kimi API across K2VV results.

K2VV is Kimi's evaluation and validation suite, focused on the areas most vulnerable to infrastructure failure: tool calling correctness, structured outputs, and multimodal reasoning. We supplemented it with longer agentic evaluations to catch what the benchmarks miss.

Text Benchmarks

All evaluations run in thinking mode unless otherwise noted.

Benchmark | Kimi Model Card | Kimi Official API (Measured by FW) | Fireworks API
SWE-Bench | 76.8 (custom harness) | 76.8 | 76.8
Terminal-Bench 2 (non-thinking) | 50.8 (custom harness) | – | 50.6
Tau2-bench Airline | – | 65 | 68
AIME 2025 | 96.1 | 95.7 | 95.0
K2VV ToolCall: F1 (non-thinking) | – | 84 | 83.9


Vision Benchmarks

Benchmark | Kimi Model Card | Fireworks API
MMMU Pro Vision | 77.4 | 77.9
MMMU Pro Vision (non-thinking) | 74.7 | 74.5
OCRBench | 91 | 91.7

Note: All scores are an average of 3 runs, except Tau2-bench (average of 4 runs) and SWE-Bench (average of 2 runs).

Harnesses used for running benchmarks:

  • AIME, MMMU Pro, OCRBench, K2VV ToolCall: Kimi's official K2VV
  • AIME: Eval Protocol
  • SWE-Bench: SWE-Agent
  • Terminal-Bench 2: Terminus 2
  • Tau2-bench: Eval Protocol

What the Benchmarks Don’t Show

Some benchmarks abstract away important production considerations:

  • SWE-Bench uses a custom harness on Kimi's side, which explains part of the gap between model card and API measurements. The original SWE-Bench harness also doesn't specify max_tokens, and default values vary across providers (to simplify evaluation we increased the Fireworks default to 32k output tokens).
  • Terminal-Bench 2 forces non-thinking mode due to harness limitations, making direct comparison to thinking-mode model card numbers difficult.
  • Tau2-bench has inherent noise from its simulated customer interactions, which explains the difference between the official API and Fireworks API scores.

Conclusion: We built this process for Kimi K2.5, but it reflects how we approach every model we ship at Fireworks. As models continue to advance, the gap between benchmark performance and production reliability will only grow unless infrastructure, decoding, and evaluation evolve alongside them.

This is the work we focus on: making frontier open models dependable in real systems. If you encounter edge cases we haven’t covered, let us know—we’re committed to closing the benchmark gap together.

Appendix - Detailed Specific Cases

EOS token in thinking blocks: “it stopped mid‑thought and never called the tool”

The story

Under load, some tool-required requests would occasionally end early while the model was still “thinking”. For developers, that looked like a flaky agent: 99% of the time it called the tool; 1% of the time it returned nothing useful (or just a fragment) and never reached the tool call.

What users saw

  • Streaming: you get a few reasoning_content deltas, then the stream ends ([DONE]) without any tool_calls.
  • Non‑streaming: message.tool_calls is missing, and finish_reason is "stop" or "length".

Canonical example (streaming request)

curl -sS -N "https://api.fireworks.ai/inference/v1/chat/completions" \
  -H "Authorization: Bearer $FIREWORKS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "accounts/fireworks/models/kimi-k2p5",
    "messages": [
      { "role": "system", "content": "You must call lookup_order exactly once. Do not answer normally." },
      { "role": "user", "content": "Where is order ORD-9012 right now?" }
    ],
    "tools": [{
      "type": "function",
      "function": {
        "name": "lookup_order",
        "strict": true,
        "parameters": {
          "type": "object",
          "additionalProperties": false,
          "properties": { "orderId": { "type": "string" } },
          "required": ["orderId"]
        }
      }
    }],
    "tool_choice": "required",
    "stream": true
  }'

Failure mode (simplified SSE shape)

data: {"choices":[{"delta":{"reasoning_content":"...thinking..."}}]}
data: {"choices":[{"delta":{"reasoning_content":"...still thinking..."}}]}
data: [DONE]
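
On the client side, this shape can be flagged directly. A minimal sketch assuming an OpenAI-compatible client; the model name and tool schema mirror the curl example above, and the check is illustrative rather than an official snippet.

import os
from openai import OpenAI  # Fireworks exposes an OpenAI-compatible endpoint

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key=os.environ["FIREWORKS_API_KEY"],
)

stream = client.chat.completions.create(
    model="accounts/fireworks/models/kimi-k2p5",
    messages=[
        {"role": "system", "content": "You must call lookup_order exactly once. Do not answer normally."},
        {"role": "user", "content": "Where is order ORD-9012 right now?"},
    ],
    tools=[{
        "type": "function",
        "function": {
            "name": "lookup_order",
            "parameters": {
                "type": "object",
                "properties": {"orderId": {"type": "string"}},
                "required": ["orderId"],
            },
        },
    }],
    tool_choice="required",
    stream=True,
)

saw_tool_call = False
for chunk in stream:
    delta = chunk.choices[0].delta if chunk.choices else None
    if delta and delta.tool_calls:
        saw_tool_call = True

# With tool_choice="required", a stream that ends without any tool_calls delta
# is the premature-termination failure described above; flag or retry it.
if not saw_tool_call:
    raise RuntimeError("stream ended during thinking without a tool call")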

What we changed

We added a constrained-generation guardrail that bans special tokens (including EOS) during the “thinking” phase when the request uses trigger-based grammar activation. The goal is simple: don’t allow “premature stop” before the model reaches the delimiter that activates structured/tool generation.
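
Conceptually, the guardrail behaves like a logits mask that keeps EOS unsampleable while the model is still inside the thinking phase. The sketch below uses placeholder token ids and a plain Python list; the real implementation operates on the serving stack's actual special-token ids and logits tensors.

import math

# Placeholder token ids for illustration only; the real guardrail reads the
# model's actual EOS and thinking-delimiter ids from its tokenizer config.
EOS_ID = 2
THINK_END_ID = 7

def mask_premature_stop(logits: list[float], generated_ids: list[int]) -> list[float]:
    """Ban EOS until the thinking delimiter has been emitted, so decoding
    cannot stop before the model reaches structured/tool generation."""
    if THINK_END_ID not in generated_ids:   # still thinking
        logits = list(logits)
        logits[EOS_ID] = -math.inf          # EOS becomes unsampleable
    return logits

# Toy usage over an 8-token vocabulary: EOS stays banned until id 7 appears.
step_logits = [0.1] * 8
print(mask_premature_stop(step_logits, generated_ids=[5, 3]))      # EOS masked
print(mask_premature_stop(step_logits, generated_ids=[5, 3, 7]))   # EOS allowed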

Inference settings affecting quality (and K2.5 preflight immutables): “why is my temperature rejected?”

The story

Many teams standardize sampling knobs globally (e.g., "always use temperature=1 as per Moonshot's recommendation"). But Kimi K2.5's vendor verifier expects certain sampling settings per mode, and some deployments enforce those as immutable to match preflight expectations. Moonshot communicated these to us as hard defaults that raise an HTTP 400 response when a user tries to override them.

What users saw

  • Requests fail fast with a clear 4xx error if they explicitly override certain sampling fields.

Canonical request

{
  "model": "accounts/fireworks/models/kimi-k2p5",
  "messages": [{ "role": "user", "content": "Say hello." }],
  "temperature": 0.2,
  "top_p": 0.7
}

Canonical error (example)

{
  "error": {
    "message": "Parameter 'temperature' is immutable for kimi_k25 and must be 1.0 (thinking mode). Got 0.2."
  }
}

Fireworks takeaway

If you're integrating K2.5 in production, decide deliberately how to handle sampling overrides: some deployments reject them outright, but not every user will tolerate that. We chose to apply the recommended values as defaults while still allowing users to override the params, just as they can with other models on our platform.

Empty reasoning blocks handling: “the prompt got polluted by null reasoning”

The story

Some SDKs serialize optional fields as explicit null. In multi‑turn chat, developers often replay the entire message history back to the model, including prior assistant messages that may contain reasoning_content: null.

If that null gets accidentally rendered into the prompt as a string (depending on template behavior), it can confuse the model in subtle ways — look at it as a prompt engineering side-effect, where the model sees a lot of “null” strings in the message history and is coerced into repeating the same in its responses.

What users saw

  • Strange “phantom tokens” influencing responses (“why is it talking about ‘none’?”).
  • Lower tool-call reliability in multi-turn flows.

Canonical example

{
  "model": "accounts/fireworks/models/kimi-k2p5",
  "messages": [
    { "role": "assistant", "content": "Sure.", "reasoning_content": null },
    { "role": "user", "content": "Now call the tool." }
  ],
  "tools": [ /* ... */ ]
}

What we changed

We sanitize message history before prompt construction by dropping reasoning_content when it is null, keeping the prompt clean and consistent.
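
A minimal sketch of that sanitization step (illustrative; the real serving code handles more fields and message shapes):

def sanitize_messages(messages: list[dict]) -> list[dict]:
    """Drop reasoning_content keys whose value is None so the chat template
    never renders a literal 'null'/'None' into the prompt."""
    cleaned = []
    for message in messages:
        message = dict(message)  # avoid mutating the caller's objects
        if message.get("reasoning_content") is None:
            message.pop("reasoning_content", None)
        cleaned.append(message)
    return cleaned

history = [
    {"role": "assistant", "content": "Sure.", "reasoning_content": None},
    {"role": "user", "content": "Now call the tool."},
]
print(sanitize_messages(history))  # reasoning_content is gone from the first message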

Launch-day overload: “HTTP 200 with an empty body” (silent failures)

The story

Under extreme load, a gateway misconfiguration can turn what should be a clear rate-limit response into a confusing “success” status code with no payload. Many clients treat HTTP 200 as non-retryable, so this becomes a silent reliability killer.

What users saw

  • status_code == 200
  • response body is empty ("")
  • client throws JSONDecodeError when trying to parse JSON
  • retries don’t trigger because the status was “successful”

Canonical client symptom

import requests  # url and payload assumed defined by the caller

resp = requests.post(url, json=payload, timeout=30)
resp.raise_for_status()  # passes (200)
data = resp.json()       # JSONDecodeError: empty response body
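
A defensive client-side guard for this case looks roughly like the sketch below (the helper name and exception are hypothetical): treat an empty 200 body as a retryable error rather than a success.

import requests

class RetryableServerError(Exception):
    """Raised so normal retry/backoff logic can handle the empty-200 case."""

def post_with_empty_body_guard(url: str, payload: dict, timeout: float = 30.0) -> dict:
    resp = requests.post(url, json=payload, timeout=timeout)
    resp.raise_for_status()
    if not resp.content:  # HTTP 200 with an empty body: retryable, not success
        raise RetryableServerError(f"empty body with status {resp.status_code}")
    return resp.json()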

What we changed

We hardened the serving path so overload/expired conditions map to the correct HTTP 429 behavior across streaming and non-streaming, enabling normal retry logic on clients.