
Kimi K2.5 is live on Fireworks at ~1/10 the cost and 2-3x the speed of closed frontier models. As the fastest provider of the open-source Kimi K2.5, Fireworks is seeing unprecedented adoption of the model. Kimi K2.5 is a landmark release for open models, with benchmark results on par with top closed models and standout visual coding quality. But delivering full quality in production requires more than just hosting the model.
Here's how Fireworks ensures developers get the best quality on our platform, and how that process plays out in specific edge cases.

Deploying frontier open models has taught us that quality emerges or degrades in the gaps: between the model and serving stack, between the chat template on Hugging Face and what’s running in the first-party API. Our quality process is built to systematically close gaps that benchmarks miss. We’ve applied this approach through multiple generations of models. We were the first to discover GPT-OSS quality bugs and pioneered KL divergence measurements to assess quantization quality.
Bringing Kimi K2.5 into real-world production reinforced why comprehensive quality validation is needed beyond benchmark scores alone. The examples below highlight specific classes of issues that routinely surface when deploying frontier open models—and the engineering strategies we use to address them. Detailed technical breakdowns for each case are included in the appendix.
1. Premature stop during thinking — Under load, a small fraction of tool-required requests ended early while the model was still thinking, so the stream stopped before any tool call was emitted. To developers this looked like a flaky agent that occasionally returned nothing useful.
2. Sampling parameter constraints — Kimi K2.5's vendor verifier expects specific sampling settings per mode, and some deployments enforce them as immutable, rejecting overrides with HTTP 400. Teams that standardize sampling knobs globally ran into unexpected request failures.
3. reasoning_content: null contamination — SDKs that serialize optional fields as explicit null can leak literal "null" strings into replayed chat history, subtly coercing the model into echoing "null" in its own responses.
4. Launch day overload — Under launch-day load, a gateway misconfiguration returned HTTP 200 with empty bodies instead of 429 rate-limit errors. Because the responses appeared successful, retries never triggered and users saw silent failures with no indication anything had gone wrong.
Beyond specific errors: We see ongoing instability in model outputs when using thinking and tool calling together. We believe this is a model-level issue (for example, the K2VV tool-calling benchmark runs only in non-thinking mode) and are working with the Kimi team on potential fixes.
Taken together, these examples highlight a broader lesson: production quality is an end-to-end property. It emerges from the interaction between model behavior, decoding constraints, serving infrastructure, and client expectations—not from the model alone.
Benchmarks don’t tell the full story, but they remain a useful sanity check for validating production readiness. We matched the official Kimi API across K2VV results.
K2VV is Kimi’s vendor-verification suite, focused on the areas most vulnerable to infrastructure failure: tool-calling correctness, structured outputs, and multimodal reasoning. We supplemented it with longer agentic evaluations to catch what the benchmarks miss.
All evaluations run in thinking mode unless otherwise noted.
| Benchmark | Kimi Model Card | Kimi Official API (Measured by FW) | Fireworks API |
|---|---|---|---|
| SWE-Bench | 76.8 (custom harness) | 76.8 | 76.8 |
| Terminal-Bench 2 (non-thinking) | 50.8 (custom harness) | 50.6 | |
| Tau2-bench Airline | 65 | 68 | |
| AIME 2025 | 96.1 | 95.7 | 95.0 |
| K2VV ToolCall: F1 (non-thinking) | 84 | 83.9 | |
| Benchmark | Kimi Model Card | Fireworks API |
|---|---|---|
| MMMU Pro Vision | 77.4 | 77.9 |
| MMMU Pro Vision (non-thinking) | 74.7 | 74.5 |
| OCRBench | 91 | 91.7 |
Note: All scores are an average of 3 runs, except Tau2-bench (average of 4 runs) and SWE-Bench (average of 2 runs).
Harnesses used for running benchmarks:
Some benchmarks abstract away important production considerations:
Conclusion: We built this process for Kimi K2.5, but it reflects how we approach every model we ship at Fireworks. As models continue to advance, the gap between benchmark performance and production reliability will only grow unless infrastructure, decoding, and evaluation evolve alongside them.
This is the work we focus on: making frontier open models dependable in real systems. If you encounter edge cases we haven’t covered, let us know—we’re committed to closing the benchmark gap together.
Appendix
Case 1: Premature stop during thinking
The story
Under load, some tool-required requests would occasionally end early while the model was still “thinking”. For developers, that looked like a flaky agent: 99% of the time it called the tool; 1% of the time it returned nothing useful (or just a fragment) and never reached the tool call.
What users saw
Canonical example (streaming request)
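A minimal sketch of the request shape involved, assuming the OpenAI-compatible Chat Completions API on Fireworks; the model id, tool schema, and prompt are illustrative:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="<FIREWORKS_API_KEY>",
)

stream = client.chat.completions.create(
    model="accounts/fireworks/models/kimi-k2p5",  # illustrative model id
    messages=[{"role": "user", "content": "What's the weather in Tokyo right now?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    tool_choice="required",  # the request is only useful to the caller if a tool call comes back
    stream=True,
)

for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    if delta.tool_calls:
        print("tool call delta:", delta.tool_calls)
```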
Failure mode (simplified SSE shape)
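In the failing requests, the stream closed while still inside the thinking span, with a terminal finish_reason and no tool_calls delta ever emitted. A simplified illustration, assuming OpenAI-compatible SSE chunks with abbreviated field values:

```
data: {"choices":[{"delta":{"reasoning_content":"Let me call the weather tool..."},"finish_reason":null}]}

data: {"choices":[{"delta":{},"finish_reason":"stop"}]}

data: [DONE]
```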
What we changed
We added a constrained-generation guardrail that bans special tokens (including EOS) during the “thinking” phase when the request uses trigger-based grammar activation. The goal is simple: don’t allow “premature stop” before the model reaches the delimiter that activates structured/tool generation.
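Conceptually, the guardrail acts like a logits processor that masks stop-capable tokens until the thinking delimiter has been generated. A minimal sketch of the idea, assuming an HF-transformers-style LogitsProcessor; this is illustrative, not Fireworks' production implementation:

```python
import torch
from transformers import LogitsProcessor

class NoStopDuringThinking(LogitsProcessor):
    """Ban EOS and other special tokens until the thinking-close delimiter appears."""

    def __init__(self, banned_token_ids: list[int], think_end_token_id: int):
        self.banned_token_ids = banned_token_ids      # e.g. EOS and other special tokens
        self.think_end_token_id = think_end_token_id  # token that closes the thinking span

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        for i in range(input_ids.shape[0]):
            # A real implementation would track only the tokens generated in this turn.
            still_thinking = self.think_end_token_id not in input_ids[i].tolist()
            if still_thinking:
                # No premature stop before the delimiter that activates structured/tool generation.
                scores[i, self.banned_token_ids] = float("-inf")
        return scores
```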
Case 2: Sampling parameter constraints
The story
Many teams standardize sampling knobs globally (e.g., “always use temperature=1 per Moonshot’s recommendation”). But Kimi K2.5’s vendor verifier expects certain sampling settings per mode, and some deployments enforce those as immutable to match preflight expectations. Moonshot communicated these to us as hard defaults that return an HTTP 400 response when a user tries to override them.
What users saw
Canonical request
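A sketch of the kind of request that trips the constraint, assuming the OpenAI-compatible API; the override values are illustrative:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="<FIREWORKS_API_KEY>",
)

# A client that standardizes its own sampling settings across every model it calls.
resp = client.chat.completions.create(
    model="accounts/fireworks/models/kimi-k2p5",  # illustrative model id
    messages=[{"role": "user", "content": "Summarize this incident report."}],
    temperature=0.2,  # overrides the mode-specific default expected by the vendor verifier
    top_p=0.9,
)
```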
Canonical error (example)
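On a deployment that enforces the hard defaults, the override is rejected. The exact error body varies; an illustrative shape:

```
HTTP/1.1 400 Bad Request

{
  "error": {
    "message": "temperature is fixed for this model in thinking mode and cannot be overridden",
    "type": "invalid_request_error"
  }
}
```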
Fireworks takeaway
If you’re integrating K2.5 in production, don’t assume every deployment tolerates arbitrary sampling overrides. We decided to apply the recommended settings as defaults while still letting users override the params, just as they can with other models on our platform.
Case 3: reasoning_content: null contamination
The story
Some SDKs serialize optional fields as explicit null. In multi‑turn chat, developers often replay the entire message history back to the model, including prior assistant messages that may contain reasoning_content: null.
If that null gets accidentally rendered into the prompt as a literal string (depending on template behavior), it can confuse the model in subtle ways. Think of it as a prompt-engineering side effect: the model sees repeated “null” strings in the message history and is coerced into repeating them in its own responses.
What users saw
Canonical example
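A sketch of a replayed multi-turn history that triggers the issue; the messages themselves are illustrative:

```python
# Multi-turn replay where an SDK serialized an optional field as an explicit null.
messages = [
    {"role": "user", "content": "What's our refund policy?"},
    {
        "role": "assistant",
        "content": "Refunds are available within 30 days of purchase.",
        "reasoning_content": None,  # sent as `"reasoning_content": null` on the wire
    },
    {"role": "user", "content": "And for digital goods?"},
]
# If the chat template renders that null as the literal string "null", the model
# starts seeing "null" sprinkled through its own prior turns.
```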
What we changed
We sanitize message history before prompt construction by dropping reasoning_content when it is null, keeping the prompt clean and consistent.
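A minimal sketch of that sanitization step; illustrative, not the exact production code:

```python
def sanitize_messages(messages: list[dict]) -> list[dict]:
    """Drop reasoning_content fields that are explicitly null before prompt construction."""
    return [
        {k: v for k, v in msg.items() if not (k == "reasoning_content" and v is None)}
        for msg in messages
    ]
```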
Case 4: Launch day overload
The story
Under extreme load, a gateway misconfiguration can turn what should be a clear rate-limit response into a confusing “success” status code with no payload. Many clients treat HTTP 200 as non-retryable, so this becomes a silent reliability killer.
What users saw
Canonical client symptom
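A sketch of the client-side symptom, assuming a plain requests call; the payload is illustrative:

```python
import requests

resp = requests.post(
    "https://api.fireworks.ai/inference/v1/chat/completions",
    headers={"Authorization": "Bearer <FIREWORKS_API_KEY>"},
    json={
        "model": "accounts/fireworks/models/kimi-k2p5",  # illustrative model id
        "messages": [{"role": "user", "content": "ping"}],
    },
)
print(resp.status_code)  # 200: looks like success, so retry logic never fires
print(resp.text)         # "":  empty body instead of a 429 rate-limit error
```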
What we changed
We hardened the serving path so overload and expiry conditions map to the correct HTTP 429 behavior across streaming and non-streaming responses, enabling normal client retry logic.