It’s been an incredible week for the open-source AI community. The release of GPT-OSS marked a significant milestone, opening up new possibilities for developers and researchers worldwide. This is especially exciting as it comes from a US frontier lab 🇺🇸. At Fireworks.ai, we believe that making a model available is only the first step. The real work lies in making it reliable, performant, and truly production-ready.
From the moment GPT-OSS was released, our team worked tirelessly not just to host it, but to provide the single best implementation available anywhere. Our "Quality First" approach meant diving deep into the code, identifying critical issues, and deploying robust fixes to ensure our partners and the entire community could build on a solid foundation.
We were proud to help power the official demo site at gpt-oss.com and support the Hugging Face team to ensure everyone could experience the power of GPT-OSS from day one.
For modern AI applications, tool calling (or function calling) is not just a feature; it's the bridge between language models and the real world. It allows models to interact with APIs, access external data, and perform actions, transforming them from simple chatbots into powerful agents.
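To make this concrete, here is a minimal sketch of what a tool-calling request looks like in practice. It assumes Fireworks’ OpenAI-compatible chat completions endpoint and the gpt-oss-120b model ID from the model page linked at the end of this post; the `get_weather` function is purely illustrative.

```python
# Minimal sketch of a tool-calling request against an OpenAI-compatible endpoint.
# Assumptions: Fireworks' chat completions API and the gpt-oss-120b model ID;
# the get_weather tool is a hypothetical example.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="YOUR_FIREWORKS_API_KEY",
)

# A hypothetical weather tool the model may choose to call.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="accounts/fireworks/models/gpt-oss-120b",
    messages=[{"role": "user", "content": "What's the weather in Tokyo right now?"}],
    tools=tools,
)

# If the model decides to call the tool, the structured call arrives here
# instead of a plain-text answer; your application executes it and returns the result.
print(response.choices[0].message.tool_calls)
```

When this structured output is malformed or inconsistent, every downstream action built on it breaks, which is why the issues described next mattered so much.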
However, the initial release of GPT-OSS, while powerful, had inconsistencies and bugs in its tool-calling implementation. This meant developers would face malformed outputs, failed function calls, and unreliable behavior—barriers to building robust, production-grade applications.
We immediately focused our engineering efforts on solving this. Our deep dive paid off, and the results were quickly recognized by the community. OpenRouter, a leading LLM gateway that rigorously tests model providers, highlighted the quality of our implementation, naming it the best for tool calling.
Our commitment to quality extends to the entire open-source ecosystem. After identifying and fixing a critical bug in the model's tokenizer logic, we didn't keep it to ourselves. We immediately upstreamed our fix, sharing it with the community to ensure that every implementation of GPT-OSS could benefit from this improvement. This is what true open-source collaboration looks like.
Here’s where it gets exciting. After implementing our fixes, particularly the harmony tokenizer fix, we re-ran the tool-calling benchmarks. Our findings suggest that the initial scores reported by OpenAI were based on the model before these critical bugs were addressed.
With the fixes in place, the model’s true capabilities are even more impressive than we thought. We believe OpenAI under-reported their tool-calling benchmarks by 5-10%, likely due to issues in the original tool-calling implementation or differences in test setup.
Here are the scores from our production-ready, fully patched GPT-OSS implementation across several challenging tool-use benchmarks.
The numbers reported above are an average over 8 runs.
Note that for benchmarks that are meant to be run with temperature > 0, we strongly recommend that model builders release results with confidence intervals going forward. This gives users clearer expectations of model performance and helps uncover cases where model builders accidentally under-report their results.
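As an illustration of what such reporting could look like, here is a small sketch that computes a mean and a 95% confidence interval over 8 run scores. The scores are placeholders, not our benchmark numbers.

```python
# Sketch: mean and 95% confidence interval over repeated benchmark runs.
# The scores below are placeholders, not actual GPT-OSS benchmark results.
import statistics

scores = [71.2, 69.8, 70.5, 72.1, 70.9, 71.6, 69.4, 70.8]  # 8 hypothetical runs

mean = statistics.mean(scores)
stderr = statistics.stdev(scores) / len(scores) ** 0.5  # sample std dev / sqrt(n)

# For n=8 runs (7 degrees of freedom), the two-sided 95% Student's t critical value is ~2.365.
t_crit = 2.365
half_width = t_crit * stderr

print(f"score = {mean:.1f} ± {half_width:.1f} (95% CI, n={len(scores)})")
```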
For developers building the next generation of AI applications, quality is everything, and that is exactly what you get when you build on Fireworks.ai.
The hype around a new model release is exciting, but enduring value comes from meticulous engineering and a relentless focus on quality. At Fireworks.ai, we didn’t just host GPT-OSS; we perfected its implementation to deliver the performance and reliability the community deserves.
Try out gpt-oss on Fireworks at https://fireworks.ai/models/fireworks/gpt-oss-120b and let us know if you have any feedback!