Announcing custom models and on-demand H100s with 50%+ lower costs and latency than vLLM
By Ray Thai|6/3/2024
At Fireworks, we’re empowering developers to productionize generative AI with unparalleled speed, quality, and cost. In March, we launched on-demand (dedicated) deployments, which let developers provision their own GPU(s) for guaranteed latency and reliability. These GPUs run on the proprietary Fireworks serving stack, which delivers much faster serving speeds than competing and open-source solutions like vLLM, even on identical hardware.
Today, we’re making on-demand deployments more configurable and powerful with the launch of:
- Custom model uploads, so you can serve your own Hugging Face-format models on Fireworks infrastructure
- On-demand H100 GPUs
- Configurable autoscaling, including scaling to and from zero
We’ve seen the open-source community’s strong response to custom models like Hermes 2 Pro Llama 3, a variant of Llama 3 with enhanced function calling and task following. With countless custom models to choose from, developers can pick the model that delivers the best quality for their use case. We’ve made it simple to import these models into Fireworks and serve them on our affordable, optimized infrastructure.
How do I import models? Simply provide Fireworks with your model files in the Hugging Face model format and upload the model with the following command.
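The snippet below is a sketch of the upload step using the firectl CLI; the model ID and local path are placeholders, and the firectl docs have the authoritative syntax:

```bash
# Upload a Hugging Face-format model directory to Fireworks
# (model ID and path are placeholders; verify syntax against the firectl docs)
firectl create model my-custom-model /path/to/hf/checkpoint/
```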
Then you can spin up an on-demand deployment in seconds without needing to install or configure any software. We support models from the most popular model architectures, including Llama and Mixtral. See our full docs for more details.
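For illustration, creating a deployment for the uploaded model might look like the following; this is a sketch, and the account/model IDs and accelerator flag are assumptions to check against the docs:

```bash
# Create an on-demand deployment backed by the uploaded model
# (account/model IDs and the accelerator flag are assumptions)
firectl create deployment accounts/my-account/models/my-custom-model \
  --accelerator-type NVIDIA_H100_80GB
```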
At Fireworks, we know there’s no one-size-fits-all model or serving configuration. The best serving stack configuration depends on your specific goals and prompts. We’ve been personally tailoring serving configurations for our enterprise customers, and now we’re bringing those personalized optimizations and choices to everyone through our on-demand GPUs.
We’ve made it easier and more configurable to use on-demand deployments by introducing the ability to scale to and from 0. By default, GPUs will start up and scale down automatically based on usage. If you haven’t used your model for a while (default one hour), your GPU capacity will be scaled to 0 and you won’t be charged for this idle time. When you get a request, your GPU will be automatically spun back up and usable again. Users do not pay for start-up time.
This makes on-demand deployments both easier to use and more cost-effective. Simply query the API - no need to set up the deployment before each use! You also reduce costs by ensuring that your deployments are scaled down when they’re not in use.
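For example, a request to your deployed model goes to the standard Fireworks chat completions endpoint; the account and model IDs below are placeholders:

```bash
# Query a custom model on an on-demand deployment (account/model IDs are placeholders)
curl https://api.fireworks.ai/inference/v1/chat/completions \
  -H "Authorization: Bearer $FIREWORKS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "accounts/my-account/models/my-custom-model",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

If the deployment has scaled to zero, this first request spins it back up automatically.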
Beyond scaling from 0, you can also set your deployments to scale to multiple GPUs to support spikes in traffic. We’ve also made our auto-scaling logic configurable, meaning that you can set the specific time to wait before scaling a deployment up or down with traffic. This gives you maximum flexibility in optimizing for cost savings and user experience.
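As a rough sketch of what configuring this might look like with firectl (the flag names and values here are assumptions; consult the docs for the exact options):

```bash
# Configure autoscaling bounds and the idle window before scale-to-zero
# (flag names and values are assumptions, not confirmed options)
firectl update deployment my-deployment \
  --min-replica-count 0 \
  --max-replica-count 4 \
  --scale-down-window 30m
```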
Through these improvements, Fireworks on-demand deployments provide the fastest and most affordable solution for serving LLM traffic on private GPUs with predictable performance.
On-demand deployments provide reliable, fast serving at scale for businesses that are ready to move beyond our serverless offering but not yet ready for long-term enterprise contracts. Curious about the performance details of on-demand deployments, or how they compare to serverless and other frameworks? Check out our deep dive on on-demand deployments.
When you’re ready to get started, check out our docs. At Fireworks, we’re creating the best platform for everyone to serve generative AI models in production, from nascent start-ups to large enterprises.
We’d love your feedback! Please contact us on Discord or Twitter. If you’re looking to learn more about on-demand deployments, feel free to directly schedule time with our PM (https://calendly.com/raythai). We can’t wait to see what you build!