GPUs on-demand: Not serverless, not reserved, but some third thing
By Ray Thai | 6/3/2024
Note: Fireworks A100 and H100 prices have since been reduced to $2.90 and $5.80 per hour, respectively!
One of the most rewarding things at Fireworks is being a part of the scaling journey for many exciting AI start-ups. Over the last few months, we’ve seen an explosion in the number of companies beginning to productionize generative AI. A question that we commonly get is:
“How should I think about serving LLMs via (1) a serverless, token-based option vs. (2) a dedicated GPU, usage-time-based option?”
We’re writing this post to help explain the tradeoffs of serverless vs dedicated GPU options.
Fireworks has three offerings for LLM serving: (1) Serverless, (2) On-demand, and (3) Enterprise Reserved.
Serverless
Fireworks hosts our most popular models “serverless,” meaning that we operate GPUs 24/7 to serve these models and provide an API that any of our users can call. Our serverless offering is the fastest widely available platform, and we’re proud of its production-readiness. Serverless is the perfect option for running limited traffic and experimenting with different LLM set-ups.
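To show how simple this is, here’s a minimal sketch of calling a serverless model through our OpenAI-compatible chat completions API. The model ID is illustrative; check the model catalog in our docs for the current serverless line-up.

```python
# Minimal sketch: query a Fireworks serverless model via the
# OpenAI-compatible API. The model ID is illustrative; see the
# Fireworks docs for the current serverless catalog.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key=os.environ["FIREWORKS_API_KEY"],  # your Fireworks API key
)

response = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3-8b-instruct",  # illustrative
    messages=[{"role": "user", "content": "Explain serverless vs. dedicated GPUs in one paragraph."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```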
However, serverless has limitations:

Serverless is not personalized for you - Your LLM serving can be made faster, higher-quality, or lower-cost through personalization at several levels of the serving stack. Our serverless platform is designed to excel at a variety of goals and to support hundreds of base and fine-tuned models, but a personally tailored stack may still provide a better experience.
Serverless performance is affected by others - Other Fireworks users share our serverless deployment, so speeds vary with overall usage. If you happen to use the deployment during its quietest hours, you’ll see the fastest speeds, and vice versa. Independent benchmarks still show our public platform has the lowest variation in latency, but for certain use cases, like live voice chat agents, consistent performance is paramount.
Serverless has volume constraints - Generally, once a business has enough sustained traffic to keep significant GPU capacity busy, serverless stops making sense: per-token pricing ends up costing more than renting the underlying GPUs by the hour, and shared deployments impose rate limits that cap throughput. A rough break-even sketch follows.
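To make the crossover concrete, here’s a back-of-envelope sketch. The serverless price and throughput figures are hypothetical placeholders; the H100 price comes from the note at the top of this post. Plug in current numbers from our pricing page.

```python
# Back-of-envelope break-even between serverless (per-token) and
# on-demand (per-GPU-hour) pricing. Inputs are illustrative placeholders.

SERVERLESS_PRICE_PER_1M_TOKENS = 0.20      # USD, hypothetical serverless rate
H100_PRICE_PER_HOUR = 5.80                 # USD, from the note above
DEDICATED_THROUGHPUT_TOK_PER_SEC = 15_000  # hypothetical sustained throughput

# Tokens one dedicated H100 can serve per hour at that throughput.
tokens_per_hour = DEDICATED_THROUGHPUT_TOK_PER_SEC * 3600

# What the same volume would cost on the per-token serverless API.
serverless_cost_per_hour = tokens_per_hour / 1e6 * SERVERLESS_PRICE_PER_1M_TOKENS

print(f"Dedicated H100: ${H100_PRICE_PER_HOUR:.2f}/hr")
print(f"Serverless at the same volume: ${serverless_cost_per_hour:.2f}/hr")

# Fraction of the GPU's capacity you must sustain before on-demand
# becomes the cheaper option.
break_even = H100_PRICE_PER_HOUR / serverless_cost_per_hour
print(f"On-demand wins above ~{break_even:.0%} sustained utilization")
```

With these made-up numbers, a single H100 pays for itself once you sustain roughly half its capacity; your actual crossover depends on the model, prices, and workload.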
Enterprise Reserved GPUs
Given these constraints, companies with large usage volumes often reserve their own private GPU(s) for set periods of time. This commitment also enables Fireworks to help companies personally configure their serving set-up and provide SLAs and guaranteed support. However, many scale-up companies in the midst of prototyping are unable to commit to enterprise reserved capacity.
On-demand GPUs
To make it easier for scaling teams to benefit from Fireworks, we offer on-demand, dedicated GPUs running our FireAttention stack. Users pay per hour, and deployments autoscale with traffic, down to 0 GPUs, so developers pay nothing during idle periods (see the sketch after this list). Compared to serverless, using your own GPU can provide:

Consistent performance - your speeds no longer fluctuate with other users’ traffic.
A personalized set-up - serve the exact model (including your own fine-tunes) and configuration your workload needs.
Headroom for volume - no shared-capacity constraints, and better unit economics at sustained load.
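As a rough illustration of how scale-to-zero affects the bill, the sketch below compares an always-on deployment with one that autoscales to match demand. The traffic pattern is a made-up example; the H100 price comes from the note at the top of this post.

```python
# Rough sketch: daily billing for an always-on deployment vs. one that
# autoscales to zero when idle. The traffic pattern is a made-up example.

H100_PRICE_PER_HOUR = 5.80  # USD, from the note above

# Hypothetical demand: replicas needed for each hour of the day
# (idle overnight, peaking during business hours).
replicas_needed = [0] * 7 + [1] * 3 + [2] * 6 + [1] * 4 + [0] * 4  # 24 hours

always_on_replicas = max(replicas_needed)  # provision for peak, 24/7
always_on_cost = always_on_replicas * 24 * H100_PRICE_PER_HOUR
autoscaled_cost = sum(replicas_needed) * H100_PRICE_PER_HOUR

print(f"Always-on ({always_on_replicas} GPUs x 24h): ${always_on_cost:.2f}/day")
print(f"Autoscaled to demand: ${autoscaled_cost:.2f}/day")
```

For this particular pattern, scale-to-zero cuts the bill by more than half; spikier, more idle-heavy traffic benefits even more.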
An important consideration in deciding to use on-demand deployments is expected performance vs price, so we have included some performance details and FAQs.
What latency improvements can I expect compared to vLLM or hosting my own GPU?
Generally, we see that Fireworks is ~40-60% faster than open-source solutions, though performance varies by model and workload. We obtained the results below using the minimum world size (the smallest number of GPUs that can host a given model) on H100s, comparing Fireworks against vLLM. Latency was measured under heavy load, so the Fireworks GPUs simultaneously delivered significantly better speed and throughput.
[Charts: latency vs. throughput for Mixtral 8x7B and Llama 3 8B, Fireworks vs. vLLM on H100s]
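If you want to reproduce this kind of comparison against your own endpoints (Fireworks, vLLM, or anything OpenAI-compatible), here’s a minimal load-test sketch. The endpoint, model ID, and concurrency level are placeholders to adapt to your set-up.

```python
# Minimal load-test sketch for any OpenAI-compatible endpoint.
# Endpoint, model ID, and concurrency are illustrative placeholders.
import os
import time
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",  # or your vLLM server URL
    api_key=os.environ["FIREWORKS_API_KEY"],
)

def one_request(_: int) -> float:
    """Send one chat request and return its wall-clock latency in seconds."""
    start = time.perf_counter()
    client.chat.completions.create(
        model="accounts/fireworks/models/llama-v3-8b-instruct",  # placeholder
        messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
        max_tokens=128,
    )
    return time.perf_counter() - start

CONCURRENCY = 32  # simulate heavy load
REQUESTS = 256

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    latencies = sorted(pool.map(one_request, range(REQUESTS)))

p50 = latencies[len(latencies) // 2]
p95 = latencies[int(len(latencies) * 0.95)]
print(f"p50={p50:.2f}s  p95={p95:.2f}s over {REQUESTS} requests")
```

Comparing p50 and p95 at matched concurrency is what we mean by measuring latency under heavy load.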
What throughput/cost improvements can I expect compared to vLLM or hosting my own GPU?
The efficiency of the FireAttention serving stack means that the same H100/A100 chip running on FireAttention can handle significantly more volume than an H100/A100 running software like vLLM.
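To translate a throughput advantage into dollars: the cost per token on a dedicated GPU is just the hourly price divided by sustained throughput. The throughput figures below are hypothetical, purely to show the arithmetic.

```python
# Cost per 1M tokens on a dedicated GPU = hourly price / tokens per hour.
# The throughput figures are hypothetical, purely to show the arithmetic.

H100_PRICE_PER_HOUR = 5.80  # USD, from the note above

def cost_per_million_tokens(price_per_hour: float, tokens_per_sec: float) -> float:
    return price_per_hour / (tokens_per_sec * 3600) * 1e6

for stack, tps in [("baseline stack", 8_000), ("more efficient stack", 15_000)]:
    cost = cost_per_million_tokens(H100_PRICE_PER_HOUR, tps)
    print(f"{stack} at {tps:,} tok/s: ${cost:.3f} per 1M tokens")
```

Doubling sustained throughput on the same chip roughly halves your cost per token, which is why serving-stack efficiency matters so much at scale.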
How else is running a GPU on Fireworks different from provisioning my own GPU with vLLM?

Beyond raw performance, on-demand deployments come with the serving infrastructure managed for you: Fireworks handles provisioning and autoscaling (including scaling to zero when idle), and every deployment runs the same FireAttention stack described above, so there’s no serving software to stand up, tune, or operate yourself.
I’m using Fireworks’ serverless model hosting. When does it make sense for me to "graduate" to on-demand?
Using GPUs on-demand can provide a cheaper, faster, and more reliable solution than serverless, especially at higher volumes. On-demand GPUs can also act as a great bridge to a longer-term reserved GPU solution, where developers benefit from completely personalized set-ups and enterprise SLAs. Fireworks is committed to enabling businesses to scale on our platform, from testing an initial prototype on serverless through to serving millions of end users on reserved GPUs.
When you’re ready to get started, check out our docs or hear about our latest product updates for on-demand in this blog post. Looking for more performance numbers or interested in chatting directly about on-demand deployments? Feel free to directly schedule time with our PM (https://calendly.com/raythai). We can’t wait to see what you build!