Optimizing Large Language Model (LLM) inference performance is a complex space, and no solution is one-size-fits-all. Use cases may require low latency (chatbots, coding assistants, Q&A) or high throughput (catalog creation/improvement, model evaluation); they may require processing long inputs (summarization) or short inputs (chat); or they may require generating long outputs (code completion) or short outputs (intent classification). In this post, we explore how different use cases call for different optimization objectives and show how Fireworks provides many deployment configurations that optimize for these cases.
For LLMs, workload characteristics vary drastically across use cases: how long the inputs and outputs are, how sensitive the application is to latency, and how much sustained throughput it needs. Customers should take these and other factors into consideration when determining how to best evaluate an LLM solution.
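To make this concrete, here is a minimal back-of-the-envelope sketch of how input and output lengths shape per-request latency under a simple prefill-plus-decode model. The speeds used are illustrative assumptions, not Fireworks measurements:

```python
# Rough per-request latency model:
# total latency ≈ time-to-first-token (prefill) + output_tokens * time-per-output-token (decode).
# Both rates below are assumed placeholder values, not measured numbers.

PREFILL_TOKENS_PER_SEC = 10_000   # assumed prefill throughput
DECODE_SEC_PER_TOKEN = 0.03       # assumed per-output-token decode time

def estimate_latency(input_tokens: int, output_tokens: int) -> tuple[float, float]:
    """Return (time_to_first_token, total_latency) in seconds."""
    ttft = input_tokens / PREFILL_TOKENS_PER_SEC
    total = ttft + output_tokens * DECODE_SEC_PER_TOKEN
    return ttft, total

# Input length mainly drives time-to-first-token;
# output length mainly drives total generation time.
for name, shape in [("summarization-like (long in, short out)", (4000, 150)),
                    ("code-completion-like (short in, long out)", (300, 800))]:
    ttft, total = estimate_latency(*shape)
    print(f"{name}: TTFT {ttft:.2f}s, total {total:.2f}s")
```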
Given the above problem space, there are many ways an LLM deployment can be tuned to optimize for these factors: a configuration can prioritize the lowest possible latency, the highest throughput (and therefore cost efficiency), or a balance of the two, depending on the expected load and traffic pattern.
Given the above complexity, at Fireworks we provide many different solutions to customers based on their requirements. We offer the easy-to-use on-demand Developer PRO tier, which can be accessed via the console for no-commitment, lightweight testing of LLMs.
On the other end of the spectrum, we provide optimized deployments tailored to customer needs. As a concrete example, the graph below examines a Llama-70B deployment with 550 input tokens and 150 output tokens, running on 8x NVIDIA A100. We explore latency at varying levels of load (QPS) for three deployment configurations. The baseline blue line is the Fireworks Developer PRO setup, optimized for on-demand experimentation (and one we continue to improve as well). The green line shows a latency-optimized configuration, which achieves the fastest response time at low load but sacrifices response time at higher load. Finally, the red line shows a configuration that balances latency and throughput, providing low latency across many levels of load.
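For readers who want to run a similar latency-vs-QPS sweep against their own deployment, here is a minimal sketch of an open-loop load generator for an OpenAI-compatible chat completions endpoint. The URL, model id, prompt, and API key handling are placeholders to adapt, not a description of the Fireworks benchmark suite:

```python
import asyncio, os, statistics, time
import httpx  # pip install httpx

URL = "https://api.fireworks.ai/inference/v1/chat/completions"  # placeholder endpoint
MODEL = "accounts/fireworks/models/llama-v2-70b-chat"           # placeholder model id
HEADERS = {"Authorization": f"Bearer {os.environ['FIREWORKS_API_KEY']}"}
BODY = {"model": MODEL,
        "messages": [{"role": "user", "content": "Summarize ..."}],  # placeholder prompt
        "max_tokens": 150}

async def one_request(client: httpx.AsyncClient) -> float:
    """Send one request and return its end-to-end latency in seconds."""
    start = time.perf_counter()
    resp = await client.post(URL, headers=HEADERS, json=BODY, timeout=120)
    resp.raise_for_status()
    return time.perf_counter() - start

async def measure(qps: float, duration_sec: int = 30) -> float:
    """Fire requests at a fixed arrival rate and return mean latency."""
    async with httpx.AsyncClient() as client:
        tasks = []
        for _ in range(int(qps * duration_sec)):
            tasks.append(asyncio.create_task(one_request(client)))
            await asyncio.sleep(1.0 / qps)  # open-loop arrivals at the target QPS
        latencies = await asyncio.gather(*tasks)
    return statistics.mean(latencies)

async def main():
    for qps in (0.5, 1, 2, 4, 8):
        print(f"{qps} QPS -> mean latency {await measure(qps):.2f}s")

asyncio.run(main())
```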
Fireworks works with customers to choose the configuration that best suits their use case.
There is no one-size-fits-all solution for LLM inference. Customers should keep in mind their use case requirements (e.g., input and output lengths) and optimization objectives (throughput, latency, cost), and match vendor solutions to those requirements. Try the Fireworks platform and email [email protected] to inquire about optimized LLM deployments.
Getting the right tools to measure performance is crucial, too. We're releasing the benchmark suite we've been using at Fireworks to evaluate the performance tradeoffs described above. We hope to contribute to a rich ecosystem of knowledge and tools (e.g. those published by Databricks and Anyscale) that help customers optimize LLMs for their use cases. Stay tuned for more details on shared metrics and our benchmarking experience in the next blog post.