How Fireworks evaluates quantization precisely and interpretably
By Fireworks Team|8/1/2024
With the release of Llama 3.1, there’s been considerable discussion about the benefits and tradeoffs of different quantization methods. Evaluating quantization quality is notoriously tricky, so in this post, we share how Fireworks approaches quantization and evaluates tradeoffs.
As with LLM inference in general, we believe there’s no one-size-fits-all solution for quantization. Inference and quantization setups are ideally tailored to a specific use case.
A common misunderstanding with quantization is that it’s black and white and that a model is either quantized (well) or not. However, there are both many possible degrees of quantization and many different parts of the model (and of inference) that can be quantized independently.
Generally, as quantization becomes more aggressive, performance improves and quality degrades. However, it’s possible to achieve disproportionate performance improvements with an insignificant effect on quality. The tradeoff between quality and performance varies based on factors like the specific model, the quantization technique, and the use case.
Generally, the goal with quantization is to hit a sweet spot on the Pareto curve of quality vs speed. We work with enterprise customers individually to find this spot. However, for our public endpoints there’s no one perfect configuration since there are a variety of use cases on the platform.
How should one measure model quality for quantization? Since quality is use-case dependent, developers are ultimately the best judges of quality for their applications. However, to measure general model quality, we prefer to focus on divergence metrics (how much quantization changes the outputs of a particular model) as opposed to pure capability metrics (i.e., how well the quantized model scores on general benchmarks like MMLU).
This idea is well described in the recent “Accuracy is Not All You Need” paper from Microsoft Research. Simply put, quantization introduces noise that can flip some right answers to wrong but also some wrong answers to right (especially when the model is “on the fence”). This skews accuracy. Focusing on changes in the model’s probability distribution is more precise and hence gives enough resolution to interpret the effects of individual quantization techniques.
Specifically, we focus on two divergence metrics between the quantized and reference models.
We further break down these metrics for prefill and generation, since different parts of inference may use different quantization techniques.
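As a minimal sketch of what that breakdown could look like (assuming you have already computed a per-position divergence for one request, as in the sketch later in this post; the helper below is illustrative rather than our exact tooling):

```python
def split_divergence(per_position_kl: list[float],
                     prompt_len: int) -> tuple[float, float]:
    """Average a per-token divergence separately over prefill (prompt)
    tokens and generation (completion) tokens."""
    prefill = per_position_kl[:prompt_len]
    generation = per_position_kl[prompt_len:]
    mean = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return mean(prefill), mean(generation)
```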
Our methodology is as follows:
We evaluated KL divergence for Llama 3.1 8B Instruct at 4 different “levels of quantization”, in which different parts of the model are quantized. We also report MMLU for comparison. The Fireworks platform exposes logprobs, so we encourage folks to evaluate divergence results for themselves. We also encourage other providers to expose logprobs to help the community analyze quantization tradeoffs. Results are shown below.
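Below is a minimal sketch of how one might approximate per-token KL divergence from the top-k logprobs returned by a reference endpoint and a quantized endpoint for the same request. The function names, data layout, and top-k truncation are assumptions for illustration, not our exact methodology:

```python
import math

def position_kl(ref_top_logprobs: dict[str, float],
                quant_top_logprobs: dict[str, float],
                floor: float = 1e-10) -> float:
    """Approximate KL(reference || quantized) at a single token position.

    Each dict maps a token to its logprob among the top-k alternatives an
    endpoint returns. Tokens missing from the quantized model's top-k are
    assigned a small floor probability, so this only approximates the true
    KL over the full vocabulary.
    """
    kl = 0.0
    for token, ref_lp in ref_top_logprobs.items():
        p = math.exp(ref_lp)                        # reference probability
        q_lp = quant_top_logprobs.get(token, math.log(floor))
        kl += p * (ref_lp - q_lp)
    return kl

def mean_kl(ref_positions: list[dict[str, float]],
            quant_positions: list[dict[str, float]]) -> float:
    """Average the per-position KL over a whole request."""
    kls = [position_kl(r, q) for r, q in zip(ref_positions, quant_positions)]
    return sum(kls) / len(kls)
```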
Metrics by quantization level (these may not match Meta’s reported numbers exactly due to templating differences; the bolded number is an example of a high-noise MMLU metric)
A few notable takeaways:
At Fireworks, we carefully compare our deployed models to reference models, to the point of finding bugs in Hugging Face implementations. Task-based accuracy metrics like MMLU aren’t sensitive enough to differentiate between quantization types. However, they can still be helpful to run as a sanity check.
We ran the HELM Lite evaluation suite and other tests against Llama 3.1 70B on Fireworks’ and Together AI’s endpoints. With the Llama 3.1 release, Meta published official reference evaluations that include fully formatted prompts, which we reproduced. We see almost no discrepancy between the models across dimensions. To try the official Meta evaluation, check out our reproduction script here.
Even given our strong results on MMLU and other task-based benchmarks, we advise against drawing conclusions about quantization quality from small discrepancies in these benchmarks. Task-based metrics work well for analyzing foundation model quality, but they have poor sensitivity when comparing quantization techniques due to their noisy, all-or-nothing scoring.
Task-based evaluations judge correctness as a step function. Consider a situation where a reference model has a distribution of 0.51/0.49 for the correct answer vs the wrong answer. A quantized model could have a similar distribution of 0.49/0.51 but a task-based evaluation would score the model as if it had a 0/1 distribution.
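To make this concrete, here is a toy numerical illustration (made-up numbers, not data from our evaluation) of how a tiny probability shift flips the task-based score from 1 to 0 while the divergence between the two distributions stays tiny:

```python
import math

ref   = {"correct": 0.51, "wrong": 0.49}  # reference model barely prefers the right answer
quant = {"correct": 0.49, "wrong": 0.51}  # quantized model barely prefers the wrong answer

# Task-based scoring: all-or-nothing on the argmax answer.
score = lambda dist: 1.0 if max(dist, key=dist.get) == "correct" else 0.0
print(score(ref), score(quant))  # 1.0 0.0 -> looks like a large quality drop

# Divergence-based scoring: KL(ref || quant) over the answer distribution.
kl = sum(p * math.log(p / quant[token]) for token, p in ref.items())
print(round(kl, 4))              # ~0.0008 nats -> the two models barely differ
```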
This all-or-nothing approach leads to a meaningful amount of inaccuracy: we sometimes see quantized models score higher on benchmarks than reference models. In the diagram below from Together AI, for example, quantized (Turbo) models’ results on GSM8K and MMLU EM are actually better than those of non-quantized models by several percentage points.
What’s happening is not a magical increase in model quality from quantization, but a reflection of noise throughout the benchmarks. This obviously high level of noise means that takeaways drawn from small differences on benchmarks (especially < 1%) are misguided.
Perplexity
Perplexity is an aggregate metric of how well an LLM predicts a given text distribution. One drawback is that evaluating perplexity on a model’s own generations may produce biased results if the model is “overconfident” in its outputs. Additionally, it suffers from averaging bias. To quote from Accuracy is Not All You Need:
“Our observation that the difference between two models’ output token values cancel out leaving the average metric result unchanged, is applicable to perplexity as well. In particular, since perplexity may be interpreted as the inverse of the geometric mean of token probabilities, lower probabilities for some tokens in the test dataset may be canceled by higher probabilities of other tokens.”
The KL divergence metric described above is closely related to perplexity but addresses both of these drawbacks.
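As a toy illustration of that averaging bias (again with made-up numbers): two models can assign different probabilities to individual tokens of a test set yet end up with identical perplexity, because perplexity depends only on the geometric mean of those probabilities.

```python
import math

# Per-token probabilities that two hypothetical models assign to the same test text.
model_a = [0.5, 0.5, 0.5, 0.5]
model_b = [0.25, 1.0, 0.25, 1.0]  # lower on some tokens, higher on others

perplexity = lambda probs: math.exp(-sum(math.log(p) for p in probs) / len(probs))
print(perplexity(model_a), perplexity(model_b))  # both ~2.0 -> the differences cancel out

# A per-position KL divergence, by contrast, is non-negative at every token,
# so shifts in opposite directions accumulate instead of canceling.
```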
AlpacaEval and LLM-as-a-judge methods
Having humans blind-vote on the preferred answer between two LLM outputs is the gold standard for LLM evaluation. This approach is spearheaded by the LMSYS Chatbot Arena, which has counted more than 1.5M votes to date. Human evaluation is costly, though, so metrics like AlpacaEval have popularized using a powerful LLM (e.g. GPT-4) to pick the preferred answer. However, this approach has limitations.
In practice, we observe quite noisy results from this benchmark, with large swings that have no clear cause. For example, we see cases where quantized models are evaluated as superior to reference models.
Arena-Hard-Auto is a newer benchmark created by the team behind the LMSYS Chatbot Arena itself, and it aims to address these limitations.
We ran Arena-Hard-Auto v0.1 for Llama 3.1 405B Instruct on several quantized model configurations. There’s no clear difference; any gap is dwarfed by the confidence intervals.
Given that quantization has different effects on different use cases, the best judge of quantization quality is ultimately the end developer. We encourage you to try both our quantized models and unquantized models (labeled “-hf”) on your use case. We quantize models for a variety of our enterprise customers to optimize speed, quality, and cost for applications that reach millions of users. Roland Gavrilescu, AI lead at Superhuman, writes:
“We've been thrilled with Fireworks' FP8 models. They've enabled Superhuman to provide a better user experience by enabling low latency and high response quality for Ask AI. The serving efficiency of the deployment enables us to deliver a stellar customer experience.”
Our quantization approach enables Fireworks to deliver industry leading speed and cost. For example, we helped Cursor reach 1000 tokens/sec with their Fast Apply feature. As Sualeh Asif, Anysphere’s co-founder says:
“Fireworks has been an amazing partner getting our Fast Apply and Copilot++ models running performantly. They were a cut above other competitors we tested on performance. We’ve done extensive testing on their quantized model quality for our use cases and have found minimal degradation. Additionally, Fireworks has been a key partner to help us implement task specific speed ups and new architectures, allowing us to achieve bleeding edge performance!”
Balancing speed and quality also allowed us to release Llama 3.1 405B at 10x the cost efficiency of other providers.
To check out the speed, cost efficiency and quality that Fireworks’ quantization expertise affords, try out models on Fireworks’ public platform. If you’re interested in having Fireworks help customize quantization of a model for your enterprise use case, please contact us ([email protected]). At Fireworks, we’re excited to contribute to community discussion, so please reach out with any questions or suggestions for further deep dives.