Partnering with Meta: Bringing Llama 3.2 to Fireworks for Fine-Tuning and Inference
By Fireworks AI Team|9/25/2024
We are excited to announce support for the newest additions to the Llama collection from Meta. With the addition of Llama 3.2, developers gain access to new tools that enable the creation of sophisticated multi-component AI systems that combine models, modalities, and external tools to deliver advanced real-world AI solutions.
The release of Llama 3.2 1B, Llama 3.2 3B, Llama 3.2 11B Vision, and Llama 3.2 90B Vision models brings a range of text-only and multimodal models designed to enhance modular AI workflows. These models provide deep customization, allowing developers to tailor solutions and accelerate specific tasks in compound AI systems.
Get started today on Fireworks:
The instruct variants of these models are available serverless (pay-per-token for models on Fireworks-configured GPUs). Both the instruct and non-instruct variants of these models are available on-demand (private GPU instances billed per GPU second). Meta’s Llama Guard 3 models for detecting violating content are also available on-demand.
The release of multimodal models unlocks exciting new production use cases for developers, from enterprise to everyday applications.
Examples of use cases for Llama 3.2 models on Fireworks include:
See how customers like AlliumAI are supporting multimodal models in production with Fireworks in this blog post.
Llama models, fine-tuned and deployed through Fireworks, offer developers the flexibility to build personalized AI systems tailored to specific needs.
With Fireworks handling the fine-tuning and inference, developers can leverage these powerful tools to accelerate innovation and bring their AI solutions to market faster. For example, Fireworks can serve Llama 3.2 1B at approximately 500 tokens/second and Llama 3.2 3B at approximately 270 tokens/second.
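As a rough back-of-the-envelope check, generation time scales linearly with output length at these serving rates. The tokens/second figures come from the text above; the 1,000-token response length is an illustrative value, and prefill time is ignored:

```python
# Serving rates (tokens/second) quoted in the text above.
RATES = {"llama-v3p2-1b-instruct": 500, "llama-v3p2-3b-instruct": 270}

def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time to stream a response of the given length, ignoring prefill."""
    return output_tokens / tokens_per_second

# Example: a hypothetical 1,000-token response at each quoted rate.
for model, rate in RATES.items():
    print(f"{model}: ~{generation_seconds(1000, rate):.1f}s")
```

At these rates, a 1,000-token response streams in roughly 2 seconds on the 1B model and under 4 seconds on the 3B model.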
There’s no one-size-fits-all approach to developing compound AI systems, which is why Fireworks offers a number of different options for using and deploying models like Llama-3.2 for production AI (including serverless, on-demand, and enterprise reserved).
We’re also happy to announce new, competitive pricing for text and multimodal models. In particular, the Llama 3.2 11B and Llama 3.2 90B multimodal models will be priced the same as the text-only models, with images charged as text tokens. The exact number of tokens depends on the image resolution and model; images for the Llama 3.2 vision models are typically counted as 6,400 text tokens.
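To make the image pricing concrete, here is a hedged cost-estimate sketch. The 6,400-tokens-per-image figure comes from the text above; the per-million-token price and prompt sizes are placeholder values, not Fireworks' actual rates:

```python
IMAGE_TOKENS = 6400  # typical billing count per image for Llama 3.2 vision models

def vision_request_tokens(num_images: int, text_tokens: int) -> int:
    """Total billable tokens: each image is charged as IMAGE_TOKENS text tokens."""
    return num_images * IMAGE_TOKENS + text_tokens

def estimated_cost(total_tokens: int, usd_per_million_tokens: float) -> float:
    """Convert a token count to dollars at a given per-million-token rate."""
    return total_tokens / 1_000_000 * usd_per_million_tokens

# Example: one image plus a 600-token text prompt, at a placeholder rate.
tokens = vision_request_tokens(num_images=1, text_tokens=600)
print(tokens, estimated_cost(tokens, 0.20))
```

Because images dominate the token count, a single image costs roughly the same as a 6,400-token text prompt at whatever text rate applies to the model.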
Fireworks serverless is the easiest way to get started. Serverless offers the new Llama models on pre-configured GPUs, no set-up required.
For heavier volume and fully configurable latency and reliability, Fireworks on-demand provides private GPUs to host Llama 3.2. Developers pay per second for on-demand with no commitments. The efficiency of Fireworks’ software stack enables significant price, throughput and latency improvements compared to running vLLM on private GPUs (see pricing and performance tests).
For high-volume applications, Fireworks offers private, enterprise GPU deployment options that are fully personalized and backed by SLAs and performance guarantees.
Contact us for additional information.
Meta is releasing the Llama 3.2 models under an open, permissive license that makes them ideal for fine-tuning and further model customization.
Today you can fine-tune the Llama 3.2 1B (text-only) and Llama 3.2 3B (text-only) models on Fireworks, with fine-tuning for multimodal models like Llama 3.2 11B Vision and Llama 3.2 90B Vision coming soon.
For more information about fine-tuning, read our documentation here.
With our model playground, you can focus on developing a feel for a model’s behavior, adjusting prompt and parameter values, and then grabbing the code to test our inference APIs.
For example, how would you describe making pancakes from scratch to someone who's learning to make breakfast for the very first time?
Is date night coming up and you need some ideas on the fly? Try out the Llama 3.2 3B chat API.
Writing a paper for an art history class about comedic parodies of famous artworks but need help getting started? Llama 3.2 90B Vision can help out.
To quickly get up and running with Llama 3.2 on Fireworks AI, visit fireworks.ai to sign up for an account. Then pick up your API key from Profile (top right) -> API Keys.
pip install --upgrade fireworks-ai
The code snippet below instantiates the Fireworks client and uses the chat completions API to call Llama 3.2, listed as accounts/fireworks/models/llama-v3p2-3b-instruct.
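The fireworks-ai client installed above wraps Fireworks' OpenAI-compatible chat completions REST API. As a self-contained sketch using only the Python standard library, the same call looks like this; the model ID comes from the text above, while the prompt and max_tokens value are illustrative:

```python
import json
import os
import urllib.request

API_URL = "https://api.fireworks.ai/inference/v1/chat/completions"
MODEL = "accounts/fireworks/models/llama-v3p2-3b-instruct"  # model ID from the text

def build_request(prompt: str, api_key: str) -> urllib.request.Request:
    """Assemble an OpenAI-style chat completions request for Llama 3.2 3B Instruct."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,  # illustrative cap on the response length
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

if __name__ == "__main__":
    # Requires a real key: export FIREWORKS_API_KEY=... (see Profile -> API Keys).
    key = os.environ.get("FIREWORKS_API_KEY", "")
    if key:
        req = build_request("How do I make pancakes from scratch?", key)
        with urllib.request.urlopen(req) as resp:
            reply = json.load(resp)
        print(reply["choices"][0]["message"]["content"])
```

The fireworks-ai Python client exposes the same request shape, so switching between the client and the raw endpoint requires no changes to the messages payload.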
The API request returns a chat completion response; the model's generated text is in the choices[0].message.content field.
Ready to start building with Llama 3.2? Here’s how to get started:
Contact us: Reach out to discuss how we can help you leverage Llama 3.2 and the Fireworks inference engine for your specific use case.