Instruction-tuned image reasoning model with 90B parameters from Meta, optimized for visual recognition, image reasoning, captioning, and answering general questions about an image. The model can understand visual data such as charts and graphs, and bridges the gap between vision and language by generating text that describes image details.
On-demand deployments give you dedicated GPUs for Llama 3.2 90B Vision Instruct using Fireworks' reliable, high-performance system with no rate limits.
Llama 3.2 90B Vision Instruct is a multimodal instruction-tuned model developed by Meta. It combines the Llama 3.1 text model with a vision adapter for image reasoning and supports both image and text input. It is part of the Llama 3.2-Vision model collection.
This model excels in:
- Visual recognition
- Image reasoning
- Image captioning
- Answering general questions about an image
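As a rough sketch of how image-plus-text prompting looks in practice, the snippet below sends a chart image and a question to the model through an OpenAI-compatible chat completions request. The endpoint URL, model slug, and image URL are assumptions for illustration only; consult the Fireworks docs for the exact values.

```python
# Minimal sketch: querying Llama 3.2 90B Vision Instruct via an
# OpenAI-compatible chat completions request. Endpoint and model slug
# below are assumptions; verify them against the Fireworks model page.
import os
import requests

API_URL = "https://api.fireworks.ai/inference/v1/chat/completions"  # assumed endpoint
MODEL = "accounts/fireworks/models/llama-v3p2-90b-vision-instruct"   # assumed model slug

payload = {
    "model": MODEL,
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the trend shown in this chart."},
                {
                    "type": "image_url",
                    # Placeholder image URL for illustration.
                    "image_url": {"url": "https://example.com/chart.png"},
                },
            ],
        }
    ],
    "max_tokens": 512,
}

response = requests.post(
    API_URL,
    headers={
        "Authorization": f"Bearer {os.environ['FIREWORKS_API_KEY']}",
        "Content-Type": "application/json",
    },
    json=payload,
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```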
It supports a maximum context length of 131,072 tokens on Fireworks.
Fireworks supports the full 131K token window on dedicated GPU deployments (on-demand).
The model has 88.6 billion parameters.
LoRA fine-tuning is supported through Fireworks' serverless interface; full (non-LoRA) fine-tuning is not available.
Token usage is counted across input and output combined, up to the 131K context limit.
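Because input and output share a single window, a request's output budget is whatever the prompt leaves over. A minimal sketch of that arithmetic, using a hypothetical prompt size:

```python
# Minimal sketch of budgeting output tokens against the shared 131,072-token
# window (input and output counted together). The prompt token count here is
# hypothetical; in practice it comes from the API's usage report or a tokenizer.
CONTEXT_LIMIT = 131_072

prompt_tokens = 4_200    # hypothetical size of the image + text prompt
desired_output = 8_000   # how many tokens we'd like the model to generate

max_tokens = min(desired_output, CONTEXT_LIMIT - prompt_tokens)
print(f"Request max_tokens={max_tokens} to stay within the {CONTEXT_LIMIT}-token window")
```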
The model is governed by the Llama 3.2 Community License, a custom commercial license that allows research and commercial use under specified terms.