

Llama 3.2 90B Vision Instruct

fireworks/llama-v3p2-90b-vision-instruct

    Instruction-tuned image reasoning model with 90B parameters from Meta, optimized for visual recognition, image reasoning, captioning, and answering general questions about an image. The model can understand visual data such as charts and graphs, and bridges the gap between vision and language by generating text that describes image details.
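    A minimal sketch of calling the model with an image through Fireworks' OpenAI-compatible chat completions endpoint, assuming the `openai` Python SDK, a FIREWORKS_API_KEY environment variable, and a placeholder image URL:

```python
import os
from openai import OpenAI

# Fireworks exposes an OpenAI-compatible API, so the standard SDK works
# by pointing base_url at the Fireworks inference endpoint.
client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key=os.environ["FIREWORKS_API_KEY"],
)

response = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p2-90b-vision-instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this chart and summarize its main trend."},
                # Placeholder URL; replace with a publicly reachable image.
                {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
            ],
        }
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```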

    Llama 3.2 90B Vision Instruct API Features

    On-demand Deployment


    On-demand deployments give you dedicated GPUs for Llama 3.2 90B Vision Instruct using Fireworks' reliable, high-performance system with no rate limits.
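    Once a deployment is live, requests can target it by appending the deployment identifier to the model name. A sketch reusing the client from the example above; the `model#deployment` addressing form and the account/deployment IDs are assumptions to adapt to your setup:

```python
# Hypothetical deployment identifier; the IDs below are placeholders to
# replace with your own account and deployment.
DEPLOYED_MODEL = (
    "accounts/fireworks/models/llama-v3p2-90b-vision-instruct"
    "#accounts/your-account/deployments/your-deployment-id"
)

response = client.chat.completions.create(
    model=DEPLOYED_MODEL,
    messages=[{"role": "user", "content": "Hello from a dedicated deployment."}],
)
print(response.choices[0].message.content)
```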

    Llama 3.2 90B Vision Instruct FAQs

    What is Llama 3.2 90B Vision Instruct and who developed it?

    Llama 3.2 90B Vision Instruct is a multimodal instruction-tuned model developed by Meta. It combines the Llama 3.1 text model with a vision adapter for image reasoning and supports both image and text input. It is part of the Llama 3.2-Vision model collection.

    What applications and use cases does Llama 3.2 90B Vision Instruct excel at?

    This model excels in:

    • Visual Question Answering (VQA)
    • Chart, diagram, and infographic analysis
    • Document understanding (DocVQA); see the sketch after this list
    • Image captioning and image-text retrieval
    • Visual grounding and reasoning tasks
    • Agentic systems that process image + text input
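    As an illustration of the document-understanding use case, a sketch that sends a local scan as a base64 data URI, reusing the client from the first example; the file path and question are placeholders:

```python
import base64

# Encode a local document image as a data URI (placeholder path).
with open("invoice.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p2-90b-vision-instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is the total amount due on this invoice?"},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```
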
    What is the maximum context length for Llama 3.2 90B Vision Instruct?

    It supports a maximum context length of 131,072 tokens on Fireworks.

    What is the usable context window for Llama 3.2 90B Vision Instruct?

    Fireworks supports the full 131K token window on dedicated GPU deployments (on-demand).

    What are known failure modes of Llama 3.2 90B Vision Instruct?

    • No support for structured function calling or tool use
    • May hallucinate or show reduced accuracy on adversarial visual prompts
    • Supports only English for image-text tasks
    • Ethical risks include child safety, cyber attacks, and content misuse; mitigated with fine-tuning and Llama Guard

    How many parameters does Llama 3.2 90B Vision Instruct have?

    The model has 88.6 billion parameters.

    Is fine-tuning supported for Llama 3.2 90B Vision Instruct?

    Fine-tuning is supported via LoRA, with adapters served through Fireworks' serverless LoRA interface; standard full-weight fine-tuning is not available.
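    Once a LoRA job finishes, the resulting model is queried by its own model ID, just like the base model. A sketch with hypothetical account and model names, reusing the client from the first example:

```python
# Hypothetical IDs; substitute your account and fine-tuned model name.
response = client.chat.completions.create(
    model="accounts/your-account/models/your-lora-finetune",
    messages=[{"role": "user", "content": "Test prompt for the fine-tuned adapter."}],
)
print(response.choices[0].message.content)
```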

    How are tokens counted (prompt vs completion)?

    Tokens are counted across combined input and output, up to the 131K limit.
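    The counts are reported in the response's usage field, which is the practical way to track consumption against the 131K limit; a sketch reusing the client from the earlier examples:

```python
response = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p2-90b-vision-instruct",
    messages=[{"role": "user", "content": "Summarize transformers in one sentence."}],
)
usage = response.usage
print(usage.prompt_tokens, usage.completion_tokens, usage.total_tokens)
```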

    What rate limits apply on the shared endpoint?

    • Serverless: Not supported
    • On-demand: Available with no rate limits on dedicated infrastructure

    What license governs commercial use of Llama 3.2 90B Vision Instruct?

    The model is governed by the Llama 3.2 Community License, a custom commercial license that allows research and commercial use under specified terms.

    Metadata

    State: Ready
    Created on: 9/23/2024
    Kind: Base model
    Provider: Meta
    Hugging Face: Llama-3.2-90B-Vision-Instruct

    Specification

    Calibrated: No
    Mixture-of-Experts: No
    Parameters: 88.6B

    Supported Functionality

    Fine-tuning: Not supported
    Serverless: Not supported
    Serverless LoRA: Supported
    Context Length: 131.1k tokens
    Function Calling: Not supported
    Embeddings: Not supported
    Rerankers: Not supported
    Image input: Supported