New to Fireworks? Start with the Serverless Quickstart to see a vision model example, then return here for more details.
Vision-language models (VLMs) process both text and images in a single request, enabling image captioning, visual question answering, document analysis, chart interpretation, OCR, and content moderation. Use VLMs via serverless inference or dedicated deployments. Browse available vision models →

Chat Completions API

Provide images via URL or base64 encoding:
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("FIREWORKS_API_KEY"),
    base_url="https://api.fireworks.ai/inference/v1"
)

response = client.chat.completions.create(
    model="accounts/fireworks/models/qwen2p5-vl-32b-instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Can you describe this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://images.unsplash.com/photo-1582538885592-e70a5d7ab3d3?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=1770&q=80"
                    }
                }
            ]
        }
    ]
)

print(response.choices[0].message.content)
Instead of a URL, you can provide a base64-encoded image as a data URI prefixed with its MIME type:
import os
import base64
from openai import OpenAI

# Helper function to encode the image
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

# Encode your image
image_base64 = encode_image("your_image.jpg")

client = OpenAI(
    api_key=os.environ.get("FIREWORKS_API_KEY"),
    base_url="https://api.fireworks.ai/inference/v1"
)

response = client.chat.completions.create(
    model="accounts/fireworks/models/qwen2p5-vl-32b-instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Can you describe this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{image_base64}"
                    }
                }
            ]
        }
    ]
)

print(response.choices[0].message.content)

Working with images

Vision-language models support prompt caching, which improves performance for requests with repeated content. Both the text and image portions of a prompt can be cached, reducing time to first token by up to 80%. Tips for optimal performance:
  • Use URLs for long conversations – Referencing images by URL avoids re-sending large base64 payloads on every turn, reducing latency
  • Downsize images – Smaller images use fewer tokens and process faster
  • Structure prompts for caching – Place static instructions at the beginning and variable content at the end (see the sketch after this list)
  • Include metadata in prompts – Add context about the image directly in your text prompt
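
As a minimal sketch of a cache-friendly request structure (the system instructions and helper function here are illustrative, not part of the API), keep the static system prompt and any reusable image first so the cached prefix stays identical across calls, and append only the per-request question at the end:

# Static prefix (identical across requests) comes first so it can be cached;
# only the final user question varies per call.
STATIC_INSTRUCTIONS = "You are a meticulous image analyst. Answer concisely."  # illustrative

def ask_about_image(client, image_url: str, question: str):
    # Reuses an OpenAI client configured for Fireworks, as in the examples above.
    return client.chat.completions.create(
        model="accounts/fireworks/models/qwen2p5-vl-32b-instruct",
        messages=[
            {"role": "system", "content": STATIC_INSTRUCTIONS},  # static: cacheable
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},  # repeated image: cacheable
                    {"type": "text", "text": question},  # variable content goes last
                ],
            },
        ],
    )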

Advanced capabilities

Alternative query methods

For the Completions API, manually insert the image token <image> in your prompt and supply images as an ordered list:
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("FIREWORKS_API_KEY"),
    base_url="https://api.fireworks.ai/inference/v1"
)

response = client.completions.create(
    model="accounts/fireworks/models/qwen2p5-vl-32b-instruct",
    prompt="SYSTEM: Hello\n\nUSER:<image>\ntell me about the image\n\nASSISTANT:",
    extra_body={
        "images": ["https://images.unsplash.com/photo-1582538885592-e70a5d7ab3d3?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=1770&q=80"]
    }
)

print(response.choices[0].text)
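
Since the images are supplied as an ordered list, a reasonable assumption is that each <image> token in the prompt is matched, in order, to the corresponding list entry. A minimal sketch under that assumption (the URLs are placeholders), reusing the client from the example above:

# Assumption: the Nth <image> token in the prompt maps to the Nth entry in "images".
response = client.completions.create(
    model="accounts/fireworks/models/qwen2p5-vl-32b-instruct",
    prompt=(
        "SYSTEM: Compare the two images.\n\n"
        "USER:<image>\n<image>\nWhat differs between the first and second image?\n\n"
        "ASSISTANT:"
    ),
    extra_body={
        "images": [
            "https://example.com/first.jpg",   # placeholder URL
            "https://example.com/second.jpg",  # placeholder URL
        ]
    },
)

print(response.choices[0].text)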

Known limitations

  1. Maximum images per request: 30, regardless of format (base64 or URL)
  2. Base64 size limit: Total base64-encoded images must be less than 10MB
  3. URL size and timeout: Each image URL must be smaller than 5MB and download within 1.5 seconds
  4. Supported formats: .png, .jpg, .jpeg, .gif, .bmp, .tiff, .ppm
  5. Llama 3.2 Vision models: Pass images before text in the content field to avoid refusals (temporary limitation; see the sketch after this list)
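
A minimal sketch of the Llama 3.2 Vision workaround: place the image_url part ahead of the text part in the content array. The model name below is illustrative; substitute the Llama 3.2 Vision model you actually deploy. The client is configured as in the examples above.

# Workaround for Llama 3.2 Vision models: image part first, text part second.
response = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p2-11b-vision-instruct",  # illustrative model name
    messages=[
        {
            "role": "user",
            "content": [
                {  # image first, to avoid refusals
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/photo.jpg"},  # placeholder URL
                },
                {"type": "text", "text": "Can you describe this image?"},  # text second
            ],
        }
    ],
)

print(response.choices[0].message.content)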