Qwen2.5-VL is a multimodal large language model series developed by the Qwen team at Alibaba Cloud, available in 3B, 7B, 32B, and 72B sizes.
Fine-tuning: Qwen2.5-VL 3B Instruct can be customized with your data to improve responses. Fireworks uses LoRA to efficiently train and deploy your personalized model.
On-demand deployment: On-demand deployments give you dedicated GPUs for Qwen2.5-VL 3B Instruct using Fireworks' reliable, high-performance system with no rate limits.
Qwen2.5-VL 3B Instruct is a 4.1 billion parameter instruction-tuned multimodal model developed by the Qwen team at Alibaba Cloud. It is part of the Qwen2.5-VL series and supports image-text understanding, structured vision outputs, and tool-using capabilities for agentic tasks.
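Since Fireworks serves models through an OpenAI-compatible chat-completions API, an image-plus-text request can be sketched as a payload like the one below. The model id string and parameter values here are illustrative assumptions, not confirmed Fireworks identifiers.

```python
# Sketch of a multimodal chat-completions payload in the OpenAI-compatible
# format. MODEL_ID is an assumed identifier for illustration only.
MODEL_ID = "accounts/fireworks/models/qwen2p5-vl-3b-instruct"  # assumption

def build_vision_request(prompt: str, image_url: str) -> dict:
    """Build a chat payload pairing a text prompt with an image URL."""
    return {
        "model": MODEL_ID,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
        "max_tokens": 512,
    }

payload = build_vision_request("Describe this chart.",
                               "https://example.com/chart.png")
```

The payload would then be POSTed to the chat-completions endpoint with your API key; only the structure is shown here.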
Qwen2.5-VL 3B Instruct is designed for image-text understanding, structured vision outputs, and agentic tool use.
The maximum context length for Qwen2.5-VL 3B is 128,000 tokens.
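A practical consequence of the 128,000-token window is that prompt tokens and requested completion tokens must fit inside it together. A minimal sketch of that budget check, with illustrative token counts (a real tokenizer is needed for exact numbers):

```python
# Check whether a prompt plus requested completion fits in the
# 128,000-token context window of Qwen2.5-VL 3B.
CONTEXT_LIMIT = 128_000

def fits_in_context(prompt_tokens: int, max_new_tokens: int) -> bool:
    """True if the combined token budget stays within the window."""
    return prompt_tokens + max_new_tokens <= CONTEXT_LIMIT

print(fits_in_context(120_000, 4_000))   # 124k total: True
print(fits_in_context(126_000, 4_000))   # 130k total: False
```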
Yes. Over 60 quantized versions of the model are available, including 4-bit and 8-bit variants.
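To see why the 4-bit and 8-bit variants matter, a back-of-the-envelope weight-memory estimate for the 4.1-billion-parameter model can be sketched as below. It ignores activation memory, the KV cache, and quantization-group overhead, so these are lower bounds rather than exact footprints.

```python
# Rough weight-memory footprint of a 4.1B-parameter model at
# different quantization widths (weights only, no KV cache).
PARAMS = 4.1e9

def weight_memory_gb(bits_per_param: float) -> float:
    """Gigabytes needed to store the weights at a given bit width."""
    return PARAMS * bits_per_param / 8 / 1e9

print(f"fp16:  {weight_memory_gb(16):.1f} GB")  # ~8.2 GB
print(f"8-bit: {weight_memory_gb(8):.1f} GB")   # ~4.1 GB
print(f"4-bit: {weight_memory_gb(4):.2f} GB")   # ~2.05 GB
```

Halving the bit width halves the weight footprint, which is what makes the 4-bit variants attractive for smaller GPUs.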
The model performs well on benchmarks such as DocVQA (93.9), InfoVQA (77.1), and MathVista (62.3), but slightly underperforms larger Qwen2 variants on MMBench and MMStar. Enabling extended-context features such as YaRN may degrade temporal and spatial localization.
No, streaming responses and function calling are not supported.
The model has 4.1 billion parameters.
Yes. Fireworks supports LoRA fine-tuning for this model.
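What makes LoRA fine-tuning efficient is that, instead of updating a full d_out x d_in weight matrix, it trains two low-rank factors of shapes d_out x r and r x d_in. A small sketch of the trainable-parameter arithmetic, using illustrative dimensions rather than the model's actual layer shapes:

```python
# LoRA replaces a full weight update with two low-rank factors:
# (d_out x r) and (r x d_in), so only r * (d_in + d_out) params train.
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters of a rank-`rank` LoRA adapter on one matrix."""
    return rank * (d_in + d_out)

full = 2048 * 2048                            # full update: 4,194,304 params
lora = lora_trainable_params(2048, 2048, 8)   # rank-8 adapter: 32,768 params
print(f"LoRA trains {lora / full:.2%} of the full matrix")  # ~0.78%
```

This is why a LoRA adapter for a multi-billion-parameter model can be trained and hot-swapped cheaply at serving time.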
The model is licensed under the Qianwen License (similar to MIT), which permits commercial use and redistribution.