
Exciting news for vision AI! Fireworks is proud to offer Day-0 support for the highly anticipated NVIDIA Nemotron Nano2 VL, a 12B multimodal reasoning model for accelerating your document intelligence and video understanding applications.
NVIDIA Nemotron Nano2 VL, the latest innovation in the NVIDIA Nemotron family, is a vision language model (VLM) designed to push the boundaries of intelligent document processing, AI assistant video understanding, video captioning, multi-modal agentic workflows, and more. It enables AI assistants to extract, interpret, and act on information across text, images, tables, and video. VLMs are built by combining an LLM with a vision encoder, enabling the LLM with eyes. VLMs often require a more complex architecture to integrate across multiple modalities. With Fireworks' Multimedia, developers can effortlessly unlock insights across various modalities from VLMs like NVIDIA Nemotron Nano2 VL, bypassing the complexities of unstructured multi-domain workflows. In an invoice analysis example, Nemotron Nano2 VL surpassed over 90% accuracy, resulting in a high quality output. In the scenario, the Nemotron Nano2 VL model automated the entire process of data extraction, classification, and summarization, eliminating the need for tedious, time-consuming manual evaluation.
With its high accuracy, compact model footprint, and multimodal capabilities, Nemotron Nano 2 VL seamlessly extracts and comprehends information from complex documents, videos and images. For intelligent document assistants, this includes diverse inputs from text-base documents to images, charts and graphs, making it an ideal solution for automating document workflows across industries like finance, healthcare, legal, and government. In multi-image processing, it assists with tasks such as captioning and content curation, making it ideal for product catalog parsing and image search. Lastly it is optimal for multimodal and agentic pipelines that may need image aware retrieval and tool use.
The efficient model combines the optimized hybrid Mamba-Transformer architecture from the Nemotron family for the LLM with a Vision Encoder based on CRADIOH-V2, and an efficient video search token compression model.
It provides three key features:
Check out the figure below showcasing a variety of benchmarks from NVIDIA on the Nemotron Nano2 model.

We're excited to announce that the latest Nemotron Nano VL model is now available on Fireworks! We've prepared a comprehensive cookbook to help you explore its performance. The model is optimized for things like OCR(optical character recognition) in document processing. Unlike older OCRs systems that simply convert an image to text, a VLM understands the semantic context and spatial relationships between elements. In this process the model will analyze the image and distinguish between the characters and the background. The model used the pattern found in the image to identify the characters and convert them into machine readable text. This example demonstrated NVIDIA Nemotron Nano 2 VL on Fireworks AI being used for invoice processing and document intelligence. The main task was to extract invoice numbers, dates, line items, and totals. The table below shows our test result success rates for parsing the invoices. With NVIDIA Nemotron we were able to achieve overall quality rates in the 90s.
| Test | Quality Rate |
|---|---|
| Invoice Number | 100% |
| Date | 100% |
| Item Count | 100% |
| Total Amount | 63.2% |
| Overall | 90.8% (19/20 Successful Extractions) |
Other use cases that could benefit from a scenario similar to invoice processing include account payable automation, expense management and receipt processing, financial document digitization, and compliance and audit workflows.
Next Steps:
Ready to deploy this solution in production? Try:
VLMs like Nemotron can drastically outperform manual document analysis by offering superior speed, accuracy, and scale, as they holistically understand both the text and the context of these complex documents. This automation eliminates human error, significantly lowering operational costs, and freeing staff to focus on more strategic work.
Automate your document intelligence and multi-image processing workflows today by deploying Nemotron Nano 2 VL on Fireworks AI. For further questions, reach out on Discord or via [email protected].