GLM 5.2 is live! Opus-level intelligence at open-source rates. Pay per token on serverless. Try it today.

Model Library
/NVIDIA/NVIDIA Nemotron 3 Ultra NVFP4
NVIDIA icon

NVIDIA Nemotron 3 Ultra NVFP4

Ready
model path:accounts/fireworks/models/nemotron-3-ultra-nvfp4

Nemotron-3-Ultra-550B-A55B-NVFP4 is a frontier-scale large language model (LLM) trained by NVIDIA, designed to deliver strong agentic, reasoning, and conversational capabilities. It is optimized for the most demanding workloads, including complex multi-step agents, long-context analysis, and high-accuracy reasoning over code, math, and science. The model employs a hybrid Latent Mixture-of-Experts (LatentMoE) architecture, utilizing interleaved Mamba-2 and MoE layers, along with select Attention layers. Like the Super model, the Ultra model incorporates Multi-Token Prediction (MTP) layers for faster text generation and improved quality, and it is trained using an NVFP4 pre-training recipe to maximize compute efficiency. The model has 55B active parameters and 550B parameters in total.

NVIDIA Nemotron 3 Ultra NVFP4 API Features

Serverless

Docs

NVIDIA Nemotron 3 Ultra NVFP4 is available via Fireworks' serverless API, where you pay per token. There are several ways to call the Fireworks API, including Fireworks' Python client, the REST API, or OpenAI's Python client.

On-demand Deployment

Docs

On-demand deployments allow you to use NVIDIA Nemotron 3 Ultra NVFP4 on dedicated GPUs with Fireworks' high-performance serving stack with high reliability and no rate limits.

Available Serverless

Run queries immediately, pay only for usage

$0.60 / $0.12 / $2.40
Per 1M Tokens (input/cached input/output)

Metadata

State
Ready
Created on
6/2/2026
Kind
Base model
Provider
NVIDIA

Specification

Calibrated
No
Mixture-of-Experts
Yes
Parameters
549B

Supported Functionality

Fine-tuning
Not supported
Serverless
Supported
Context Length
262k tokens
Function Calling
Supported
Embeddings
Not supported
Rerankers
Not supported
Support image input
Not supported