Fireworks Blog

3D FireOptimizer: Automating the Multi-Dimensional Tradeoffs in LLM Serving

Introducing Supervised Fine Tuning V2

6/13/2025
Vision Model Platform Updates: Enhanced Capabilities and New Features

6/12/2025
Building AI agents with the Fireworks Experimentation Platform (GA) and Build SDK (Beta)

6/11/2025
Build customizable, real-time voice agents with Fireworks Voice Agent Platform (Beta)

6/10/2025
Reinforcement Fine Tuning (Beta): Train expert open models to surpass closed frontier models

6/9/2025
Building a High-Quality Synthetic Data Pipeline for Supervised Fine-Tuning

6/4/2025
Fireworks DevDay 2025 Wrapped

5/29/2025
FireAttention V4: Industry-Leading Latency and Cost Efficiency with FP4

Independent benchmarking of Fireworks shows >250 tokens/second on DeepSeek V3

5/28/2025
Building an open-source Browser Agent on Fireworks AI

5/21/2025
Fireworks Summer Audio Updates: Fastest Transcription now with Diarization and Batch API

5/20/2025
Agentic AI Systems

5/19/2025
Supervised Fine-Tuning (SFT) with LoRA on Fireworks AI: Tutorial

5/12/2025
Qwen 3 on Fireworks AI: Controllable Chain-of-Thought and Tool Calling at Frontier Scale

5/6/2025
Optimizing Llama 4 Maverick on Fireworks AI

4/28/2025
Building Enterprise-Scale RAG Systems with Fireworks AI and MongoDB Atlas

4/9/2025
Fireworks AI Now Supports NVIDIA NIM Deployments for Blazing AI Inference

3/18/2025
Faster, more efficient DeepSeek on the Fireworks AI Developer Cloud

3/18/2025
Fine-Tuning DeepSeek v3 & R1 to optimize quality, latency, & cost

3/12/2025
Enabling Function Calling in DeepSeek v3: Bridging the Gap Between Text and Action

2/14/2025
DeepSeek v3 and R1 Model Architecture: Why it's powerful and economical

2/7/2025
DeepSeek R1 Just Got Eyes with Fireworks AI Document Inlining

2/5/2025
From text to task: Constrained generation for structured extraction in R1

2/1/2025
Distillation with Reasoning: Can DeepSeek R1 Teach Better Than Humans?

1/31/2025
Mistral Small 3 Now Available on Fireworks: Faster, Lighter, and More Efficient

1/30/2025
Beyond Supervised Fine Tuning: How Reinforcement Learning Empowers AI with Minimal Labels

1/27/2025
DeepSeek R1: All you need to know 🐳

1/24/2025
Fireworks Streaming Transcription: 300ms with Whisper-v3-large-quality

1/23/2025
Real-time, performant code assistance: How Sourcegraph scaled with Fireworks AI

1/22/2025
Document inlining: Crossing the modality gap with Compound AI

12/23/2024
DeepSeek V3 just got vision capabilities!

12/18/2024
20x faster Whisper than OpenAI - Fireworks audio transcribes 1 hour in 4 seconds

12/9/2024
How Cresta drives millions of real-time, AI-powered contact center interactions with Fireworks

12/8/2024
Fireworks f1: A breakthrough in complex reasoning with Compound AI

11/15/2024
How Upwork and Fireworks deliver faster, smarter proposals for freelancers

11/11/2024
FLUX.1 on Fireworks: Fast, frugal, and flexible

10/22/2024
FireAttention V3: Enabling AMD as a viable alternative for GPU inference

10/15/2024
Three projects, one platform: A developer's winning streak with Fireworks AI

10/14/2024
Partnering with Meta: Bringing Llama 3.2 to Fireworks for Fine-Tuning and Inference

9/25/2024
How Enterprises are using Multimodal Models in production with Fireworks

9/25/2024
Multi-LoRA: Personalize AI at scale and deliver the best experience for each customer and use case, with 100x cost-efficiency

9/18/2024
FireOptimizer: Customizing latency and quality for your production inference workload

8/30/2024
Build Your Own Flight Recommendation System using FastAPI, SerpAPI, and Firefunction

8/29/2024
Building a RAG with Astro, FastAPI, SurrealDB and Llama 3.1

8/14/2024
How Fireworks evaluates quantization precisely and interpretably

8/1/2024
Introducing Llama 3.1 inference endpoints in partnership with Meta

7/23/2024
Fireworks AI Raises $52M Series B to Lead Industry Shift to Compound AI Systems

7/11/2024
How Cursor built Fast Apply using the Speculative Decoding API

6/23/2024
FireAttention V2: 12x faster to make Long Contexts practical for Online Inference

6/20/2024
Firefunction-v2: Function calling capability on par with GPT4o at 2.5x the speed and 10% of the cost

6/17/2024
Announcing custom models and on-demand H100s with 50%+ lower costs and latency than vLLM

6/3/2024
GPUs on-demand: Not serverless, not reserved, but some third thing

6/3/2024
Code Generation with Large Language Models - Fireworks AI Take

5/8/2024
Doomed to Code: How we Teamed Up with Fireworks AI at MistralAI Hackathon to Conquer the Shores of Hell

5/6/2024
Partnering with Meta to bring Llama 3 to Fireworks’ inference and fine-tuning

4/18/2024
Getting Started with Stability’s API Powered by Fireworks

4/17/2024
Optimizing Retrieval Augmented Generation (RAG) with MongoDB Atlas and Fireworks AI

3/21/2024
Fireworks launches fine-tuning service - Rapidly iterate on quality and scale to production through Fireworks inference

3/8/2024
Fireworks Platform Spring 2024 Updates

3/1/2024
FireFunction V1 - Fireworks’ GPT-4-level function calling model - 4x faster than GPT-4 and open weights

2/20/2024
Why do all LLMs need structured output modes?

2/20/2024
FireLLaVA: the first commercially permissive OSS LLaVA model

1/18/2024
FireAttention — Serving Open Source Models 4x faster than vLLM by quantizing with ~no tradeoffs

1/8/2024
Fireworks Raises the Quality Bar with Function Calling Model and API Release

12/20/2023
Mixtral 8x7B on Fireworks: faster, cheaper, even before the official release

12/14/2023
LLM Inference Performance Benchmarking (Part 1)

11/3/2023
New in Fireworks: Image-to-Image and ControlNet support for SSD-1B and SDXL!

11/2/2023
Fireworks.ai Achieves SOC 2 Type II and HIPAA Compliance

10/27/2023
Accelerating Code Completion with Fireworks Fast LLM Inference

10/11/2023
Fireworks.ai Now Available on LangChain Prompt Playground

10/2/2023
Simplifying Code Infilling with Code Llama and Fireworks.ai

9/12/2023
Speed, Python: Pick Two. How CUDA Graphs Enable Fast Python Code for Deep Learning

8/29/2023
Fireworks.ai: Fast, Affordable, Customizable Gen AI Platform

8/17/2023
Multi-Query Attention is All You Need

7/12/2023