Serverless 2.0: Three Ways to Run Inference, One API

Fig. 1 Three synchronous serving paths: Standard, Priority, and Fast. They share one API surface and one fleet. Use Standard by default, Priority for stronger admission during congestion, and Fast for higher generated-token throughput.

Three ways to run serverless inference

Shared serverless is the easiest way to run production AI: send requests, pay per token, let the platform handle the fleet. Until now it came in one flavor, and the only upgrade path was a dedicated deployment with reserved GPUs.

Serverless 2.0 adds two more ways to run, both on the same API surface and neither requiring reserved capacity. Each request carries a serving intent:

Intent	Path
"Run this normally."	Standard
"Admit this ahead of normal traffic during congestion."	Priority
"Generate tokens faster."	Fast

Serving intent is a way to reason about the controls below. It isn't a parameter you set. Same API. No capacity reservation. Pay per token.

Standard

The default. Elastic shared infrastructure. First to be queued or rejected under high load.

Priority

Stronger admission during congestion. Standard sheds first; Priority sheds last.

Fast

High-throughput serving path. Same model weights, roughly 2× Standard throughput.

Standard: the default

Elastic, cost-efficient, and already used by every existing API call. Under high platform load, Standard requests are the first to be queued or rejected, which makes it a good fit for development, prototyping, and production workloads that tolerate occasional throttling with retries. No code changes required.

Priority: the reliability path

Under load, the shared fleet sheds some requests to protect overall throughput. Standard traffic is shed first; Priority is shed last. That's the whole idea: Priority gives production workloads stronger admission behavior during congestion, reducing the chance that your request gets rejected when the fleet is saturated.
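To make the shedding order concrete, here is a toy admission model. It only illustrates priority-ordered load shedding and is not a description of the Fireworks scheduler; the `admit` function, the capacity number, and the request tuples are all hypothetical.

```python
def admit(requests, capacity):
    """Toy priority-ordered load shedding.

    requests: list of (request_id, tier), tier is "standard" or "priority".
    capacity: how many requests the fleet can take right now.
    Returns (admitted_ids, shed_ids).
    """
    # Stable sort: Priority requests first, arrival order kept within a tier.
    ordered = sorted(requests, key=lambda r: r[1] != "priority")
    admitted = [rid for rid, _ in ordered[:capacity]]
    shed = [rid for rid, _ in ordered[capacity:]]
    return admitted, shed


burst = [("a", "standard"), ("b", "priority"), ("c", "standard"), ("d", "priority")]
admitted, shed = admit(burst, capacity=3)
print(admitted)  # ['b', 'd', 'a']
print(shed)      # ['c']
```

With `capacity=1` the same burst sheds a Priority request too, which matches the caveat that Priority lowers the chance of rejection without removing it.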

503 Service Overloaded rate · Past 14 days · as of 05/26/2026

Path	503 Service Overloaded rate
Standard	0.082% of requests shed across the shared fleet under congestion.
Priority	No Priority requests shed over the same window.

Measured across all production traffic on the shared fleet. Not a guaranteed rate. Priority does not reserve GPUs and does not eliminate overload in all conditions.

In peak-load testing, Priority saw a 0% 503 Service Overloaded rate vs. 0.082% on Standard.

Priority does not reserve GPUs and does not eliminate overload entirely. Use it when a workload can tolerate shared infrastructure but not the overload behavior of default serverless. Activate it with one field, service_tier: "priority", with no contract and no reserved capacity. Priority is priced at approximately 1.5× Standard and works with both the OpenAI- and Anthropic-compatible APIs.

Fast: the speed path

Fast is a high-throughput serving path for supported models. It uses the same base model weights and API contract, served through an optimized path for faster generated-token throughput. It is not a smarter model or a different reasoning tier, just the same model family, optimized for speed.

Use Fast when wall-clock generation time is the bottleneck: agent loops, coding workflows, interactive applications. Supported Fast models currently target 100+ generated tokens per second; results vary by model, prompt shape, output length, and concurrency.
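A quick back-of-the-envelope calculation shows why generation rate dominates in sequential agent loops. The sketch below assumes a hypothetical loop of 12 steps with 500 generated tokens each, uses the 100 tokens per second target quoted above, and compares it against an assumed 50 tokens per second rate chosen purely for illustration. It ignores queueing and time to first token.

```python
def generation_seconds(output_tokens, tokens_per_second):
    """Time spent generating output, ignoring queueing and time to first token."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


# Hypothetical agent loop: 12 sequential steps of 500 generated tokens each.
steps, tokens_per_step = 12, 500
fast = steps * generation_seconds(tokens_per_step, 100)  # Fast target from this post
slow = steps * generation_seconds(tokens_per_step, 50)   # assumed comparison rate
print(fast, slow)  # 60.0 120.0
```

Because each step waits for the previous one, the saving scales with the number of steps, which is why agent and coding workflows benefit most.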

"Kimi K2.5 Turbo is hilariously fast on @FireworksAI_HQ." — DHH · @dhh on X

Pick the right control for the workload

Priority and Fast solve different problems. Priority changes how your request is admitted during congestion; Fast changes the path used to generate tokens. They are not stackable on a single request, so pick the one the workload actually needs:

# Default serverless
model="accounts/fireworks/models/kimi-k2p6"

# Pick one: stronger admission during congestion
model="accounts/fireworks/models/kimi-k2p6"
service_tier="priority"

# Or: faster generated-token throughput
model="accounts/fireworks/routers/kimi-k2p6-turbo"

That is the core idea of Serverless 2.0: you should not need a dedicated deployment just to express what kind of serving behavior a request needs.
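One way to keep that choice in a single place is a small helper that returns the request arguments for a path. This is a minimal sketch: the model IDs are the ones used in this post, while `request_kwargs` and the path names passed to it are hypothetical conveniences, not part of the API.

```python
# Model IDs are the ones used in this post; the helper itself is hypothetical.
BASE_MODEL = "accounts/fireworks/models/kimi-k2p6"
FAST_MODEL = "accounts/fireworks/routers/kimi-k2p6-turbo"


def request_kwargs(path="standard"):
    """Return the keyword arguments that select a serving path."""
    if path == "standard":
        return {"model": BASE_MODEL}
    if path == "priority":
        return {"model": BASE_MODEL, "service_tier": "priority"}
    if path == "fast":
        # Fast is chosen by model ID; service_tier is deliberately left unset.
        return {"model": FAST_MODEL}
    raise ValueError(f"unknown serving path: {path!r}")


print(request_kwargs("priority"))
```

The result can be splatted into an SDK call, for example `client.chat.completions.create(messages=..., **request_kwargs("fast"))`. Returning exactly one of the two controls per path keeps Priority and Fast from being combined by accident.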

The 429 that did too much

The new paths give you control. The other half of Serverless 2.0 gives you clarity, because for a long time, one error code was hiding two completely different problems.

Shared does not mean infinite. When traffic across the shared fleet exceeds available capacity, the system protects overall throughput by rejecting some requests instead of letting every request wait in an unbounded queue. That rejection is load shedding, and it is not the same thing as a rate limit.

A rate limit means your account exceeded its configured request or token limits. The fix is on your side: send less traffic, or raise your limit. Load shedding means the shared fleet is temporarily saturated. The fix is to retry, not to treat it like an account-limit problem.

Until now, Fireworks exposed both cases through the same HTTP 429, so two unrelated problems looked identical. You could not reliably tell whether you needed to change traffic patterns, request a higher limit, add retry logic, or move production traffic to a more reliable serving path. So we split them apart into three explicit signals:

Response	Meaning	What to do
`429` Too Many Requests	Your account exceeded its configured request or token limits. Unrelated to fleet load.	Reduce request volume, or raise your limit in the dashboard.
`503` Service Overloaded	The shared fleet is temporarily saturated and the request was shed before generation. Not an outage, and not covered by the SLA.	Retry with exponential backoff and jitter. Use Priority for workloads that need stronger admission during congestion.
`503` Service Unavailable	Something on our side is genuinely not working as expected. Covered by our 99.9% SLA.	Retry briefly, then alert or file a support ticket if it persists.
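The table can be turned into a small classifier for client-side handling. This post does not say which field separates the two 503 cases on the wire, so the sketch below assumes the caller has some short text (a reason phrase or error message) that contains "overloaded" for load shedding. The `classify` function, the category names, and that matching rule are all assumptions to check against the API reference.

```python
def classify(status, reason=""):
    """Map a response to one of the three signals in the table above.

    Assumption: `reason` is a short text (reason phrase or error message)
    that contains "overloaded" when the request was shed.
    """
    if status == 429:
        return "rate_limited"   # account limit, unrelated to fleet load
    if status == 503 and "overloaded" in reason.lower():
        return "overloaded"     # shared fleet saturated, request was shed
    if status == 503:
        return "unavailable"    # platform-side problem
    return "other"


ACTION = {
    "rate_limited": "reduce volume or raise the account limit",
    "overloaded": "retry with exponential backoff and jitter, or use Priority",
    "unavailable": "retry briefly, then alert if it persists",
}

print(ACTION[classify(503, "Service Overloaded")])
```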

That distinction matters. You should not alert on a rate-limit problem the way you alert on an outage, request a higher account limit when the real issue is temporary saturation, or treat shared-fleet load shedding as a platform incident when the right response is a retry or a higher-priority path. Serverless 2.0 makes these cases explicit so you can write correct retry logic, configure cleaner alerts, and choose the right serving path for each workload.
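For the load-shedding case, correct retry logic usually means exponential backoff with jitter. The following is a minimal sketch of the "full jitter" variant, not an official client: `Overloaded` is a hypothetical exception that your own wrapper would raise on 503 Service Overloaded, and the base delay, cap, and attempt count are illustrative defaults.

```python
import random
import time


class Overloaded(Exception):
    """Hypothetical error a client wrapper raises for 503 Service Overloaded."""


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random.random):
    """Full-jitter delay: uniform in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * 2 ** attempt)


def call_with_retry(send, max_attempts=5, sleep=time.sleep, rng=random.random):
    """Call send(); on Overloaded, wait a jittered delay and try again."""
    for attempt in range(max_attempts):
        try:
            return send()
        except Overloaded:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the overload to the caller
            sleep(backoff_delay(attempt, rng=rng))
```

Only the overload signal is retried here. A 429 is better handled by slowing the caller down or raising the account limit, since retrying harder does not fix an account-level limit.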

If you are coming from OpenAI-compatible APIs

The parameter names will look familiar, but the semantics are Fireworks-specific. The one to know: service_tier: "priority" is the reliability path here, not the speed knob. On OpenAI, Priority targets latency; on Fireworks, it reduces overload rejections during shared-fleet congestion. For faster tokens, use a Fast model ID. Everything else runs from the same API surface.

Why we built it this way

The old serverless tradeoff was too binary: use the shared fleet as-is, or move to a dedicated deployment with reserved capacity. Dedicated is powerful, but it asks you to know your traffic shape, reserve GPUs, and manage capacity earlier than many teams want to, and that does not match how modern AI products get built. Agentic workloads are bursty. Model choices change. Traffic grows unevenly. Teams want to stay pay-per-token while they are still learning what production demand looks like.

Serverless 2.0 gives those teams more room before they need dedicated capacity. Start on Standard. Add Priority when overload behavior matters. Switch to Fast when wall-clock latency matters. Reserve capacity when you need hard guarantees.

What this looks like in production

Teams running concurrent agent workflows are exactly who these controls are built for: workloads that need consistent, high-throughput inference but may not be ready to commit to dedicated capacity up front.

Innovative Solutions runs many concurrent AI-agent workflows per engineer on Fireworks inference. Their team changes models frequently, so the platform has to make model switching feel operationally boring.

Customer story · Innovative Solutions

Fireworks won simply because it worked consistently. Whenever we deploy any model, it works the first time. No tuning, no fiddling. That mattered to us, because we change models all the time. What I don't want is to get stuck in a 3-week development cycle trying to make a model work.

Justin Copie, CEO, Innovative Solutions

3× project capacity per engineer

38% faster to production

Serverless 2.0 is built for the next version of that workload: the same serverless simplicity, with more control over reliability and throughput.

How to get started

Standard is already the default. Every existing API call continues to work as-is.

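from openai import OpenAI
# Assumption: the OpenAI Python SDK pointed at the Fireworks OpenAI-compatible endpoint; substitute your own key.
client = OpenAI(base_url="https://api.fireworks.ai/inference/v1", api_key="<FIREWORKS_API_KEY>")
response = client.chat.completions.create(
    model="accounts/fireworks/models/kimi-k2p6",
    messages=[{"role": "user", "content": "Hello"}]
)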

Priority: add service_tier="priority". Works on both the OpenAI- and Anthropic-compatible APIs:

response = client.chat.completions.create(
    model="accounts/fireworks/models/kimi-k2p6",
    service_tier="priority",
    messages=[{"role": "user", "content": "Hello"}]
)

Fast: switch to a Fast model ID:

response = client.chat.completions.create(
    model="accounts/fireworks/routers/kimi-k2p6-turbo",
    messages=[{"role": "user", "content": "Hello"}]
)

Available Fast models:

Model	Model ID
Kimi K2.6 Turbo	`accounts/fireworks/routers/kimi-k2p6-turbo`
GLM 5.1 Fast	`accounts/fireworks/routers/glm-5p1-fast`
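For clients that do not use an SDK, the three calls above differ only in the JSON body. The sketch below builds the HTTP request with the Python standard library without sending it. The endpoint URL and the bearer-token header are assumptions based on the OpenAI-compatible convention and should be confirmed in the API reference; `build_request` is a hypothetical helper.

```python
import json
import urllib.request

# Assumption: OpenAI-compatible chat completions endpoint; confirm in the API reference.
URL = "https://api.fireworks.ai/inference/v1/chat/completions"


def build_request(api_key, model, messages, service_tier=None):
    """Build (but do not send) a chat completions request."""
    body = {"model": model, "messages": messages}
    if service_tier is not None:
        body["service_tier"] = service_tier  # only Priority sets this field
    return urllib.request.Request(
        URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_request(
    "YOUR_API_KEY",
    "accounts/fireworks/models/kimi-k2p6",
    [{"role": "user", "content": "Hello"}],
    service_tier="priority",
)
print(json.loads(req.data))
```

Passing the request to `urllib.request.urlopen(req)` would send it. Standard omits `service_tier`, and Fast changes only the `model` value.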

Pricing

For the most up-to-date model availability and pricing, see the Serverless pricing page. Pricing and availability may change.

What's next

Serverless 2.0 is the first step in a broader roadmap for production serverless inference. Up next: higher rate limits, region-based routing, and reservation management for teams that want more control as they scale from shared serverless to reserved capacity.

One more thing

Background

Not everything needs an answer right now. Batch jobs, report generation, offline analysis, and data enrichment can run async, and they shouldn't have to pay synchronous prices.

So we're previewing Background: queued async processing at roughly ¼ of Standard pricing. Background jobs are not subject to the same synchronous request-rate limits as Standard; they are queued and processed asynchronously, subject to account limits, model availability, and abuse protections.
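To see what the approximate multipliers in this post mean for a budget, here is a rough estimator. The 1.5× and ¼ figures come from this post and are approximate; the Standard price of 1.00 per million tokens and the 40M-token volume are made-up placeholders, and the sketch ignores any difference between input and output token pricing. Fast is left out because no multiplier is quoted for it here.

```python
# Approximate multipliers quoted in this post; real prices live on the pricing page.
MULTIPLIER = {"standard": 1.0, "priority": 1.5, "background": 0.25}


def estimated_cost(tokens, standard_price_per_million, path="standard"):
    """Estimate cost for `tokens` on a path, relative to a Standard price."""
    return tokens / 1_000_000 * standard_price_per_million * MULTIPLIER[path]


# Hypothetical Standard price of 1.00 per million tokens, 40M tokens per month.
for path in MULTIPLIER:
    print(path, estimated_cost(40_000_000, 1.00, path))
```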

Background is in Preview and enabled per account. Contact us to turn it on for your workloads.