Inference Providers vs. API Routers: Where Do Your Tokens Actually Come From?

Open vs. closed models: the context

What's the actual difference?

Performance: the honest picture

Data sovereignty: the shadow traffic problem

When to use each

How to tell as a developer

Frequently asked questions

When you sign up to use an LLM via API, you're implicitly assuming that the company you’re calling is the same one executing your request. In practice, that assumption is not always correct, and in some cases, the separation matters.

Open vs. closed models: the context

Open-weight models, such as Llama, Mistral, Qwen, and DeepSeek, can be deployed and served by any party with sufficient GPU capacity and comfort with an open-source inference stack such as vLLM. In practice, this means multiple independent entities may serve the same model under different infrastructure and policies.

Closed models are different. GPT, Claude, and Gemini use proprietary weights, meaning inference is controlled by the model provider or explicitly authorized infrastructure partners. The execution layer is not broadly deployable.

So how does OpenRouter offer access to GPT-5.5? It does not run the model itself. Instead, it forwards requests to an upstream provider’s API and returns the response. It functions as a routing and aggregation layer rather than an inference layer.

Think of travel booking: Airbnb lists many boutique hotel properties and will handle your reservation. As a marketplace platform, it acts as a convenience layer. The core service you’re paying for though is hospitality. Everything from the check-in experience to the cleanliness and comfort of your bed is the responsibility of the hotel. Now apply that analogy to model inference: a routing platform provides a convenience layer with access to many models, but the “service delivery” is the execution of an input request and the delivery of output tokens, which is handled by the actual inference provider.

What's the actual difference?

Inference providers secure dedicated GPU compute and serve models directly. The company controlling the API endpoint is the same company controlling the hardware.

API routers aggregate access across multiple providers and expose a unified interface. When a request is made, it is forwarded to an upstream provider that performs the actual inference. The router itself never touches a GPU directly on your behalf.

One critical nuance: routing layers can, in some cases, forward traffic to publicly exposed API endpoints without deep integration into the provider’s infrastructure. In practice, most providers define usage policies and terms that govern how their APIs may be accessed or resold, and the well-established inference routers have formal partnership agreements in place with many of the major inference providers.

Performance: the honest picture

Proxy hops are always additive. While popular API routers typically prioritize low latency, there is no configuration in which routing through an intermediary improves your median TTFT when compared to accessing the same inference API endpoint directly.

That said, well-run routers offer something real: reliability. Services like OpenRouter maintain dozens of endpoints for the same popular models and can automatically reroute around overloaded or degraded endpoints. That's a genuine p95 improvement. By systemically providing redundancy, the end-user can expect fewer tail disasters, with less exposure to a single provider having an outage or degraded experience.

Lastly, routers have zero visibility into the GPU-level decisions that actually determine inference quality (KV cache configuration, batch scheduling, custom kernels). Those decisions are made upstream of the router and at the mercy of the inference provider.

The TLDR: routers improve tail latency; they cannot improve median latency. They're a reliability layer, but are generally incapable of improving baseline performance.

Data sovereignty: the shadow traffic problem

If we consider another analogy, a direct inference provider is like a farm-to-table restaurant: the chef may not own the farms (GPUs) or have a proprietary recipe (model weights), but she's personally selected every ingredient and is responsible for preparing the meal. In this analogy, a router is akin to DoorDash: very useful for aggregation and delivery, but with limited control over what happens in the kitchen.

However, in the fast-moving AI space, there are a number of routers today offering free tiers and below-cost pricing. Shadow traffic, which involves duplicating live requests for model evaluation or dataset collection, is a standard industry practice because of the intrinsic value of live user data for training future models. Shadow traffic is invisible in the response and leaves no trace in your logs. Especially for new or smaller API routers without established reputations to sustain, we suggest caution in accordance with the age-old tech cliche: If the product is free, you are the product.

The structural problem: a router can only bind itself. A zero-retention DPA with OpenRouter protects your data at their layer. Your request still lands at an upstream provider whose policies you haven't reviewed or signed. The agreement doesn't follow your data.

For compliance-sensitive workloads (HIPAA, GDPR, SOC 2 attestations) or those containing PII (personally identifiable information), we recommend minimizing the number of middleware layers such as 3rd party routers, as they can never be be fully controlled or monitored.

When to use each

	Few models	Many models
High data sensitivity	Direct provider only. No exceptions.	Negotiate DPAs directly with each provider.
Low data sensitivity	Either works. Direct preferred at scale.	Router is ideal: one API key, multi-provider fallback, broad model access.

Routers are a convenience tax. Pay it when the convenience is real.

How to tell as a developer

A. Read the Ts & Cs (or delegate to your AI agent — paste their docs and DPA and ask)

•Documentation language: "our clusters / our GPUs" indicates a provider; "access 200+ models from leading providers" indicates a router.
•DPA sub-processor language: references to "sub-processors" or "third-party infrastructure partners" are router tells. A direct provider's DPA has no such chain.

B. Manual server inspection

•Status page: real providers list GPU regions and infrastructure incidents. Routers only show API uptime.
•ASN lookup: use whois on the endpoint IP to see if it is a company ASN or generic cloud block.
•Latency fingerprinting: a consistent 20–80ms overhead vs. a known direct provider is the proxy hop signature.
•Response headers: x-served-by, x-upstream, or similar may leak the actual serving provider.

Popular Inference Providers and API Routers Today

Direct providers secure dedicated compute (typically via long-term GPU agreements) and serve models from that infrastructure. Routers proxy to providers they don't control.

Provider	Type	Signal
Fireworks AI	Direct provider	Secured GPU clusters, FireAttention kernel, hardware SLAs
Together AI	Direct provider	Secured data centers, custom inference kernels
Baseten	Direct provider	Dedicated model replicas on secured infrastructure
Groq	Direct provider	Proprietary LPU silicon — definitionally can't be a router
Cerebras	Direct provider	Wafer-scale chips — same logic as Groq
Replicate (Cloudflare)	Direct provider	Secured GPU fleet; cold start behavior confirms real infra
OpenRouter	Router	Multi-provider routing; model list maps to upstream APIs
Not Diamond	Gateway Router	Task-aware routing layer, no infrastructure claims
Martian	Router	Adaptive routing, same architecture
LiteLLM (cloud)	Router	OSS gateway turned managed service

Before you ship

Ask one question before you put any LLM API in your production request path: Where does my token actually get processed?

If the answer requires reading another company's DPA to complete — you're talking to a router. That might be fine for your use case. But you should know which one you're in.

Frequently asked questions

What's the difference between an inference provider and an API router? An inference provider secures dedicated GPU compute and serves models directly — the same company controls both the API endpoint and the hardware processing your tokens. An API router is a proxy layer that forwards your requests to an upstream provider; it has no compute of its own and does not process your tokens.

Is OpenRouter an inference provider?
No. OpenRouter is an API router — it aggregates access to multiple inference providers and routes your requests upstream. It does not secure or operate any GPU infrastructure.

Does routing through an API router increase latency?
Yes. Latency will always be at least mildly worse versus calling the same provider directly. Every proxy hop adds round-trip overhead, so your median TTFT (time to first token) will always be higher through a router than through a direct provider. However, routers can improve p95 reliability by automatically rerouting around saturated endpoints, but they cannot improve median latency. For workloads low latency is critical, a guaranteed small increase in latency can still be well-worth it, in order to reliably mitigate against p95 or p99 outtage risk.

Are API router DPAs enforceable for HIPAA or SOC 2 workloads? Not on their own. A router can only bind itself — it cannot enforce data retention policies on the upstream providers that actually process your tokens. For regulated workloads, you need a direct agreement with the entity whose hardware handles your data.

What is shadow traffic in the context of LLM APIs? Shadow traffic is the practice of duplicating a percentage of live API requests for an orthogonal use case, typically for new feature testing & QA, abuse detection, or model evaluation. Shadow traffic is typically non-malicious, but leaves no trace in application logs. Accordingly, it is a primary reason why zero-retention guarantees from API router services are structurally unenforceable.

How do I verify that an LLM API provider actually secures its own compute? Check their status page for GPU region and infrastructure incident history, run whois on their endpoint IP to confirm the ASN resolves to their company (not a generic cloud block), and review their DPA for sub-processor language. Documentation saying "access 200+ models from leading providers" is a router tell; "our GPU clusters" or "our inference stack" points to a direct provider.

Inference Providers vs. API Routers: where do tokens come from?

Table of Contents

Table of Contents

Open vs. closed models: the context

What's the actual difference?

Performance: the honest picture

Data sovereignty: the shadow traffic problem

When to use each

How to tell as a developer

Popular Inference Providers and API Routers Today

Before you ship

Frequently asked questions