Performance Metrics Overview
The Inference API returns several per-request metrics in the response. They can be useful for one-off debugging or can be logged by the client in their preferred observability tool. For aggregate metrics, see the usage dashboard.

Non-streaming requests: Performance metrics are always included in the response headers (e.g., `fireworks-prompt-tokens`, `fireworks-server-time-to-first-token`).
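For a non-streaming request, these metrics can be read directly off the HTTP response headers. A minimal sketch; the helper name and header values here are illustrative, not part of the API:

```python
def fireworks_metrics(headers):
    """Collect the fireworks-* performance headers from a response (case-insensitive)."""
    return {k.lower(): v for k, v in headers.items() if k.lower().startswith("fireworks-")}

# Illustrative values, as they might appear on a non-streaming response:
headers = {
    "content-type": "application/json",
    "fireworks-prompt-tokens": "12",
    "fireworks-server-time-to-first-token": "0.041",
}
print(fireworks_metrics(headers))
# {'fireworks-prompt-tokens': '12', 'fireworks-server-time-to-first-token': '0.041'}
```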
Streaming requests: Only selected performance metrics, such as `fireworks-server-time-to-first-token`, are available in headers, because HTTP headers must be sent before the first token can be streamed. Use the `perf_metrics_in_response` body parameter to include all metrics in the last SSE event of the response body.
Using `perf_metrics_in_response`
To get performance metrics for streaming responses, set the `perf_metrics_in_response` parameter to `true` in your request. This will include performance data in the response body under the `perf_metrics` field.
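A raw request body might look like the following sketch; the model name is a placeholder, and the only parameter this section adds is `perf_metrics_in_response`:

```python
import json

# Illustrative streaming request payload; the model name is a placeholder.
payload = {
    "model": "accounts/fireworks/models/llama-v3p1-8b-instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": True,
    "perf_metrics_in_response": True,  # ask for metrics in the final SSE event
}
print(json.dumps(payload, indent=2))
```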
Response Body Location
For streaming responses, performance metrics are included in the response body under the `perf_metrics` field in the final chunk (the one with `finish_reason` set). This is because headers may not be accessible during streaming.
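Once the SSE chunks have been parsed into dictionaries, the metrics can be pulled from that final chunk. A sketch; the chunk shapes and metric names below are illustrative:

```python
def extract_perf_metrics(chunks):
    """Return perf_metrics from the chunk with finish_reason set, or None."""
    for chunk in chunks:
        choices = chunk.get("choices", [])
        if choices and choices[0].get("finish_reason") is not None:
            return chunk.get("perf_metrics")
    return None

# Illustrative chunk stream: intermediate chunks carry finish_reason=None,
# and only the final chunk carries perf_metrics.
chunks = [
    {"choices": [{"delta": {"content": "Hi"}, "finish_reason": None}]},
    {"choices": [{"delta": {}, "finish_reason": "stop"}],
     "perf_metrics": {"server_time_to_first_token": 0.041}},
]
print(extract_perf_metrics(chunks))
# {'server_time_to_first_token': 0.041}
```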
Example with Fireworks Build SDK
Python
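A streaming sketch, assuming the Fireworks Python client's OpenAI-compatible chat interface and that extra body parameters such as `perf_metrics_in_response` are passed through to the API; the model name and the attribute access on the final chunk are illustrative:

```python
def stream_with_metrics(prompt, model, api_key):
    """Stream a chat completion and return perf_metrics from the final chunk.

    Assumes the fireworks-ai package is installed; the pass-through of
    perf_metrics_in_response and the perf_metrics field on the final chunk
    follow the behavior described above.
    """
    from fireworks.client import Fireworks  # requires the fireworks-ai package

    client = Fireworks(api_key=api_key)
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        perf_metrics_in_response=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].finish_reason is not None:
            # The final chunk (finish_reason set) carries the metrics.
            return getattr(chunk, "perf_metrics", None)
    return None
```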