Performance Metrics Overview
The Inference API returns several per-request metrics in the response. They can be useful for one-off debugging or can be logged by the client in their preferred observability tool. For aggregate metrics, see the usage dashboard.

Non-streaming requests: Performance metrics are always included in the response headers (e.g., `fireworks-prompt-tokens`, `fireworks-server-time-to-first-token`).
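For a non-streaming request, these metrics can be read directly off the HTTP response headers. A minimal sketch; the helper name and header values here are illustrative, not part of the API:

```python
def fireworks_metrics(headers):
    """Collect the fireworks-* performance headers from a response (case-insensitive)."""
    return {k.lower(): v for k, v in headers.items() if k.lower().startswith("fireworks-")}

# Illustrative values, as they might appear on a non-streaming response:
headers = {
    "content-type": "application/json",
    "fireworks-prompt-tokens": "12",
    "fireworks-server-time-to-first-token": "0.041",
}
print(fireworks_metrics(headers))
# {'fireworks-prompt-tokens': '12', 'fireworks-server-time-to-first-token': '0.041'}
```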
Streaming requests: Only selected performance metrics, such as `fireworks-server-time-to-first-token`, are available in headers, because HTTP headers must be sent before the first token can be streamed. Use the `perf_metrics_in_response` body parameter to include all metrics in the last SSE event of the response body.
Using `perf_metrics_in_response`
To get performance metrics for streaming responses, set the `perf_metrics_in_response` parameter to `true` in your request. This will include performance data in the response body under the `perf_metrics` field.
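A raw request body might look like the following sketch; the model name is a placeholder, and the only parameter this section adds is `perf_metrics_in_response`:

```python
import json

# Illustrative streaming request payload; the model name is a placeholder.
payload = {
    "model": "accounts/fireworks/models/llama-v3p1-8b-instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": True,
    "perf_metrics_in_response": True,  # ask for metrics in the final SSE event
}
print(json.dumps(payload, indent=2))
```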
Response Body Location
For streaming responses, performance metrics are included in the response body under the `perf_metrics` field in the final chunk (the one with `finish_reason` set). This is because headers may not be accessible during streaming.
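Once the SSE chunks have been parsed into dictionaries, the metrics can be pulled from that final chunk. A sketch; the chunk shapes and metric names below are illustrative:

```python
def extract_perf_metrics(chunks):
    """Return perf_metrics from the chunk with finish_reason set, or None."""
    for chunk in chunks:
        choices = chunk.get("choices", [])
        if choices and choices[0].get("finish_reason") is not None:
            return chunk.get("perf_metrics")
    return None

# Illustrative chunk stream: intermediate chunks carry finish_reason=None,
# and only the final chunk carries perf_metrics.
chunks = [
    {"choices": [{"delta": {"content": "Hi"}, "finish_reason": None}]},
    {"choices": [{"delta": {}, "finish_reason": "stop"}],
     "perf_metrics": {"server_time_to_first_token": 0.041}},
]
print(extract_perf_metrics(chunks))
# {'server_time_to_first_token': 0.041}
```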
Example with Fireworks Build SDK
Python
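A streaming sketch, assuming the Fireworks Python client's OpenAI-compatible chat interface and that extra body parameters such as `perf_metrics_in_response` are passed through to the API; the model name and the attribute access on the final chunk are illustrative:

```python
def stream_with_metrics(prompt, model, api_key):
    """Stream a chat completion and return perf_metrics from the final chunk.

    Assumes the fireworks-ai package is installed; the pass-through of
    perf_metrics_in_response and the perf_metrics field on the final chunk
    follow the behavior described above.
    """
    from fireworks.client import Fireworks  # requires the fireworks-ai package

    client = Fireworks(api_key=api_key)
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        perf_metrics_in_response=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].finish_reason is not None:
            # The final chunk (finish_reason set) carries the metrics.
            return getattr(chunk, "perf_metrics", None)
    return None
```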