Overview
Fireworks provides a metrics endpoint in Prometheus format, enabling integration with popular observability tools like Prometheus, OpenTelemetry (OTel) Collector, Datadog Agent, and Vector.

Setting Up Metrics Collection
Endpoint
The metrics endpoint is as follows. This URL and authorization header can be directly used by services like Grafana Cloud to ingest Fireworks metrics.

Authentication
Use the Authorization header with your Fireworks API key.

Scrape Interval
We recommend using a 1-minute scrape interval, as metrics are updated every 30s.
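As a point of reference, the sketch below shows how the endpoint, Authorization header, and scrape interval typically fit together in a Prometheus-style scrape configuration. The job name, target host, metrics path, and Bearer scheme are placeholders and assumptions rather than values from this page; substitute the endpoint URL and API key from your Fireworks account.

```yaml
# Sketch of a prometheus.yml scrape job. The job name, target host, metrics path,
# and Bearer scheme are placeholders/assumptions; use your actual Fireworks values.
scrape_configs:
  - job_name: fireworks                        # hypothetical job name
    scheme: https
    scrape_interval: 60s                       # metrics update every 30s; 1m stays within the rate limit
    metrics_path: /metrics                     # placeholder; use the documented endpoint path
    static_configs:
      - targets: ["<fireworks-metrics-host>"]  # placeholder host
    authorization:
      type: Bearer                             # assumed scheme
      credentials: <FIREWORKS_API_KEY>
```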
Supported Integrations

Fireworks metrics can be collected via OpenTelemetry Collector and exported to various observability platforms including:

- Prometheus
- Datadog
- Grafana
- New Relic
Prometheus Integration
To integrate with Prometheus, specify the Fireworks metrics endpoint in your scrape config. To forward metrics to services like Datadog, configure a Prometheus receiver (for example, in the OpenTelemetry Collector) to scrape the endpoint and push the metrics to the destination.
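As an illustration of that path, here is a minimal Collector configuration sketch that scrapes the endpoint with the Prometheus receiver (using the same scrape settings as the example above) and exports to Datadog. The target host, credentials, and exporter choice are placeholders and assumptions; any metrics exporter supported by the Collector can take Datadog's place.

```yaml
# Sketch of an OpenTelemetry Collector pipeline. Target host, API key values, and
# the Bearer scheme are placeholders/assumptions.
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: fireworks
          scheme: https
          scrape_interval: 60s
          static_configs:
            - targets: ["<fireworks-metrics-host>"]  # placeholder host
          authorization:
            type: Bearer                             # assumed scheme
            credentials: <FIREWORKS_API_KEY>

exporters:
  datadog:
    api:
      key: <DATADOG_API_KEY>  # placeholder

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [datadog]
```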
Rate Limits

To ensure service stability and fair usage:

- Maximum of 6 requests per minute per account
- Exceeding this limit results in HTTP 429 (Too Many Requests) responses
- Use a 1-minute scrape interval to stay within limits
Available Metrics
Common Labels
All metrics include the following common labels:

- base_model: The base model identifier (e.g., “accounts/fireworks/models/deepseek-v3”)
- deployment: Full deployment path (e.g., “accounts/account-name/deployments/deployment-id”)
- deployment_account: The account name
- deployment_id: The deployment identifier
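As a sketch of how these labels are typically used, the recording rule below (the group and record names are hypothetical) rolls the per-deployment request rate from the next section up to the base model level:

```yaml
# Prometheus recording rule (sketch); the group and record names are hypothetical.
groups:
  - name: fireworks-label-examples
    rules:
      - record: fireworks:request_rate:by_base_model
        expr: sum by (base_model) (request_counter_total:sum_by_deployment)
```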
Rate Metrics (per second)
These metrics show activity rates calculated using 1-minute windows:

Request Rate

- request_counter_total:sum_by_deployment: Request rate per deployment
Token Processing Rates
- tokens_cached_prompt_total:sum_by_deployment: Rate of cached prompt tokens per deployment
- tokens_prompt_total:sum_by_deployment: Rate of total prompt tokens processed per deployment
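One way to combine these rates, sketched below with a hypothetical record name: dividing cached prompt tokens by total prompt tokens approximates the prompt cache hit rate per deployment. This assumes both series carry the same label set, so the division matches one-to-one.

```yaml
# Prometheus recording rule (sketch); the record name is hypothetical.
groups:
  - name: fireworks-token-rate-examples
    rules:
      - record: fireworks:prompt_cache_hit_ratio:by_deployment
        expr: >
          tokens_cached_prompt_total:sum_by_deployment
          / tokens_prompt_total:sum_by_deployment
```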
Latency Histogram Metrics
These metrics provide latency distribution data with histogram buckets, calculated using 1-minute windows:

Generation Latency
- latency_generation_per_token_ms_bucket:sum_by_deployment: Per-token generation time distribution
- latency_generation_queue_ms_bucket:sum_by_deployment: Time spent waiting in generation queue
Request Latency
- latency_overall_ms_bucket:sum_by_deployment: End-to-end request latency distribution
- latency_to_first_token_ms_bucket:sum_by_deployment: Time to first token distribution
Prefill Latency
- latency_prefill_ms_bucket:sum_by_deployment: Prefill processing time distribution
- latency_prefill_queue_ms_bucket:sum_by_deployment: Time spent waiting in prefill queue
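Because these are histogram bucket series, they can be fed to histogram_quantile for percentile-based alerts. A minimal sketch follows, assuming the bucket series carry the standard le label; the alert name and 2-second threshold are illustrative:

```yaml
# Prometheus alerting rule (sketch); the alert name and threshold are illustrative,
# and the presence of the le label on the bucket series is assumed.
groups:
  - name: fireworks-latency-examples
    rules:
      - alert: HighTimeToFirstToken
        expr: >
          histogram_quantile(0.95,
            sum by (le, deployment_id) (latency_to_first_token_ms_bucket:sum_by_deployment)) > 2000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p95 time to first token above 2s on {{ $labels.deployment_id }}"
```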
Token Distribution Metrics
These histogram metrics show token count distributions per request, calculated using 1-minute windows:

- tokens_generated_per_request_bucket:sum_by_deployment: Distribution of generated tokens per request
- tokens_prompt_per_request_bucket:sum_by_deployment: Distribution of prompt tokens per request
Resource Utilization Metrics
These gauge metrics show average resource usage:

- generator_kv_blocks_fraction:avg_by_deployment: Average fraction of KV cache blocks in use
- generator_kv_slots_fraction:avg_by_deployment: Average fraction of KV cache slots in use
- generator_model_forward_time:avg_by_deployment: Average time spent in model forward pass
- requests_coordinator_concurrent_count:avg_by_deployment: Average number of concurrent requests
- prefiller_prompt_cache_ttl:avg_by_deployment: Average prompt cache time-to-live
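These gauges lend themselves to capacity alerts. A minimal sketch, with an illustrative alert name and threshold, that fires when KV cache block utilization stays above 90%:

```yaml
# Prometheus alerting rule (sketch); the alert name and threshold are illustrative.
groups:
  - name: fireworks-capacity-examples
    rules:
      - alert: KvCacheBlocksNearlyFull
        expr: generator_kv_blocks_fraction:avg_by_deployment > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "KV cache block utilization above 90% on {{ $labels.deployment_id }}"
```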