Batch Inference lets you run Chat Completions in bulk, asynchronously, reducing costs by up to 50%.

Overview

The Chat Completions API handles synchronous inference, one request at a time. If you need to process a large number of requests, our Batch Inference API is a more efficient alternative. It works with all 1,000+ models in our Model Library, as well as your own fine-tuned models.

Use Cases

  • ETL Pipelines – Construct production pipelines around large-scale inference workloads
  • Evaluations – Automate large-scale testing and benchmarking
  • Distillation – Teach a smaller model using a larger model

Cost Optimization

Batch API Advantages

  • 💸 Volume Discounts – Batch Inference is priced at 50% of our serverless rates.
  • Higher throughput – Process more data in less time.

Step-by-Step Guide to Batch Inference with Fireworks AI

1. Preparing the Dataset

Datasets must adhere strictly to the JSONL format, where each line represents a complete JSON-formatted inference request. Requirements:
  • File format: JSONL (each line is a valid JSON object)
  • Total size limit: Under 500MB
  • Format: OpenAI Batch API-compatible, with a unique custom_id and a body field in each row
Here’s an example input dataset:
{"custom_id": "request-1", "body": {"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is the capital of France?"}], "max_tokens": 100}}
{"custom_id": "request-2", "body": {"messages": [{"role": "user", "content": "Explain quantum computing"}], "temperature": 0.7}}
{"custom_id": "request-3", "body": {"messages": [{"role": "user", "content": "Tell me a joke"}]}}
Save this dataset locally as a JSONL file, for example batch_input_data.jsonl, making sure each custom_id is unique across rows.
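Before uploading, it can help to sanity-check the file locally. Here is a minimal sketch (plain Python, no SDK required) that verifies each line is valid JSON, that every row has a unique custom_id and a body with messages, and that the file stays under the 500MB limit; the file name batch_input_data.jsonl matches the example above.

```python
import json
import os

MAX_BYTES = 500 * 1024 * 1024  # 500MB input dataset limit

def validate_batch_file(path: str) -> None:
    # Enforce the total-size limit before anything else.
    size = os.path.getsize(path)
    if size > MAX_BYTES:
        raise ValueError(f"{path} is {size} bytes; must be under 500MB")

    seen_ids = set()
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            row = json.loads(line)  # raises if the line is not valid JSON
            custom_id = row.get("custom_id")
            if not custom_id:
                raise ValueError(f"line {line_no}: missing custom_id")
            if custom_id in seen_ids:
                raise ValueError(f"line {line_no}: duplicate custom_id {custom_id!r}")
            seen_ids.add(custom_id)
            if "body" not in row or "messages" not in row["body"]:
                raise ValueError(f"line {line_no}: body.messages is required")

if __name__ == "__main__":
    validate_batch_file("batch_input_data.jsonl")
    print("batch_input_data.jsonl looks valid")
```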

2. Uploading the Dataset to Fireworks AI

There are a few ways to upload the dataset to the Fireworks platform for batch inference: the UI, firectl, or the HTTP API.
In the UI, navigate to the Datasets tab, click Create Dataset, and follow the wizard.
(Screenshot: Dataset Upload)
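To upload programmatically instead of through the UI, you can use firectl or the HTTP API. The sketch below shows the general shape of an HTTP upload; the endpoint paths, payload fields, and dataset id are assumptions for illustration only, so consult the HTTP API and firectl references in the Appendix for the actual interface.

```python
# Hedged sketch: the routes and payload shapes below are placeholders, not the
# confirmed Fireworks API surface -- check the HTTP API references in the
# Appendix for the real paths before using this.
import os
import requests

API_KEY = os.environ["FIREWORKS_API_KEY"]
ACCOUNT_ID = os.environ["FIREWORKS_ACCOUNT_ID"]
BASE_URL = "https://api.fireworks.ai/v1"  # assumption: REST API base URL

headers = {"Authorization": f"Bearer {API_KEY}"}

# 1. Register a dataset (hypothetical route and body).
resp = requests.post(
    f"{BASE_URL}/accounts/{ACCOUNT_ID}/datasets",
    headers=headers,
    json={"datasetId": "batch-input-data"},
)
resp.raise_for_status()

# 2. Upload the JSONL file into the dataset (hypothetical route).
with open("batch_input_data.jsonl", "rb") as f:
    upload = requests.post(
        f"{BASE_URL}/accounts/{ACCOUNT_ID}/datasets/batch-input-data:upload",
        headers=headers,
        files={"file": ("batch_input_data.jsonl", f)},
    )
upload.raise_for_status()
print("dataset uploaded")
```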

3. Creating a Batch Inference Job

Navigate to the Batch Inference tab and click “Create Batch Inference Job”.
Select your input dataset:
(Screenshot: BIJ Dataset Select)
Choose your model:
(Screenshot: BIJ Model Select)
Configure optional settings:
(Screenshot: BIJ Optional Settings)
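If you prefer to create the job from code rather than the UI, the general shape is: point a batch inference job at your uploaded input dataset, a model, and an output dataset. The route and field names below are assumptions for illustration; the Python builder SDK and HTTP API references in the Appendix document the real interface.

```python
# Hedged sketch only: the route and field names are assumptions, not the
# confirmed API -- see the HTTP API references in the Appendix.
import os
import requests

API_KEY = os.environ["FIREWORKS_API_KEY"]
ACCOUNT_ID = os.environ["FIREWORKS_ACCOUNT_ID"]
BASE_URL = "https://api.fireworks.ai/v1"  # assumed base URL

resp = requests.post(
    f"{BASE_URL}/accounts/{ACCOUNT_ID}/batchInferenceJobs",  # hypothetical route
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "accounts/fireworks/models/llama-v3p1-8b-instruct",  # example model id
        "inputDatasetId": "batch-input-data",    # hypothetical field name
        "outputDatasetId": "batch-output-data",  # hypothetical field name
    },
)
resp.raise_for_status()
print("created job:", resp.json())
```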

4. Monitoring and Managing Batch Inference Jobs

Batch Job States

Batch Inference Jobs progress through several states during their lifecycle:
  • VALIDATING – The input dataset is being validated to ensure it meets format requirements and constraints.
  • PENDING – The job is queued and waiting for available resources to begin processing.
  • RUNNING – The batch job is actively processing requests from the input dataset.
  • COMPLETED – All requests have been successfully processed and results are available in the output dataset.
  • FAILED – The job encountered an unrecoverable error. Check the job status message for details.
  • EXPIRED – The job exceeded the 24-hour time limit. Any completed requests up to that point are saved to the output dataset.
View all your batch inference jobs in the dashboard:
(Screenshot: BIJ List)
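For long-running jobs it can be convenient to poll the job state from a script rather than watching the dashboard. A minimal sketch follows, assuming a hypothetical GET route that mirrors the creation call above; the actual route is documented in the HTTP API references.

```python
# Hedged sketch: the GET route below is an assumption; the terminal states
# match the lifecycle list above (COMPLETED, FAILED, EXPIRED).
import os
import time
import requests

API_KEY = os.environ["FIREWORKS_API_KEY"]
ACCOUNT_ID = os.environ["FIREWORKS_ACCOUNT_ID"]
BASE_URL = "https://api.fireworks.ai/v1"  # assumed base URL
JOB_ID = "my-batch-job"                   # hypothetical job id

TERMINAL_STATES = {"COMPLETED", "FAILED", "EXPIRED"}

while True:
    resp = requests.get(
        f"{BASE_URL}/accounts/{ACCOUNT_ID}/batchInferenceJobs/{JOB_ID}",  # hypothetical route
        headers={"Authorization": f"Bearer {API_KEY}"},
    )
    resp.raise_for_status()
    state = resp.json().get("state")
    print("job state:", state)
    if state in TERMINAL_STATES:
        break
    time.sleep(60)  # batch jobs run asynchronously; poll once a minute
```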

5. Downloading the Results

After the batch inference job is complete, download the output dataset containing the results.
Navigate to the output dataset and download the results:
(Screenshot: BIJ Dataset Download)

Output Files

The output dataset contains two types of files:
  • Results file – Successful inference responses in JSONL format; each line carries the custom_id from your input request.
  • Error file – Error details for any requests that failed processing, along with the original custom_id for debugging.
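Because both files are JSONL keyed by custom_id, joining results back to your original inputs is straightforward. Here is a small sketch; the downloaded file names and the exact per-line response layout are assumptions, so inspect a downloaded file to confirm the actual schema.

```python
# Hedged sketch: file names and the per-line response layout are assumptions;
# the only property relied on here is that every line is JSON with a custom_id.
import json

def load_jsonl(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

inputs = {row["custom_id"]: row for row in load_jsonl("batch_input_data.jsonl")}
results = load_jsonl("results.jsonl")  # assumed name of the results file
errors = load_jsonl("errors.jsonl")    # assumed name of the error file

for row in results:
    original = inputs.get(row["custom_id"])  # the matching input request
    # The response payload layout depends on the output schema; print raw for inspection.
    print(row["custom_id"], "->", json.dumps(row)[:200])

print(f"{len(results)} succeeded, {len(errors)} failed")
```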

6. Best Practices and Considerations

  • Validate your dataset thoroughly before uploading.
  • Use appropriate inference parameters for your use case.
  • Monitor job progress for long-running batches.
  • Set reasonable max_tokens limits to optimize processing time.
  • Use descriptive custom_id values for easier result tracking.

Model Availability

  • Base Models – Any Base Model in our Model Library
  • Account Models – Any model you have uploaded/trained, including fine-tuned models
Note: Newly added models may have a delay before they are supported for Batch Inference.

Limits

  • Each individual request (each row in the dataset) is subject to the same constraints as the Chat Completion Limits.
  • The input dataset must adhere to the Dataset Limits and be under 500MB total (see the splitting sketch below if your file approaches this limit).
  • The output dataset is capped at 8GB, and the job may expire early if this limit is reached.
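If your input file approaches the 500MB limit, one practical approach is to split it into several smaller datasets and submit a batch job per chunk. A minimal sketch follows; the 450MB threshold and output file naming are arbitrary choices for illustration.

```python
# Hedged sketch: splits a large JSONL input into chunks that stay below the
# 500MB dataset limit; the chunk size and file naming are arbitrary choices.
CHUNK_BYTES = 450 * 1024 * 1024  # stay comfortably under the 500MB limit

def split_jsonl(path: str, prefix: str = "batch_input_part") -> list[str]:
    outputs, part, written, out = [], 0, 0, None
    with open(path, "rb") as f:
        for line in f:
            # Start a new chunk when the current one would exceed the threshold.
            if out is None or written + len(line) > CHUNK_BYTES:
                if out:
                    out.close()
                part += 1
                written = 0
                name = f"{prefix}_{part:03d}.jsonl"
                out = open(name, "wb")
                outputs.append(name)
            out.write(line)
            written += len(line)
    if out:
        out.close()
    return outputs

print(split_jsonl("batch_input_data.jsonl"))
```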

Batch Expiration

A batch job expires if it runs for 24 hours; any rows completed up to that point are billed for and written to the output dataset.

Appendix

  • Python builder SDK references
  • HTTP API references
  • firectl references