Overview

This guide shows how to deploy Fireworks AI inference on Amazon SageMaker, including:
  • Preparing your AWS account and checking SageMaker quotas
  • Packaging and uploading your model to Amazon S3
  • Pushing the Fireworks Docker image to Amazon ECR
  • Creating a SageMaker endpoint (single replica, multi-replica, or sharded)
  • Testing the endpoint via SageMaker Runtime
Make sure that you use the same region for all steps in this guide (S3 bucket, ECR repository, and SageMaker endpoint).

Prerequisites

  • Fireworks AI Docker image and metering key; please reach out to [email protected] for more information
  • AWS account with permissions for SageMaker, ECR, and S3
  • AWS CLI installed and configured
  • Docker Desktop installed and running
  • jq for JSON filtering (macOS: brew install jq)
  • git-lfs for large model files (macOS: brew install git-lfs then git lfs install)
  • Optional: uv for Python virtual envs and execution (curl -LsSf https://astral.sh/uv/install.sh | sh)
This guide uses placeholder values for AWS account ID, region, and bucket name (denoted in square brackets). Please replace them with your own values.

Step 1: Obtain the Fireworks Docker image and metering key

  • Reach out to [email protected] to receive a link to the Fireworks AI Docker image, metering key, and to set up billing.
  • Keep the metering key secure; you will set it as an environment variable when deploying.
Store the metering key in a secrets manager or CI/CD secret store.
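For example, with AWS Secrets Manager (the secret name below is illustrative; any secret store works):
aws secretsmanager create-secret \
  --name fireworks/metering-key \
  --secret-string '[YOUR_METERING_KEY]' \
  --region [YOUR_REGION]
# Later, load it into the environment variable used during deployment:
export FIREWORKS_METERING_KEY=$(aws secretsmanager get-secret-value \
  --secret-id fireworks/metering-key \
  --query SecretString --output text --region [YOUR_REGION])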

Step 2: Verify SageMaker GPU service quotas

Check your SageMaker quotas for GPU instances in your target region. Make sure to replace [YOUR_REGION] with your actual region. For best compatibility, instance types with A100, H100, or H200 GPUs are recommended (ml.p4d*, ml.p5*).
aws service-quotas list-service-quotas \
  --service-code sagemaker \
  --region [YOUR_REGION] \
  | jq '.Quotas[] | select(.QuotaName | contains("p4d") and contains("endpoint usage"))'
aws service-quotas list-service-quotas \
  --service-code sagemaker \
  --region [YOUR_REGION] \
  | jq '.Quotas[] | select(.QuotaName | contains("p5") and contains("endpoint usage"))'
Choose an instance type that matches your model’s requirements and your quota (for example, ml.p5* for H100/H200).
Insufficient quota will cause endpoint deployment failures. Request quota increases in advance if needed.
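You can request an increase directly from the CLI; take the QuotaCode from the list output above:
aws service-quotas request-service-quota-increase \
  --service-code sagemaker \
  --quota-code [QUOTA_CODE_FROM_LIST_OUTPUT] \
  --desired-value 1 \
  --region [YOUR_REGION]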

Step 3: Create an S3 bucket and upload model files

  1. Create a bucket in your region:
aws s3 mb s3://sagemaker-fireworks-model-[YOUR_AWS_ACCOUNT_ID] --region [YOUR_REGION]
  2. Download model files from Hugging Face (example: Qwen/Qwen3-8B):
brew install git-lfs
git lfs install
git clone https://huggingface.co/Qwen/Qwen3-8B
cd Qwen3-8B
  3. Add a fireworks.json describing your model configuration:
{
  "model_arch": "auto",
  "model_config_name": "auto",
  "world_size": 1,
  "checkpoint_format": "huggingface"
}
  4. Create a tar.gz archive (SageMaker expects this format) and upload to S3:
tar -czvf ../model.tar.gz --exclude='.git' .
aws s3 cp ../model.tar.gz s3://sagemaker-fireworks-model-[YOUR_AWS_ACCOUNT_ID]/Qwen3-8B/model.tar.gz
Archiving large models can take several minutes. Ensure sufficient local disk space before creating the archive.
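To confirm the upload before deploying:
aws s3 ls s3://sagemaker-fireworks-model-[YOUR_AWS_ACCOUNT_ID]/Qwen3-8B/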

Step 4: Create an IAM role for SageMaker

  1. In the AWS Console, open IAM → Roles → Create role.
  2. Select AWS service → SageMaker → SageMaker - Execution.
  3. Keep the default policy AmazonSageMakerFullAccess and continue.
  4. Name the role (for example, SageMakerFireworksRole) and create it.
  5. Open the role’s Summary → Add permissions → Create inline policy → JSON, then paste the following. Replace [BUCKET_NAME] with your bucket name (no s3://).
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::[BUCKET_NAME]",
        "arn:aws:s3:::[BUCKET_NAME]/*"
      ]
    }
  ]
}
Name the policy (for example, S3FireworksModelAccess) and create it.
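If you prefer the CLI, a roughly equivalent role can be created without the console. This is a sketch; the file names are illustrative, and the S3 policy JSON above is assumed to be saved locally as s3-policy.json:
# Trust policy allowing SageMaker to assume the role
cat > trust-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "sagemaker.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF
aws iam create-role --role-name SageMakerFireworksRole \
  --assume-role-policy-document file://trust-policy.json
aws iam attach-role-policy --role-name SageMakerFireworksRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerFullAccess
# Attach the S3 access policy from above as an inline policy:
aws iam put-role-policy --role-name SageMakerFireworksRole \
  --policy-name S3FireworksModelAccess \
  --policy-document file://s3-policy.json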

Step 5: Push the Fireworks Docker image to ECR

  1. Create an ECR repository in your target region:
aws ecr create-repository --repository-name fireworks-sagemaker-inference --region [YOUR_REGION]
  2. Tag the Fireworks Docker image downloaded in Step 1:
docker tag [DOWNLOADED_FIREWORKS_IMAGE] \
  [YOUR_AWS_ACCOUNT_ID].dkr.ecr.[YOUR_REGION].amazonaws.com/fireworks-sagemaker-inference:[TAG]
  3. Push the image to ECR:
docker push [YOUR_AWS_ACCOUNT_ID].dkr.ecr.[YOUR_REGION].amazonaws.com/fireworks-sagemaker-inference:[TAG]
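Pushing requires an authenticated Docker session with ECR. If the push is rejected with a permission error, log in and retry (this is the ECR login the Troubleshooting section refers to):
aws ecr get-login-password --region [YOUR_REGION] \
  | docker login --username AWS --password-stdin [YOUR_AWS_ACCOUNT_ID].dkr.ecr.[YOUR_REGION].amazonaws.com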

Step 6: Deploy the SageMaker endpoint

You can deploy a single-replica, multi-replica, or sharded endpoint; refer to the scripts below for details.
Run the env_setup.sh script to set up your local environment, and export FIREWORKS_METERING_KEY in your environment before running the deployment scripts.
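Under the hood, the deployment follows the standard SageMaker create-model / create-endpoint-config / create-endpoint sequence. The sketch below shows a minimal single-replica deployment with the AWS CLI; the model, config, and endpoint names are illustrative, and it assumes the container reads the metering key from the FIREWORKS_METERING_KEY environment variable (the provided scripts handle these details for you).
aws sagemaker create-model \
  --model-name fireworks-qwen3-8b \
  --execution-role-arn arn:aws:iam::[YOUR_AWS_ACCOUNT_ID]:role/SageMakerFireworksRole \
  --primary-container "Image=[YOUR_AWS_ACCOUNT_ID].dkr.ecr.[YOUR_REGION].amazonaws.com/fireworks-sagemaker-inference:[TAG],ModelDataUrl=s3://sagemaker-fireworks-model-[YOUR_AWS_ACCOUNT_ID]/Qwen3-8B/model.tar.gz,Environment={FIREWORKS_METERING_KEY=$FIREWORKS_METERING_KEY}" \
  --region [YOUR_REGION]
aws sagemaker create-endpoint-config \
  --endpoint-config-name fireworks-qwen3-8b-config \
  --production-variants "VariantName=AllTraffic,ModelName=fireworks-qwen3-8b,InstanceType=ml.p4d.24xlarge,InitialInstanceCount=1" \
  --region [YOUR_REGION]
aws sagemaker create-endpoint \
  --endpoint-name fireworks-qwen3-8b \
  --endpoint-config-name fireworks-qwen3-8b-config \
  --region [YOUR_REGION]
# Wait until the endpoint reaches InService (this can take 10+ minutes):
aws sagemaker wait endpoint-in-service --endpoint-name fireworks-qwen3-8b --region [YOUR_REGION]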

Step 7: Test the endpoint

Once deployed, you can test your SageMaker endpoint with the following script: test_endpoint.py
You should see successful responses for both completions and chat APIs.
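As a quick smoke test without the script, you can also call the endpoint directly through SageMaker Runtime. The request below assumes an OpenAI-style chat payload and an illustrative endpoint name; the exact schema accepted by the Fireworks container may differ, so treat it as a sketch and prefer test_endpoint.py for full coverage.
aws sagemaker-runtime invoke-endpoint \
  --endpoint-name fireworks-qwen3-8b \
  --content-type application/json \
  --cli-binary-format raw-in-base64-out \
  --body '{"messages": [{"role": "user", "content": "Say hello."}], "max_tokens": 64}' \
  --region [YOUR_REGION] \
  response.json
cat response.json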

Troubleshooting

Quota or capacity errors

  • Symptom: Endpoint creation fails with quota or capacity messages
  • Fix: Verify your ml.p4d, ml.p5, or other GPU endpoint quotas in the target region; request increases and retry

ECR authentication failures

  • Symptom: docker push fails with permission denied
  • Fix: Re-run ECR login and confirm repository URI, region, and account ID

S3 access denied

  • Symptom: Model fails to download during container startup
  • Fix: Ensure the IAM role's inline policy includes both the bucket ARN and the /* object path, and that your s3_model_path points to the model.tar.gz file

Next steps

  • Integrate your application with the SageMaker endpoint via your preferred SDK
  • Reach out to your Fireworks AI contact for support with optimizing your deployment to your specific workload