Overview

This guide shows how to deploy Fireworks AI inference on Amazon SageMaker, including:
  • Preparing your AWS account and checking SageMaker quotas
  • Packaging and uploading your model to Amazon S3
  • Pushing the Fireworks Docker image to Amazon ECR
  • Creating a SageMaker endpoint (single replica, multi-replica, or sharded)
  • Testing the endpoint via SageMaker Runtime
Make sure that you use the same region for all steps in this guide (S3 bucket, ECR repository, and SageMaker endpoint).

Prerequisites

  • Fireworks AI Docker image and metering key; please reach out to [email protected] for more information
  • AWS account with permissions for SageMaker, ECR, and S3
  • AWS CLI installed and configured
  • Docker Desktop installed and running
  • jq for JSON filtering (macOS: brew install jq)
  • git-lfs for large model files (macOS: brew install git-lfs then git lfs install)
  • Optional: uv for Python virtual envs and execution (curl -LsSf https://astral.sh/uv/install.sh | sh)
This guide uses placeholder values for AWS account ID, region, and bucket name (denoted in square brackets). Please replace them with your own values.

Step 1: Obtain the Fireworks Docker image and metering key

  • Reach out to [email protected] to receive a link to the Fireworks AI Docker image, metering key, and to set up billing.
  • Keep the metering key secure; you will set it as an environment variable when deploying.
Store the metering key in a secrets manager or CI/CD secret store.
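For example, with AWS Secrets Manager (the secret name below is illustrative; any secret store works):
aws secretsmanager create-secret \
  --name fireworks/metering-key \
  --secret-string '[YOUR_METERING_KEY]' \
  --region [YOUR_REGION]
# Later, load it into the environment variable used during deployment:
export FIREWORKS_METERING_KEY=$(aws secretsmanager get-secret-value \
  --secret-id fireworks/metering-key \
  --query SecretString --output text --region [YOUR_REGION])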

Step 2: Verify SageMaker GPU service quotas

Check your SageMaker quotas for GPU instances in your target region. Make sure to replace [YOUR_REGION] with your actual region. For best compatibility, instance types with A100, H100, or H200 GPUs are recommended (ml.p4d*, ml.p5*).
aws service-quotas list-service-quotas \
  --service-code sagemaker \
  --region [YOUR_REGION] \
  | jq '.Quotas[] | select(.QuotaName | contains("p4d") and contains("endpoint usage"))'
aws service-quotas list-service-quotas \
  --service-code sagemaker \
  --region [YOUR_REGION] \
  | jq '.Quotas[] | select(.QuotaName | contains("p5") and contains("endpoint usage"))'
Choose an instance type that matches your model’s requirements and your quota (for example, ml.p5* for H100/H200).
Insufficient quota will cause endpoint deployment failures. Request quota increases in advance if needed.
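You can request an increase directly from the CLI; take the QuotaCode from the list output above:
aws service-quotas request-service-quota-increase \
  --service-code sagemaker \
  --quota-code [QUOTA_CODE_FROM_LIST_OUTPUT] \
  --desired-value 1 \
  --region [YOUR_REGION]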

Step 3: Create an S3 bucket and upload model files

  1. Create a bucket in your region:
aws s3 mb s3://sagemaker-fireworks-model-[YOUR_AWS_ACCOUNT_ID] --region [YOUR_REGION]
  2. Download model files from Hugging Face (example: Qwen/Qwen3-8B):
brew install git-lfs
git lfs install
git clone https://huggingface.co/Qwen/Qwen3-8B
cd Qwen3-8B
  3. Add a fireworks.json describing your model configuration:
{
  "model_arch": "auto",
  "model_config_name": "auto",
  "world_size": 1,
  "checkpoint_format": "huggingface"
}
  4. Create a tar.gz archive (SageMaker expects this format) and upload to S3:
tar -czvf ../model.tar.gz --exclude='.git' .
aws s3 cp ../model.tar.gz s3://sagemaker-fireworks-model-[YOUR_AWS_ACCOUNT_ID]/Qwen3-8B/model.tar.gz
Archiving large models can take several minutes. Ensure sufficient local disk space before creating the archive.
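To confirm the upload before deploying:
aws s3 ls s3://sagemaker-fireworks-model-[YOUR_AWS_ACCOUNT_ID]/Qwen3-8B/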

Step 4: Create an IAM role for SageMaker

  1. In the AWS Console, open IAM → Roles → Create role.
  2. Select AWS service → SageMaker → SageMaker - Execution.
  3. Keep the default policy AmazonSageMakerFullAccess and continue.
  4. Name the role (for example, SageMakerFireworksRole) and create it.
  5. Open the role’s Summary → Add permissions → Create inline policy → JSON, then paste the following. Replace [BUCKET_NAME] with your bucket name (no s3://).
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::[BUCKET_NAME]",
        "arn:aws:s3:::[BUCKET_NAME]/*"
      ]
    }
  ]
}
Name the policy (for example, S3FireworksModelAccess) and create it.
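If you prefer the CLI, a roughly equivalent role can be created without the console. This is a sketch; the file names are illustrative, and the S3 policy JSON above is assumed to be saved locally as s3-policy.json:
# Trust policy allowing SageMaker to assume the role
cat > trust-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "sagemaker.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF
aws iam create-role --role-name SageMakerFireworksRole \
  --assume-role-policy-document file://trust-policy.json
aws iam attach-role-policy --role-name SageMakerFireworksRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerFullAccess
# Attach the S3 access policy from above as an inline policy:
aws iam put-role-policy --role-name SageMakerFireworksRole \
  --policy-name S3FireworksModelAccess \
  --policy-document file://s3-policy.json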

Step 5: Push the Fireworks Docker image to ECR

  1. Create an ECR repository in your target region:
aws ecr create-repository --repository-name fireworks-sagemaker-inference --region [YOUR_REGION]
  2. Tag the Fireworks Docker image downloaded in Step 1:
docker tag [DOWNLOADED_FIREWORKS_IMAGE] \
  [YOUR_AWS_ACCOUNT_ID].dkr.ecr.[YOUR_REGION].amazonaws.com/fireworks-sagemaker-inference:[TAG]
  3. Push the image to ECR:
docker push [YOUR_AWS_ACCOUNT_ID].dkr.ecr.[YOUR_REGION].amazonaws.com/fireworks-sagemaker-inference:[TAG]
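Pushing requires an authenticated Docker session with ECR. If the push is rejected with a permission error, log in and retry (this is the ECR login the Troubleshooting section refers to):
aws ecr get-login-password --region [YOUR_REGION] \
  | docker login --username AWS --password-stdin [YOUR_AWS_ACCOUNT_ID].dkr.ecr.[YOUR_REGION].amazonaws.com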

Step 6: Deploy the SageMaker endpoint

You can deploy a single-replica, multi-replica, or sharded endpoint; refer to the scripts below for details.
Run the env_setup.sh script to set up your local environment, and export FIREWORKS_METERING_KEY in your environment before running the deployment scripts.
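Under the hood, the deployment follows the standard SageMaker create-model / create-endpoint-config / create-endpoint sequence. The sketch below shows a minimal single-replica deployment with the AWS CLI; the model, config, and endpoint names are illustrative, and it assumes the container reads the metering key from the FIREWORKS_METERING_KEY environment variable (the provided scripts handle these details for you).
aws sagemaker create-model \
  --model-name fireworks-qwen3-8b \
  --execution-role-arn arn:aws:iam::[YOUR_AWS_ACCOUNT_ID]:role/SageMakerFireworksRole \
  --primary-container "Image=[YOUR_AWS_ACCOUNT_ID].dkr.ecr.[YOUR_REGION].amazonaws.com/fireworks-sagemaker-inference:[TAG],ModelDataUrl=s3://sagemaker-fireworks-model-[YOUR_AWS_ACCOUNT_ID]/Qwen3-8B/model.tar.gz,Environment={FIREWORKS_METERING_KEY=$FIREWORKS_METERING_KEY}" \
  --region [YOUR_REGION]
aws sagemaker create-endpoint-config \
  --endpoint-config-name fireworks-qwen3-8b-config \
  --production-variants "VariantName=AllTraffic,ModelName=fireworks-qwen3-8b,InstanceType=ml.p4d.24xlarge,InitialInstanceCount=1" \
  --region [YOUR_REGION]
aws sagemaker create-endpoint \
  --endpoint-name fireworks-qwen3-8b \
  --endpoint-config-name fireworks-qwen3-8b-config \
  --region [YOUR_REGION]
# Wait until the endpoint reaches InService (this can take 10+ minutes):
aws sagemaker wait endpoint-in-service --endpoint-name fireworks-qwen3-8b --region [YOUR_REGION]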

Step 7: Test the endpoint

Once deployed, you can test your SageMaker endpoint with the following script: test_endpoint.py
You should see successful responses for both completions and chat APIs.
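As a quick smoke test without the script, you can also call the endpoint directly through SageMaker Runtime. The request below assumes an OpenAI-style chat payload and an illustrative endpoint name; the exact schema accepted by the Fireworks container may differ, so treat it as a sketch and prefer test_endpoint.py for full coverage.
aws sagemaker-runtime invoke-endpoint \
  --endpoint-name fireworks-qwen3-8b \
  --content-type application/json \
  --cli-binary-format raw-in-base64-out \
  --body '{"messages": [{"role": "user", "content": "Say hello."}], "max_tokens": 64}' \
  --region [YOUR_REGION] \
  response.json
cat response.json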

Troubleshooting

Quota or capacity errors

  • Symptom: Endpoint creation fails with quota or capacity messages
  • Fix: Verify your ml.p4d, ml.p5, or other GPU endpoint quotas in the target region; request increases and retry

ECR authentication failures

  • Symptom: docker push fails with permission denied
  • Fix: Re-run ECR login and confirm repository URI, region, and account ID

S3 access denied

  • Symptom: Model fails to download during container startup
  • Fix: Ensure the IAM role's inline policy includes both the bucket ARN and the /* object path, and that your s3_model_path points to the model.tar.gz file

Next steps

  • Integrate your application with the SageMaker endpoint via your preferred SDK
  • Reach out to your Fireworks AI contact for support with optimizing your deployment to your specific workload