When you have dedicated deployments that were created via firectl or the Fireworks web UI, you can easily connect to them using the Build SDK to run inference. This is particularly useful when you want to leverage existing infrastructure or when deployments are managed by different teams.

Prerequisites

Before you begin, make sure you have:
  • An existing dedicated deployment running on Fireworks
  • The deployment ID or name
  • Your Fireworks API key configured
You can find your deployment ID in the Fireworks dashboard under the deployments section.
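If the API key is not yet configured, one common approach is to set it as an environment variable before constructing any LLM objects. The snippet below assumes the SDK reads FIREWORKS_API_KEY from the environment; adjust it if your setup supplies the key differently:
import os

# Assumption: the Build SDK picks up the key from FIREWORKS_API_KEY.
# Replace the placeholder with your real key, or set it in your shell instead.
os.environ["FIREWORKS_API_KEY"] = "your-api-key"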

Connecting to an existing deployment

To query an existing dedicated deployment, create an LLM instance with deployment_type="on-demand" and provide the deployment ID:
from fireworks import LLM

# Connect to your existing dedicated deployment
llm = LLM(
    model="llama-v3p2-3b-instruct",  # The model your deployment is running
    deployment_type="on-demand",
    id="my-custom-deployment",  # Your deployment ID
)

# Start using the deployment immediately - no .apply() needed
response = llm.chat.completions.create(
    messages=[{"role": "user", "content": "Hello from my dedicated deployment!"}]
)

print(response.choices[0].message.content)
Since you’re connecting to an existing deployment, you don’t need to call .apply() - the deployment is already running and ready to serve requests.
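
For contrast, if you were creating a brand-new deployment through the Build SDK rather than connecting to an existing one, you would call .apply() to provision it first. The sketch below only illustrates that difference and uses a hypothetical deployment ID:
# Creating a NEW deployment, for contrast - .apply() provisions resources
new_llm = LLM(
    model="llama-v3p2-3b-instruct",
    deployment_type="on-demand",
    id="brand-new-deployment",  # hypothetical ID for illustration
)
new_llm.apply()  # provisions the deployment; skip this for existing deployments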

Important considerations

No resource creation

When connecting to existing deployments:
  • No new resources are created - The SDK connects to your existing deployment
  • No .apply() call needed - The deployment is already active
  • Immediate availability - You can start making inference calls right away
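
As a quick sanity check, you can send a minimal request right after constructing the LLM instance; the sketch below reuses the llm object from the earlier example:
# Minimal smoke test to confirm the existing deployment is reachable
response = llm.chat.completions.create(
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=5,
)
print("Deployment responded:", response.choices[0].message.content)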

Deployment ID requirements

The id parameter must exactly match your existing deployment:
  • Use the deployment name/ID as shown in the Fireworks dashboard
  • The ID is case-sensitive and must match exactly
  • If the deployment doesn’t exist, you’ll receive an error when making requests
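
The exact exception type raised for a missing or mistyped deployment ID can vary by SDK version, so the sketch below catches a broad Exception and is meant as an illustration rather than the definitive error-handling pattern:
from fireworks import LLM

llm = LLM(
    model="llama-v3p2-3b-instruct",
    deployment_type="on-demand",
    id="my-custom-deployment",  # must match the dashboard exactly
)

try:
    response = llm.chat.completions.create(
        messages=[{"role": "user", "content": "Hello!"}]
    )
    print(response.choices[0].message.content)
except Exception as exc:  # exact exception class depends on the SDK version
    print(f"Request failed - check that the deployment ID exists: {exc}")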

Model specification

The model parameter is still required, and it must match the model that your deployment is actually running:
# If your deployment is running Llama 3.2 3B Instruct
llm = LLM(
    model="llama-v3p2-3b-instruct",
    deployment_type="on-demand", 
    id="production-llama-deployment"
)

# If your deployment is running Qwen 2.5 72B Instruct
llm = LLM(
    model="qwen2p5-72b-instruct",
    deployment_type="on-demand",
    id="qwen-high-capacity-deployment"
)

Complete example

Here’s a complete example that demonstrates connecting to an existing deployment and using it for a conversation:
from fireworks import LLM

# Connect to existing deployment
llm = LLM(
    model="llama-v3p2-3b-instruct",
    deployment_type="on-demand",
    id="my-existing-deployment",
)

# Use OpenAI-compatible chat completions
response = llm.chat.completions.create(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    max_tokens=150,
    temperature=0.7
)

print(response.choices[0].message.content)
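
If you want incremental output, the chat completions interface is OpenAI-compatible, so streaming should work roughly as sketched below; the stream=True behavior and chunk shape are assumptions based on that compatibility, so verify them against your SDK version:
# Stream tokens as they are generated (OpenAI-compatible streaming)
stream = llm.chat.completions.create(
    messages=[{"role": "user", "content": "Explain quantum computing in simple terms."}],
    max_tokens=150,
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()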

Troubleshooting

Common issues and solutions
  • Deployment not found - Double-check the deployment ID against the Fireworks dashboard; the ID is case-sensitive and must match exactly.
  • Model mismatch - Make sure the model parameter matches the model your deployment is actually serving.
  • Authentication errors - Confirm that your Fireworks API key is configured and valid.

Next steps

Now that you can connect to existing deployments, you might want to explore the rest of the Build SDK documentation, for example creating and managing dedicated deployments directly from code with .apply().