After fine-tuning your model, deploy it to make it available for inference.
You can also upload and deploy LoRA models fine-tuned outside of Fireworks. See importing fine-tuned models for details.

Single-LoRA deployment

Deploy your LoRA fine-tuned model with a single command and get performance that matches the base model. This streamlined approach, called live merge, replaces the previous two-step process and performs better than multi-LoRA deployments.

Quick deployment

Deploy your LoRA fine-tuned model with one simple command:
firectl create deployment "accounts/fireworks/models/<MODEL_ID of lora model>"
Your deployment will be ready to use once it completes, with performance that matches the base model.
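Once the deployment is ready, you can query it through Fireworks' OpenAI-compatible chat completions endpoint. The sketch below is a minimal example; the model ID and API key are placeholders to replace with your own values:

# Minimal sketch: querying the deployed model through the
# OpenAI-compatible endpoint. The model ID and API key are
# placeholders, not real values.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="<FIREWORKS_API_KEY>",
)

response = client.chat.completions.create(
    model="accounts/your-account/models/<FINE_TUNED_MODEL_ID>",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)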

Deployment with the Build SDK

You can also deploy your LoRA fine-tuned model using the Build SDK:
from fireworks import LLM

# Deploy a fine-tuned model with on-demand deployment (live merge)
fine_tuned_llm = LLM(
    model="accounts/your-account/models/your-fine-tuned-model-id",
    deployment_type="on-demand",
    id="my-fine-tuned-deployment"  # Simple string identifier
)

# Apply the deployment to ensure it's ready
fine_tuned_llm.apply()

# Use the deployed model
response = fine_tuned_llm.chat.completions.create(
    messages=[{"role": "user", "content": "Hello!"}]
)

# Track deployment in web dashboard
print(f"Track at: {fine_tuned_llm.deployment_url}")
The id parameter can be any simple string; it does not need to follow the full resource-name format "accounts/<account_id>/deployments/<deployment_id>".

Multi-LoRA deployment

If you have multiple fine-tuned versions of the same base model (e.g., you’ve fine-tuned the same model for different use cases, applications, or prototyping), you can share a single base model deployment across these LoRA models to achieve higher utilization.
Multi-LoRA deployment comes with performance tradeoffs. We recommend using it only if you need to serve multiple fine-tunes of the same base model and are willing to trade performance for higher deployment utilization.

Deploy with CLI

1. Create base model deployment

Deploy the base model with addons enabled:
firectl create deployment "accounts/fireworks/models/<MODEL_ID of base model>" --enable-addons
2. Load LoRA addons

Once the deployment is ready, load your LoRA models onto the deployment:
firectl load-lora <FINE_TUNED_MODEL_ID> --deployment <DEPLOYMENT_ID>
You can load multiple LoRA models onto the same deployment by repeating this command with different model IDs.
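If you have several LoRA models to load, a small script can repeat the command for each one. Here is a minimal sketch that shells out to firectl; the model and deployment IDs are placeholders:

# Minimal sketch: loading several LoRA addons onto one shared
# deployment by invoking firectl once per model. All IDs below
# are placeholders.
import subprocess

deployment_id = "<DEPLOYMENT_ID>"
lora_model_ids = [
    "<FINE_TUNED_MODEL_ID_1>",
    "<FINE_TUNED_MODEL_ID_2>",
]

for model_id in lora_model_ids:
    subprocess.run(
        ["firectl", "load-lora", model_id, "--deployment", deployment_id],
        check=True,  # raise if firectl reports an error
    )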

Deploy with the Build SDK

You can also use multi-LoRA deployment with the Build SDK:
from fireworks import LLM

# Create a base model deployment with addons enabled
base_model = LLM(
    model="accounts/fireworks/models/base-model-id",
    deployment_type="on-demand",
    id="shared-base-deployment",  # Simple string identifier
    enable_addons=True
)
base_model.apply()

# Deploy multiple fine-tuned models using the same base deployment
fine_tuned_model_1 = LLM(
    model="accounts/your-account/models/fine-tuned-model-1",
    deployment_type="on-demand-lora",
    base_id=base_model.deployment_id
)

fine_tuned_model_2 = LLM(
    model="accounts/your-account/models/fine-tuned-model-2", 
    deployment_type="on-demand-lora",
    base_id=base_model.deployment_id
)

# Apply deployments
fine_tuned_model_1.apply()
fine_tuned_model_2.apply()

# Use the deployed models
response_1 = fine_tuned_model_1.chat.completions.create(
    messages=[{"role": "user", "content": "Hello from model 1!"}]
)

response_2 = fine_tuned_model_2.chat.completions.create(
    messages=[{"role": "user", "content": "Hello from model 2!"}]
)
When using deployment_type="on-demand-lora", you must provide the base_id parameter, which references the deployment ID of your base model deployment.

When to use multi-LoRA deployment

Use multi-LoRA deployment when you:
  • Need to serve multiple fine-tuned models based on the same base model
  • Want to maximize deployment utilization
  • Can accept some performance tradeoff compared to single-LoRA deployment
  • Are managing multiple variants or experiments of the same model

Serverless deployment

For quick experimentation and prototyping, you can deploy your fine-tuned model to shared serverless infrastructure without managing GPUs.
Not all base models support serverless addons. Check the list of models that support serverless with LoRA to confirm your base model is supported.

Deploy to serverless

Load your fine-tuned model into a serverless deployment:
firectl load-lora <FINE_TUNED_MODEL_ID>
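Once loaded, the model can be queried like any other serverless model. Below is a minimal sketch using the Build SDK, assuming deployment_type="serverless" also resolves LoRA addons loaded onto serverless infrastructure; the model ID is a placeholder:

from fireworks import LLM

# Sketch only: assumes deployment_type="serverless" also resolves
# LoRA addons loaded onto serverless infrastructure. The model ID
# is a placeholder.
serverless_llm = LLM(
    model="accounts/your-account/models/<FINE_TUNED_MODEL_ID>",
    deployment_type="serverless",
)

response = serverless_llm.chat.completions.create(
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)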

Key considerations

  • No hosting costs: Deploying to serverless is free—you only pay per-token usage costs
  • Rate limits: Same rate limits apply as serverless base models
  • Performance: Lower performance than on-demand deployments and the base model
  • Automatic unloading: Unused addons may be automatically unloaded after a week
  • Limit: Deploy up to 100 fine-tuned models to serverless
For production workloads requiring consistent performance, use on-demand deployments instead.

Next steps