Quantizing a model
A model can be quantized to 8-bit floating-point (FP8) precision using either firectl or the REST API (Python).
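As a rough sketch of the firectl flow, the quantization request looks something like the following; the exact subcommand and flag names are assumptions here, so verify them against `firectl --help`:

```bash
# Request FP8 quantization of an existing model.
# NOTE: the subcommand name and --precision flag are assumptions,
# not confirmed syntax; check `firectl --help` for the current form.
firectl prepare-model <MODEL_ID> --precision FP8
```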
This is an additive process that enables creating deployments with additional precisions. The original FP16 checkpoint is still available for use.
While quantization is in progress, the model's state will be PREPARING. A successfully prepared model will have the desired precision added to the Precisions list.
Creating an FP8 deployment
By default, creating a dedicated deployment will use the FP16 checkpoint. To see what precisions are available for a model, inspect the model's details.
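For example, with firectl (the output shown is illustrative, not the exact layout):

```bash
# Retrieve the model's details; the Precisions field lists the
# checkpoints the model has been prepared for.
firectl get model <MODEL_ID>
# Example output (abridged, illustrative):
#   Name:        <MODEL_ID>
#   Precisions:  FP16, FP8
```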
The Precisions field will indicate what precisions the model has been prepared for.
To use the quantized FP8 checkpoint, pass the --precision flag:
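A minimal firectl example (the precision value FP8 is an assumption; confirm accepted values with `firectl create deployment --help`):

```bash
# Create a dedicated deployment that serves the quantized FP8 checkpoint
# instead of the default FP16 one.
firectl create deployment <MODEL_ID> --precision FP8
```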
Quantized deployments can only be served using H100 GPUs.