QuantizeLinear¶
QuantizeLinear - 24¶
Version¶
name: QuantizeLinear (GitHub)
domain: main
since_version: 24
function: False
support_level: SupportType.COMMON
shape inference: True
This version of the operator has been available since version 24.
Summary¶
The linear quantization operator consumes a high-precision tensor, a scale, and a zero point to compute the
low-precision/quantized tensor. The scale factor and zero point must have the same shape, determining the quantization
granularity. The quantization formula is y = saturate((x / y_scale) + y_zero_point).
Saturation is done according to:
uint16: [0, 65535]
int16: [-32768, 32767]
uint8: [0, 255]
int8: [-128, 127]
uint4: [0, 15]
int4: [-8, 7]
For (x / y_scale), it rounds to the nearest even. Refer to https://en.wikipedia.org/wiki/Rounding for details.
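As a rough illustration of the formula, the saturation ranges, and the rounding rule above, the following is a minimal pure-Python sketch of per-tensor quantization to the integer types. The names `RANGES` and `quantize_linear` are hypothetical and not part of ONNX, and the sketch ignores float8/float4 targets and the `precision` attribute.

```python
# Saturation ranges from the list above, keyed by type name.
RANGES = {
    "uint16": (0, 65535),
    "int16": (-32768, 32767),
    "uint8": (0, 255),
    "int8": (-128, 127),
    "uint4": (0, 15),
    "int4": (-8, 7),
}


def quantize_linear(x, y_scale, y_zero_point=0, dtype="uint8"):
    """Per-tensor sketch of y = saturate((x / y_scale) + y_zero_point)."""
    lo, hi = RANGES[dtype]
    out = []
    for value in x:
        # Python's built-in round() rounds ties to the nearest even integer.
        q = round(value / y_scale) + y_zero_point
        # Saturate: clamp to the representable range of the target type.
        out.append(max(lo, min(hi, q)))
    return out


print(quantize_linear([0.0, 2.0, 3.0, 1000.0, -254.0, -1000.0], 2.0, 128))
# [128, 129, 130, 255, 1, 0]
```

Here 3.0 / 2.0 = 1.5 is a tie and rounds to the even value 2, while 1000.0 and -1000.0 saturate to the ends of the uint8 range.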
y_zero_point and y must have the same type. y_zero_point is usually not used for quantization to float8 and 4-bit types, but the quantization
formula remains the same for consistency, and the type of the input y_zero_point still determines the quantization type.
x and y_scale are allowed to have different types. The type of y_scale determines the precision of the division operation between x and
y_scale, unless the precision attribute is specified.
There are three supported quantization granularities, determined by the shape of y_scale.
In all cases, y_zero_point must have the same shape as y_scale.
Per-tensor (per-layer) quantization: y_scale is a scalar.
Per-axis quantization: The scale must be a 1-D tensor, with the length of the quantization axis. For an input shape (D0, ..., Di, ..., Dn) and axis=i, y_scale is a 1-D tensor of length Di.
Blocked quantization: The scale’s shape is identical to the input’s shape, except for one dimension, in which blocking is performed. Given x shape (D0, ..., Di, ..., Dn), axis=i, and block size B: y_scale shape is (D0, ..., ceil(Di/B), ..., Dn).
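The three granularities differ only in the shape expected for y_scale. A small sketch of that shape rule follows; `expected_scale_shape` and its `granularity` argument are hypothetical names used for illustration, not an ONNX API.

```python
import math


def expected_scale_shape(x_shape, granularity, axis=1, block_size=0):
    """Return the y_scale shape implied by a granularity (illustrative helper)."""
    rank = len(x_shape)
    if granularity == "per_tensor":
        return ()  # a scalar has an empty shape
    # A negative axis counts dimensions from the back.
    axis = axis + rank if axis < 0 else axis
    if granularity == "per_axis":
        return (x_shape[axis],)
    if granularity == "blocked":
        shape = list(x_shape)
        # Only the blocked dimension shrinks: Di becomes ceil(Di / B).
        shape[axis] = math.ceil(x_shape[axis] / block_size)
        return tuple(shape)
    raise ValueError(f"unknown granularity: {granularity}")


print(expected_scale_shape((2, 6, 4), "per_tensor"))
print(expected_scale_shape((2, 6, 4), "per_axis", axis=1))
print(expected_scale_shape((2, 6, 4), "blocked", axis=1, block_size=4))
```

In every case y_zero_point, when supplied, would be expected to have the same shape as the returned tuple.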
Attributes¶
axis - INT (default is '1'):
(Optional) The axis of the quantization dimension of the input tensor. Used only for per-axis and blocked quantization. A negative value means counting dimensions from the back. Accepted range is [-r, r-1] where r = rank(input). When the rank of the input is 1, per-tensor quantization is applied, rendering the axis unnecessary in this scenario.
block_size - INT (default is '0'):
(Optional) The size of the quantization block (number of times every scale is replicated). Used only for blocked quantization. The block size is a positive integer. Given x shape (D0, ..., Di, ..., Dn), y_scale shape (S0, ..., Si, ..., Sn) and axis=i, the accepted range is [ceil(Di/Si), ceil(Di/(Si-1))-1].
output_dtype - INT (default is '0'):
(Optional) The output data type. If not supplied, the output data type is inferred from the y_zero_point data type (T3). If neither output_dtype nor y_zero_point is supplied, the output data type is uint8. If both output_dtype and y_zero_point are specified, output_dtype must be T3.
precision - INT (default is '0'):
(Optional) The precision of the division operation between x and y_scale. If not provided, it will be the same as the type of y_scale.
saturate - INT (default is '1'):
The parameter defines how the conversion behaves if an input value is out of range of the destination type. It only applies for float 8 quantization (float8e4m3fn, float8e4m3fnuz, float8e5m2, float8e5m2fnuz). It is true by default. All cases are fully described in two tables inserted in the operator description.
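The accepted block_size range can be checked numerically. The sketch below evaluates the bounds [ceil(Di/Si), ceil(Di/(Si-1))-1] for a given dimension and scale length; `block_size_range` is a hypothetical helper, and returning no upper bound when Si is 1 is an assumption of this sketch, since the formula divides by zero in that case.

```python
import math


def block_size_range(d_i, s_i):
    """Return (lower, upper) bounds on block_size for dimension Di and scale length Si."""
    lower = math.ceil(d_i / s_i)
    if s_i == 1:
        # Assumption: with a single block the formula's upper bound is undefined.
        return lower, None
    upper = math.ceil(d_i / (s_i - 1)) - 1
    return lower, upper


lower, upper = block_size_range(6, 2)
print(lower, upper)
# Every block size in the range reproduces the scale length Si = 2.
print([math.ceil(6 / b) for b in range(lower, upper + 1)])
```

For Di = 6 and Si = 2 the bounds are 3 and 5, and each of the block sizes 3, 4, and 5 gives ceil(6/B) = 2.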
Inputs¶
Between 2 and 3 inputs.
x (heterogeneous) - T1:
N-D full precision Input tensor to be quantized.
y_scale (heterogeneous) - T2:
Scale for doing quantization to get y. For per-tensor/layer quantization the scale is a scalar, for per-axis quantization it is a 1-D Tensor and for blocked quantization it has the same shape as the input, except for one dimension in which blocking is performed.
y_zero_point (optional, heterogeneous) - T3:
Zero point for doing quantization to get y. Shape must match y_scale. Default is uint8 with zero point of 0 if it’s not specified.
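The way the output type follows from the output_dtype attribute and the optional y_zero_point input can be sketched as a small decision function. This is an illustration only: `resolve_output_dtype` is a hypothetical name, types are represented as strings, and `None` stands in for "not supplied" (the attribute itself uses the integer default 0).

```python
def resolve_output_dtype(output_dtype=None, zero_point_dtype=None):
    """Pick the output type from output_dtype and the y_zero_point type (sketch)."""
    if output_dtype is None and zero_point_dtype is None:
        return "uint8"  # documented default when neither is supplied
    if output_dtype is None:
        return zero_point_dtype  # inferred from y_zero_point
    if zero_point_dtype is not None and zero_point_dtype != output_dtype:
        # When both are given they must agree.
        raise ValueError("output_dtype must match the y_zero_point type")
    return output_dtype


print(resolve_output_dtype())
print(resolve_output_dtype(zero_point_dtype="int8"))
print(resolve_output_dtype(output_dtype="int4"))
```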
Outputs¶
y (heterogeneous) - T3:
N-D quantized output tensor. It has the same shape as input x.
Type Constraints¶
T1 in (tensor(bfloat16), tensor(float), tensor(float16), tensor(int32)):
The type of the input ‘x’.
T2 in (tensor(bfloat16), tensor(float), tensor(float16), tensor(float8e8m0), tensor(int32)):
The type of the input ‘y_scale’.
T3 in (tensor(float4e2m1), tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz), tensor(int16), tensor(int4), tensor(int8), tensor(uint16), tensor(uint4), tensor(uint8)):
The type of the input y_zero_point and the output y.
QuantizeLinear - 23¶
Version¶
name: QuantizeLinear (GitHub)
domain: main
since_version: 23
function: False
support_level: SupportType.COMMON
shape inference: True
This version of the operator has been available since version 23.
Summary¶
The linear quantization operator consumes a high-precision tensor, a scale, and a zero point to compute the
low-precision/quantized tensor. The scale factor and zero point must have the same shape, determining the quantization
granularity. The quantization formula is y = saturate((x / y_scale) + y_zero_point).
Saturation is done according to:
uint16: [0, 65535]
int16: [-32768, 32767]
uint8: [0, 255]
int8: [-128, 127]
uint4: [0, 15]
int4: [-8, 7]
For (x / y_scale), it rounds to the nearest even. Refer to https://en.wikipedia.org/wiki/Rounding for details.
y_zero_point and y must have the same type. y_zero_point is usually not used for quantization to float8 and 4-bit types, but the quantization
formula remains the same for consistency, and the type of the input y_zero_point still determines the quantization type.
x and y_scale are allowed to have different types. The type of y_scale determines the precision of the division operation between x and
y_scale, unless the precision attribute is specified.
There are three supported quantization granularities, determined by the shape of y_scale.
In all cases, y_zero_point must have the same shape as y_scale.
Per-tensor (per-layer) quantization:
y_scale is a scalar.
Per-axis quantization: The scale must be a 1-D tensor, with the length of the quantization axis. For an input shape (D0, ..., Di, ..., Dn) and axis=i, y_scale is a 1-D tensor of length Di.
Blocked quantization: The scale’s shape is identical to the input’s shape, except for one dimension, in which blocking is performed. Given x shape (D0, ..., Di, ..., Dn), axis=i, and block size B: y_scale shape is (D0, ..., ceil(Di/B), ..., Dn).
Attributes¶
axis - INT (default is '1'):
(Optional) The axis of the quantization dimension of the input tensor. Used only for per-axis and blocked quantization. A negative value means counting dimensions from the back. Accepted range is [-r, r-1] where r = rank(input). When the rank of the input is 1, per-tensor quantization is applied, rendering the axis unnecessary in this scenario.
block_size - INT (default is '0'):
(Optional) The size of the quantization block (number of times every scale is replicated). Used only for blocked quantization. The block size is a positive integer. Given x shape (D0, ..., Di, ..., Dn), y_scale shape (S0, ..., Si, ..., Sn) and axis=i, the accepted range is [ceil(Di/Si), ceil(Di/(Si-1))-1].
output_dtype - INT (default is '0'):
(Optional) The output data type. If not supplied, the output data type is inferred from the y_zero_point data type (T3). If neither output_dtype nor y_zero_point is supplied, the output data type is uint8. If both output_dtype and y_zero_point are specified, output_dtype must be T3.
precision - INT (default is '0'):
(Optional) The precision of the division operation between x and y_scale. If not provided, it will be the same as the type of y_scale.
saturate - INT (default is '1'):
The parameter defines how the conversion behaves if an input value is out of range of the destination type. It only applies for float 8 quantization (float8e4m3fn, float8e4m3fnuz, float8e5m2, float8e5m2fnuz). It is true by default. All cases are fully described in two tables inserted in the operator description.
Inputs¶
Between 2 and 3 inputs.
x (heterogeneous) - T1:
N-D full precision Input tensor to be quantized.
y_scale (heterogeneous) - T2:
Scale for doing quantization to get y. For per-tensor/layer quantization the scale is a scalar, for per-axis quantization it is a 1-D Tensor and for blocked quantization it has the same shape as the input, except for one dimension in which blocking is performed.
y_zero_point (optional, heterogeneous) - T3:
Zero point for doing quantization to get y. Shape must match y_scale. Default is uint8 with zero point of 0 if it’s not specified.
Outputs¶
y (heterogeneous) - T3:
N-D quantized output tensor. It has the same shape as input x.
Type Constraints¶
T1 in (tensor(bfloat16), tensor(float), tensor(float16), tensor(int32)):
The type of the input ‘x’.
T2 in (tensor(bfloat16), tensor(float), tensor(float16), tensor(int32)):
The type of the input ‘y_scale’.
T3 in (tensor(float4e2m1), tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz), tensor(int16), tensor(int4), tensor(int8), tensor(uint16), tensor(uint4), tensor(uint8)):
The type of the input y_zero_point and the output y.
QuantizeLinear - 21¶
Version¶
name: QuantizeLinear (GitHub)
domain: main
since_version: 21
function: False
support_level: SupportType.COMMON
shape inference: True
This version of the operator has been available since version 21.
Summary¶
The linear quantization operator consumes a high-precision tensor, a scale, and a zero point to compute the
low-precision/quantized tensor. The scale factor and zero point must have the same shape, determining the quantization
granularity. The quantization formula is y = saturate((x / y_scale) + y_zero_point).
Saturation is done according to:
uint16: [0, 65535]
int16: [-32768, 32767]
uint8: [0, 255]
int8: [-128, 127]
uint4: [0, 15]
int4: [-8, 7]
For (x / y_scale), it rounds to the nearest even. Refer to https://en.wikipedia.org/wiki/Rounding for details.
y_zero_point and y must have the same type. y_zero_point is usually not used for quantization to float8 types, but the quantization formula remains the same for consistency, and the type of the input y_zero_point still determines the quantization type.
There are three supported quantization granularities, determined by the shape of y_scale.
In all cases, y_zero_point must have the same shape as y_scale.
Per-tensor (per-layer) quantization: y_scale is a scalar.
Per-axis quantization: The scale must be a 1-D tensor, with the length of the quantization axis. For an input shape (D0, ..., Di, ..., Dn) and axis=i, y_scale is a 1-D tensor of length Di.
Blocked quantization: The scale’s shape is identical to the input’s shape, except for one dimension, in which blocking is performed. Given x shape (D0, ..., Di, ..., Dn), axis=i, and block size B: y_scale shape is (D0, ..., ceil(Di/B), ..., Dn).
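To make blocked quantization concrete, here is a minimal sketch for a 2-D input stored as nested lists, blocked along axis=1. It assumes an int8 target and a zero point of 0 everywhere; `quantize_blocked_rows` is a hypothetical helper, not an ONNX function. Each scale covers `block_size` consecutive elements of a row, so element j uses the scale at index j // block_size.

```python
def quantize_blocked_rows(x, y_scale, block_size, lo=-128, hi=127):
    """Blocked quantization of a 2-D list along axis=1, zero point 0 (sketch)."""
    out = []
    for row, scales in zip(x, y_scale):
        out.append([
            # Element j belongs to block j // block_size of its row.
            max(lo, min(hi, round(value / scales[j // block_size])))
            for j, value in enumerate(row)
        ])
    return out


x = [[1.0, 2.0, 3.0, 40.0, 50.0]]   # shape (1, 5)
y_scale = [[1.0, 10.0, 25.0]]       # shape (1, ceil(5 / 2)) = (1, 3)
print(quantize_blocked_rows(x, y_scale, block_size=2))
# [[1, 2, 0, 4, 2]]
```

The last block holds a single element because 5 is not a multiple of the block size, which is why the scale length is ceil(Di/B) and not Di/B.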
Attributes¶
axis - INT (default is '1'):
(Optional) The axis of the quantization dimension of the input tensor. Used only for per-axis and blocked quantization. A negative value means counting dimensions from the back. Accepted range is [-r, r-1] where r = rank(input). When the rank of the input is 1, per-tensor quantization is applied, rendering the axis unnecessary in this scenario.
block_size - INT (default is '0'):
(Optional) The size of the quantization block (number of times every scale is replicated). Used only for blocked quantization. The block size is a positive integer. Given x shape (D0, ..., Di, ..., Dn), y_scale shape (S0, ..., Si, ..., Sn) and axis=i, the accepted range is [ceil(Di/Si), ceil(Di/(Si-1))-1].
output_dtype - INT (default is '0'):
(Optional) The output data type. If not supplied, the output data type is inferred from the y_zero_point data type (T2). If neither output_dtype nor y_zero_point is supplied, the output data type is uint8. If both output_dtype and y_zero_point are specified, output_dtype must be T2.
saturate - INT (default is '1'):
The parameter defines how the conversion behaves if an input value is out of range of the destination type. It only applies for float 8 quantization (float8e4m3fn, float8e4m3fnuz, float8e5m2, float8e5m2fnuz). It is true by default. All cases are fully described in two tables inserted in the operator description.
Inputs¶
Between 2 and 3 inputs.
x (heterogeneous) - T1:
N-D full precision Input tensor to be quantized.
y_scale (heterogeneous) - T1:
Scale for doing quantization to get y. For per-tensor/layer quantization the scale is a scalar, for per-axis quantization it is a 1-D Tensor and for blocked quantization it has the same shape as the input, except for one dimension in which blocking is performed.
y_zero_point (optional, heterogeneous) - T2:
Zero point for doing quantization to get y. Shape must match y_scale. Default is uint8 with zero point of 0 if it’s not specified.
Outputs¶
y (heterogeneous) - T2:
N-D quantized output tensor. It has the same shape as input x.
Type Constraints¶
T1 in (tensor(bfloat16), tensor(float), tensor(float16), tensor(int32)):
The type of the input ‘x’.
T2 in (tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz), tensor(int16), tensor(int4), tensor(int8), tensor(uint16), tensor(uint4), tensor(uint8)):
The type of the input y_zero_point and the output y.
QuantizeLinear - 19¶
Version¶
name: QuantizeLinear (GitHub)
domain: main
since_version: 19
function: False
support_level: SupportType.COMMON
shape inference: True
This version of the operator has been available since version 19.
Summary¶
The linear quantization operator. It consumes a high precision tensor, a scale, and a zero point to compute the low precision / quantized tensor.
The scale factor and zero point must have same shape, and can be either a scalar for per-tensor / per layer quantization, or a 1-D tensor for per-axis quantization.
The quantization formula is y = saturate ((x / y_scale) + y_zero_point).
For saturation, it saturates to [0, 255] if it’s uint8, or [-128, 127] if it’s int8.
For (x / y_scale), it’s rounding to the nearest even. Refer to https://en.wikipedia.org/wiki/Rounding for details.
‘y_zero_point’ and ‘y’ must have the same type.
‘y_zero_point’ is usually not used for quantization to float8e4m3fn, float8e4m3fnuz, float8e5m2, float8e5m2fnuz,
but the quantization formula remains the same for consistency and
the type of the input ‘y_zero_point’ still determines the quantization type.
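Per-axis quantization applies a separate scale and zero point to each slice along the chosen axis. The following is a minimal sketch for a 2-D input held as nested lists with a uint8 target; `quantize_per_axis` is a hypothetical helper written for illustration, not part of ONNX.

```python
def quantize_per_axis(x, y_scale, y_zero_point, axis, lo=0, hi=255):
    """Per-axis quantization of a 2-D list; axis is 0 (rows) or 1 (columns)."""
    out = []
    for i, row in enumerate(x):
        out_row = []
        for j, value in enumerate(row):
            # Select the scale/zero point by position along the quantization axis.
            k = i if axis == 0 else j
            q = round(value / y_scale[k]) + y_zero_point[k]
            out_row.append(max(lo, min(hi, q)))
        out.append(out_row)
    return out


x = [[1.0, 2.0], [30.0, 40.0]]
print(quantize_per_axis(x, [1.0, 10.0], [0, 100], axis=0))
# [[1, 2], [103, 104]]
print(quantize_per_axis(x, [1.0, 10.0], [0, 100], axis=1))
# [[1, 100], [30, 104]]
```

With axis=0 each row has its own parameters; with axis=1 each column does, so the same inputs give different results.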
Attributes¶
axis - INT (default is '1'):
(Optional) The axis of the quantization dimension of the input tensor. Ignored for per-tensor quantization. Negative value means counting dimensions from the back. Accepted range is [-r, r-1] where r = rank(input).
saturate - INT (default is '1'):
The parameter defines how the conversion behaves if an input value is out of range of the destination type. It only applies for float 8 quantization (float8e4m3fn, float8e4m3fnuz, float8e5m2, float8e5m2fnuz). It is true by default. All cases are fully described in two tables inserted in the operator description.
Inputs¶
Between 2 and 3 inputs.
x (heterogeneous) - T1:
N-D full precision Input tensor to be quantized.
y_scale (heterogeneous) - T1:
Scale for doing quantization to get ‘y’. It can be a scalar, which means per-tensor/layer quantization, or a 1-D Tensor for per-axis quantization.
y_zero_point (optional, heterogeneous) - T2:
Zero point for doing quantization to get ‘y’. Shape must match y_scale. Default is uint8 with zero point of 0 if it’s not specified.
Outputs¶
y (heterogeneous) - T2:
N-D quantized output tensor. It has same shape as input ‘x’.
Type Constraints¶
T1 in (tensor(bfloat16), tensor(float), tensor(float16), tensor(int32)):
Constrain ‘x’ to float, float16, bfloat16 or int32 tensor.
T2 in (tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz), tensor(int8), tensor(uint8)):
Constrain ‘y_zero_point’ and ‘y’ to 8-bit integer/float tensor.
QuantizeLinear - 13¶
Version¶
name: QuantizeLinear (GitHub)
domain: main
since_version: 13
function: False
support_level: SupportType.COMMON
shape inference: True
This version of the operator has been available since version 13.
Summary¶
The linear quantization operator. It consumes a high precision tensor, a scale, and a zero point to compute the low precision / quantized tensor. The scale factor and zero point must have same shape, and can be either a scalar for per-tensor / per layer quantization, or a 1-D tensor for per-axis quantization. The quantization formula is y = saturate ((x / y_scale) + y_zero_point). For saturation, it saturates to [0, 255] if it’s uint8, or [-128, 127] if it’s int8. For (x / y_scale), it’s rounding to the nearest even. Refer to https://en.wikipedia.org/wiki/Rounding for details. ‘y_zero_point’ and ‘y’ must have same type.
Attributes¶
axis - INT (default is '1'):
(Optional) The axis of the quantization dimension of the input tensor. Ignored for per-tensor quantization. Negative value means counting dimensions from the back. Accepted range is [-r, r-1] where r = rank(input).
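The accepted range and the negative-index convention for axis can be sketched as a small validation step. `normalize_axis` is a hypothetical helper name; raising `ValueError` for out-of-range values is a choice of this sketch, not behaviour stated by the specification.

```python
def normalize_axis(axis, rank):
    """Map an axis in [-rank, rank - 1] to its non-negative equivalent."""
    if not -rank <= axis <= rank - 1:
        raise ValueError(f"axis {axis} is outside [-{rank}, {rank - 1}]")
    # Negative values count dimensions from the back.
    return axis + rank if axis < 0 else axis


print(normalize_axis(-1, 4))  # 3
print(normalize_axis(1, 4))   # 1
```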
Inputs¶
Between 2 and 3 inputs.
x (heterogeneous) - T1:
N-D full precision Input tensor to be quantized.
y_scale (heterogeneous) - tensor(float):
Scale for doing quantization to get ‘y’. It can be a scalar, which means per-tensor/layer quantization, or a 1-D Tensor for per-axis quantization.
y_zero_point (optional, heterogeneous) - T2:
Zero point for doing quantization to get ‘y’. Shape must match y_scale. Default is uint8 with zero point of 0 if it’s not specified.
Outputs¶
y (heterogeneous) - T2:
N-D quantized output tensor. It has same shape as input ‘x’.
Type Constraints¶
T1 in (tensor(float), tensor(int32)):
Constrain ‘x’ to float or int32 tensor.
T2 in (tensor(int8), tensor(uint8)):
Constrain ‘y_zero_point’ and ‘y’ to 8-bit integer tensor.
QuantizeLinear - 10¶
Version¶
name: QuantizeLinear (GitHub)
domain: main
since_version: 10
function: False
support_level: SupportType.COMMON
shape inference: True
This version of the operator has been available since version 10.
Summary¶
The linear per-tensor/layer quantization operator. It consumes a high precision tensor, a scale, a zero point to compute the low precision / quantized tensor. The quantization formula is y = saturate ((x / y_scale) + y_zero_point). For saturation, it saturates to [0, 255] if it’s uint8, or [-128, 127] if it’s int8. For (x / y_scale), it’s rounding to the nearest even. Refer to https://en.wikipedia.org/wiki/Rounding for details. ‘y_zero_point’ and ‘y’ must have same type.
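Rounding to the nearest even matters only for exact ties such as 0.5 or 2.5. The short comparison below contrasts it with the more familiar round-half-up rule; `round_half_up` is a hypothetical helper included only for contrast, and Python's built-in `round()` is used here as a stand-in for the operator's rounding.

```python
import math


def round_half_up(value):
    """Round ties toward positive infinity (shown only for comparison)."""
    return math.floor(value + 0.5)


ties = [0.5, 1.5, 2.5, -0.5, -1.5]
# Built-in round() sends ties to the nearest even integer.
print([round(t) for t in ties])          # [0, 2, 2, 0, -2]
print([round_half_up(t) for t in ties])  # [1, 2, 3, 0, -1]
```

Non-tie values such as 1.4 or 1.6 round identically under both rules.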
Inputs¶
Between 2 and 3 inputs.
x (heterogeneous) - T1:
N-D full precision Input tensor to be quantized.
y_scale (heterogeneous) - tensor(float):
Scale for doing quantization to get ‘y’. It’s a scalar, which means a per-tensor/layer quantization.
y_zero_point (optional, heterogeneous) - T2:
Zero point for doing quantization to get ‘y’. It’s a scalar, which means a per-tensor/layer quantization. Default value is uint8 typed 0 if it’s not specified.
Outputs¶
y (heterogeneous) - T2:
N-D quantized output tensor. It has same shape as input ‘x’.
Type Constraints¶
T1 in (tensor(float), tensor(int32)):
Constrain ‘x’ to float or int32 tensor.
T2 in (tensor(int8), tensor(uint8)):
Constrain ‘y_zero_point’ and ‘y’ to 8-bit integer tensor.