(l-onnx-doc-QLinearMatMul)=

# QLinearMatMul

(l-onnx-op-qlinearmatmul-21)=

## QLinearMatMul - 21

### Version

- **name**: [QLinearMatMul (GitHub)](https://github.com/onnx/onnx/blob/main/docs/Operators.md#QLinearMatMul)
- **domain**: `main`
- **since_version**: `21`
- **function**: `False`
- **support_level**: `SupportType.COMMON`
- **shape inference**: `True`

This version of the operator has been available
**since version 21**.

### Summary

Matrix product that behaves like [numpy.matmul](https://numpy.org/doc/stable/reference/generated/numpy.matmul.html).
It consumes two quantized input tensors, their scales and zero points, and the scale and zero point of the output, and computes the quantized output.
The quantization formula is y = saturate((x / y_scale) + y_zero_point).
The quotient (x / y_scale) is rounded to the nearest value, with ties going to even. Refer to https://en.wikipedia.org/wiki/Rounding for details.
Scale and zero point must have the same shape. They must be either a scalar (per tensor) or an N-D tensor
(per row for 'a' and per column for 'b'). Scalar refers to per-tensor quantization, whereas N-D refers to per-row or per-column quantization.
If the input is 2D of shape [M, K], then the zero point and scale tensor may be an M element vector [v_1, v_2, ..., v_M] for per-row quantization
and a K element vector [v_1, v_2, ..., v_K] for per-column quantization.
If the input is an N-D tensor with shape [D1, D2, M, K], then the zero point and scale tensor may have shape [D1, D2, M, 1] for per-row quantization
and shape [D1, D2, 1, K] for per-column quantization.
The product must never overflow, and accumulation may overflow if and only if it is performed in 32 bits.
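
The computation described in the summary can be sketched in NumPy. This is only an illustration under stated assumptions, not the ONNX reference implementation: `qlinear_matmul_sketch` is a hypothetical helper, it handles only `int8`/`uint8` tensors with scalar (per-tensor) scales and zero points, and it takes the output type from `y_zero_point`. The float8 types allowed by the type constraints are not covered.

```python
import numpy as np


def qlinear_matmul_sketch(a, a_scale, a_zp, b, b_scale, b_zp, y_scale, y_zp):
    """Hypothetical per-tensor sketch for int8/uint8 inputs (assumption)."""
    # Remove the zero points in int32 so that 8-bit products cannot overflow.
    a_int = a.astype(np.int32) - np.int32(a_zp)
    b_int = b.astype(np.int32) - np.int32(b_zp)
    # Integer matrix product with numpy.matmul semantics, accumulated in int32.
    acc = np.matmul(a_int, b_int)
    # Real-valued result x of the product of the dequantized inputs.
    x = acc.astype(np.float64) * (a_scale * b_scale)
    # y = saturate(round_half_even(x / y_scale) + y_zero_point)
    out_dtype = np.asarray(y_zp).dtype
    info = np.iinfo(out_dtype)
    q = np.rint(x / y_scale) + int(y_zp)  # np.rint rounds ties to even
    return np.clip(q, info.min, info.max).astype(out_dtype)


a = np.array([[2, 4], [6, 8]], dtype=np.uint8)
b = np.array([[1, 2], [3, 4]], dtype=np.uint8)
y = qlinear_matmul_sketch(
    a, 0.5, np.uint8(0),
    b, 1.0, np.uint8(0),
    4.0, np.uint8(250),
)
print(y)  # [[252 252] [254 255]]
```

In this example x / y_scale is [[1.75, 2.5], [3.75, 5.5]]. The tie 2.5 rounds to 2 and the tie 5.5 rounds to 6, and the last element (6 + 250 = 256) saturates to 255, the `uint8` maximum.
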
### Inputs

- **a** (heterogeneous) - **T1**: N-dimensional quantized matrix a
- **a_scale** (heterogeneous) - **TS**: scale of quantized input a
- **a_zero_point** (heterogeneous) - **T1**: zero point of quantized input a
- **b** (heterogeneous) - **T2**: N-dimensional quantized matrix b
- **b_scale** (heterogeneous) - **TS**: scale of quantized input b
- **b_zero_point** (heterogeneous) - **T2**: zero point of quantized input b
- **y_scale** (heterogeneous) - **TS**: scale of quantized output y
- **y_zero_point** (heterogeneous) - **T3**: zero point of quantized output y

### Outputs

- **y** (heterogeneous) - **T3**: Quantized matrix multiply results from a * b

### Type Constraints

* **TS** in ( `tensor(bfloat16)`, `tensor(float)`, `tensor(float16)` ): Constrain scales.
* **T1** in ( `tensor(float8e4m3fn)`, `tensor(float8e4m3fnuz)`, `tensor(float8e5m2)`, `tensor(float8e5m2fnuz)`, `tensor(int8)`, `tensor(uint8)` ): The type of input a and its zeropoint.
* **T2** in ( `tensor(float8e4m3fn)`, `tensor(float8e4m3fnuz)`, `tensor(float8e5m2)`, `tensor(float8e5m2fnuz)`, `tensor(int8)`, `tensor(uint8)` ): The type of input b and its zeropoint.
* **T3** in ( `tensor(float8e4m3fn)`, `tensor(float8e4m3fnuz)`, `tensor(float8e5m2)`, `tensor(float8e5m2fnuz)`, `tensor(int8)`, `tensor(uint8)` ): The type of the output and its zeropoint.

```{toctree}
text_diff_QLinearMatMul_10_21
```

(l-onnx-op-qlinearmatmul-10)=

## QLinearMatMul - 10

### Version

- **name**: [QLinearMatMul (GitHub)](https://github.com/onnx/onnx/blob/main/docs/Operators.md#QLinearMatMul)
- **domain**: `main`
- **since_version**: `10`
- **function**: `False`
- **support_level**: `SupportType.COMMON`
- **shape inference**: `True`

This version of the operator has been available
**since version 10**.

### Summary

Matrix product that behaves like [numpy.matmul](https://numpy.org/doc/stable/reference/generated/numpy.matmul.html).
It consumes two quantized input tensors, their scales and zero points, and the scale and zero point of the output, and computes the quantized output.
The quantization formula is y = saturate((x / y_scale) + y_zero_point).
The quotient (x / y_scale) is rounded to the nearest value, with ties going to even. Refer to https://en.wikipedia.org/wiki/Rounding for details.
Scale and zero point must have the same shape. They must be either a scalar (per tensor) or an N-D tensor
(per row for 'a' and per column for 'b'). Scalar refers to per-tensor quantization, whereas N-D refers to per-row or per-column quantization.
If the input is 2D of shape [M, K], then the zero point and scale tensor may be an M element vector [v_1, v_2, ..., v_M] for per-row quantization
and a K element vector [v_1, v_2, ..., v_K] for per-column quantization.
If the input is an N-D tensor with shape [D1, D2, M, K], then the zero point and scale tensor may have shape [D1, D2, M, 1] for per-row quantization
and shape [D1, D2, 1, K] for per-column quantization.
The product must never overflow, and accumulation may overflow if and only if it is performed in 32 bits.

### Inputs

- **a** (heterogeneous) - **T1**: N-dimensional quantized matrix a
- **a_scale** (heterogeneous) - **tensor(float)**: scale of quantized input a
- **a_zero_point** (heterogeneous) - **T1**: zero point of quantized input a
- **b** (heterogeneous) - **T2**: N-dimensional quantized matrix b
- **b_scale** (heterogeneous) - **tensor(float)**: scale of quantized input b
- **b_zero_point** (heterogeneous) - **T2**: zero point of quantized input b
- **y_scale** (heterogeneous) - **tensor(float)**: scale of quantized output y
- **y_zero_point** (heterogeneous) - **T3**: zero point of quantized output y

### Outputs

- **y** (heterogeneous) - **T3**: Quantized matrix multiply results from a * b

### Type Constraints

* **T1** in ( `tensor(int8)`, `tensor(uint8)` ): Constrain input a and its zero point data type to 8-bit integer tensor.
* **T2** in ( `tensor(int8)`, `tensor(uint8)` ): Constrain input b and its zero point data type to 8-bit integer tensor.
* **T3** in ( `tensor(int8)`, `tensor(uint8)` ): Constrain output y and its zero point data type to 8-bit integer tensor.
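
The per-row and per-column shape rules from the summary can be illustrated with NumPy broadcasting. The sketch below is a rough illustration for the 2D case only, not the ONNX reference implementation. `dequantize_rows` and `dequantize_cols` are hypothetical helper names, and treating 'a' as per-row quantized and 'b' as per-column quantized follows the summary's wording.

```python
import numpy as np


def dequantize_rows(q, scale, zero_point):
    """Hypothetical helper: one scale and zero point per row of a 2D tensor."""
    # Reshape the M-element vectors to [M, 1] so they broadcast across columns.
    scale = np.asarray(scale, dtype=np.float32).reshape(-1, 1)
    zero_point = np.asarray(zero_point).astype(np.int32).reshape(-1, 1)
    return (q.astype(np.int32) - zero_point) * scale


def dequantize_cols(q, scale, zero_point):
    """Hypothetical helper: one scale and zero point per column of a 2D tensor."""
    # Reshape the vectors to [1, N] so they broadcast across rows.
    scale = np.asarray(scale, dtype=np.float32).reshape(1, -1)
    zero_point = np.asarray(zero_point).astype(np.int32).reshape(1, -1)
    return (q.astype(np.int32) - zero_point) * scale


a = np.array([[10, 12], [20, 24]], dtype=np.uint8)  # per-row quantized
b = np.array([[1, -2], [3, 4]], dtype=np.int8)      # per-column quantized

a_real = dequantize_rows(a, [0.5, 0.25], [10, 20])
b_real = dequantize_cols(b, [2.0, 0.5], [1, 0])
x = np.matmul(a_real, b_real)  # real-valued product before requantization
print(x)  # [[4. 2.] [4. 2.]]
```

The real-valued product `x` would then be requantized with y_scale and y_zero_point using the quantization formula given in the summary.
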