QLinearMatMul - 10 vs 21
The next section compares an older version of the operator with a newer one after both definitions are converted into markdown text. Green (lines marked "+") means an addition in the newer version, red (lines marked "-") means a deletion. Anything else is unchanged.
QLinearMatMul10 → QLinearMatMul21
@@ -1 +1 @@
  Matrix product that behaves like [numpy.matmul](https://numpy.org/doc/stable/reference/generated/numpy.matmul.html).
  It consumes two quantized input tensors, their scales and zero points, scale and zero point of output,
  and computes the quantized output. The quantization formula is y = saturate((x / y_scale) + y_zero_point).
  For (x / y_scale), it is rounding to nearest ties to even. Refer to https://en.wikipedia.org/wiki/Rounding for details.
  Scale and zero point must have same shape. They must be either scalar (per tensor) or N-D tensor
  (per row for 'a' and per column for 'b'). Scalar refers to per tensor quantization whereas N-D refers to per row
  or per column quantization. If the input is 2D of shape [M, K] then zero point and scale tensor may be
  an M element vector [v_1, v_2, ..., v_M] for per row quantization and K element vector of shape [v_1, v_2, ..., v_K]
  for per column quantization. If the input is N-D tensor with shape [D1, D2, M, K] then zero point and scale tensor may
  have shape [D1, D2, M, 1] for per row quantization and shape [D1, D2, 1, K] for per column quantization.
  Production must never overflow, and accumulation may overflow if and only if in 32 bits.
  ### Inputs
  - **a** (heterogeneous) - **T1**:
  N-dimensional quantized matrix a
- - **a_scale** (heterogeneous) - **
+ - **a_scale** (heterogeneous) - **TS**:
  scale of quantized input a
  - **a_zero_point** (heterogeneous) - **T1**:
  zero point of quantized input a
  - **b** (heterogeneous) - **T2**:
  N-dimensional quantized matrix b
- - **b_scale** (heterogeneous) - **
+ - **b_scale** (heterogeneous) - **TS**:
  scale of quantized input b
  - **b_zero_point** (heterogeneous) - **T2**:
  zero point of quantized input b
- - **y_scale** (heterogeneous) - **
+ - **y_scale** (heterogeneous) - **TS**:
  scale of quantized output y
  - **y_zero_point** (heterogeneous) - **T3**:
  zero point of quantized output y
  ### Outputs
  - **y** (heterogeneous) - **T3**:
  Quantized matrix multiply results from a * b
  ### Type Constraints
- * **
+ * **TS** in ( tensor(bfloat16), tensor(float), tensor(float16) ):
- Constrain
+ Constrain scales.
- * **
+ * **T1** in ( tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz), tensor(int8), tensor(uint8) ):
-
+ The type of input a and its zeropoint.
- * **
+ * **T2** in ( tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz), tensor(int8), tensor(uint8) ):
-
+ * **T3** in ( tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz), tensor(int8), tensor(uint8) ):
+
+ The type of the output and its zeropoint.
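The quantization formula and the 32-bit accumulation rule in the operator text above can be illustrated with a small NumPy sketch. This is not the ONNX reference implementation: the helper names `requantize` and `qlinear_matmul_ref` are hypothetical, and the sketch assumes 2D inputs, 8-bit integer tensors, scalar (per tensor) scales and zero points, and float32 scales. Per row and per column quantization, the float8 types, and the 16-bit scale types allowed by **TS** are not covered.

```python
import numpy as np


def requantize(x, y_scale, y_zero_point, dtype=np.uint8):
    """Sketch of y = saturate((x / y_scale) + y_zero_point) for 8-bit integers."""
    info = np.iinfo(dtype)
    # np.rint rounds to the nearest integer, with ties going to the even value.
    q = np.rint(x / y_scale) + y_zero_point
    # Saturate: clamp to the representable range of the output type.
    return np.clip(q, info.min, info.max).astype(dtype)


def qlinear_matmul_ref(a, a_scale, a_zero_point,
                       b, b_scale, b_zero_point,
                       y_scale, y_zero_point, out_dtype=np.uint8):
    """Hypothetical per-tensor reference for 2D int8/uint8 inputs."""
    # Widen to int32 before subtracting zero points, so that products of
    # 8-bit values cannot overflow and sums are accumulated in 32 bits.
    a32 = a.astype(np.int32) - np.int32(a_zero_point)
    b32 = b.astype(np.int32) - np.int32(b_zero_point)
    acc = a32 @ b32
    # Convert the integer accumulator back to real values, then requantize.
    real = acc.astype(np.float32) * (np.float32(a_scale) * np.float32(b_scale))
    return requantize(real, np.float32(y_scale), int(y_zero_point), out_dtype)


# Illustrative values only; scales are powers of two so the arithmetic is exact.
a = np.array([[2, 4], [6, 8]], dtype=np.uint8)
b = np.array([[3, 1], [1, 5]], dtype=np.uint8)
y = qlinear_matmul_ref(a, 0.5, 2, b, 0.25, 1, 0.5, 10)
print(y)
```

With these inputs the zero-point-corrected operands are [[0, 2], [4, 6]] and [[2, 0], [0, 4]], the int32 accumulator is [[0, 8], [8, 24]], and the requantized result is [[10, 12], [12, 16]].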