GroupNormalization - 18 vs 21
The next section compares an older version of this operator with a newer one after both definitions are converted into markdown text. Lines marked as additions belong to the newer version, lines marked as deletions belong to the older one; everything else is unchanged.
GroupNormalization18 → GroupNormalization21
RENAMED
```diff
@@ -1 +1 @@
 A GroupNormalization function. Carries out group normalization as described in
 the paper https://arxiv.org/abs/1803.08494
 This operator transforms input according to
 y = scale * (x - mean) / sqrt(variance + epsilon) + bias,
 where the mean and variance are computed per instance per group of channels, and
-scale and bias should be specified for each
+scale and bias should be specified for each channel. The number of
 groups num_groups should be divisible by the number of channels so that there are
 an equal number of channels per group.
+
+The overall computation has two stages: the first stage normalizes the elements to
+have zero mean and unit variance for each instance in each group, and the second
+stage scales and shifts the results of the first stage. The floating-point precision
+used in the first stage is determined by the stash_type attribute. For example,
+if stash_type is 1, the operator casts all input variables to 32-bit float,
+performs the computation, and finally casts the normalized results back to the
+original type of X. The second stage does not depend on stash_type.
 When the number of groups is the same as the number of channels, this operator is
 equivalent to InstanceNormalization. When there is only one group, this operator
 is equivalent to LayerNormalization.
 ### Attributes
 * **epsilon - FLOAT** (default is '1e-05'):
   The epsilon value to use to avoid division by zero.
 * **num_groups - INT** (required) :
   The number of groups of channels. It should be a divisor of the number of channels C.
+* **stash_type - INT** (default is '1'):
+
+  The floating-point precision used in stage one of the computation.
+
 ### Inputs
 - **X** (heterogeneous) - **T**:
   Input data tensor. Dimensions for image cases are (N x C x H x W), where N is the batch size, C is the number of channels, and H and W are the height and width of the data. Statistics are computed for every group of channels over C, H, and W. For non-image cases, the dimensions are in the form of (N x C x D1 x D2 ... Dn).
 - **scale** (heterogeneous) - **T**:
-  Scale tensor of shape (
+  Scale tensor of shape (C).
 - **bias** (heterogeneous) - **T**:
-  Bias tensor of shape (
+  Bias tensor of shape (C).
 ### Outputs
 - **Y** (heterogeneous) - **T**:
   The output tensor of the same shape as X.
 ### Type Constraints
 * **T** in ( tensor(bfloat16), tensor(double), tensor(float), tensor(float16) ):
   Constrain input and output types to float tensors.
```