## BatchNormalization - 6 vs 14
The next section compares an older version of the operator with a newer one after both definitions are converted to markdown text. Lines prefixed with `+` are additions in the newer version, lines prefixed with `-` are deletions from the older one, and everything else is unchanged.
**BatchNormalization6 → BatchNormalization14** (RENAMED)

```diff
@@ -1 +1 @@
 Carries out batch normalization as described in the paper
 https://arxiv.org/abs/1502.03167. Depending on the mode it is being run,
+There are five required inputs 'X', 'scale', 'B', 'input_mean' and
+'input_var'.
+Note that 'input_mean' and 'input_var' are expected to be the estimated
+statistics in inference mode (training_mode=False, default),
+and the running statistics in training mode (training_mode=True).
-
+There are multiple cases for the number of outputs, which we list below:
-Output case #1: Y,
-Output case #2: Y (
+Output case #1: Y, running_mean, running_var (training_mode=True)
+Output case #2: Y (training_mode=False)
+
+When training_mode=False, extra outputs are invalid.
+The outputs are updated as follows when training_mode=True:
+
+running_mean = input_mean * momentum + current_mean * (1 - momentum)
+running_var = input_var * momentum + current_var * (1 - momentum)
+
+Y = (X - current_mean) / sqrt(current_var + epsilon) * scale + B
+
+where:
+
+current_mean = ReduceMean(X, axis=all_except_channel_index)
+current_var = ReduceVar(X, axis=all_except_channel_index)
+
+Notice that ReduceVar refers to the population variance, and it equals to
+sum(sqrd(x_i - x_avg)) / N
+where N is the population size (this formula does not use sample size N - 1).
+
+When training_mode=False:
+
+Y = (X - input_mean) / sqrt(input_var + epsilon) * scale + B
+
+For previous (depreciated) non-spatial cases, implementors are suggested
+to flatten the input shape to (N x C * D1 * D2 * ... * Dn) before a BatchNormalization Op.
+This operator has **optional** inputs/outputs. See [ONNX IR](https://github.com/onnx/onnx/blob/main/docs/IR.md) for more details about the representation of optional arguments. An empty string may be used in the place of an actual argument's name to indicate a missing argument. Trailing optional arguments (those not followed by an argument that is present) may also be simply omitted.
 ### Attributes
 * **epsilon - FLOAT** (default is '1e-05'):
-The epsilon value to use to avoid division by zero
+The epsilon value to use to avoid division by zero.
-
-* **is_test - INT** (default is '0'):
-
-If set to nonzero, run spatial batch normalization in test mode, default is 0.
 * **momentum - FLOAT** (default is '0.9'):
-Factor used in computing the running mean and variance.e.g., running_mean = running_mean * momentum + mean * (1 - momentum)
+Factor used in computing the running mean and variance.e.g., running_mean = running_mean * momentum + mean * (1 - momentum).
-* **
+* **training_mode - INT** (default is '0'):
-If true,
+If set to true, it indicates BatchNormalization is being used for training, and outputs 1, 2, 3, and 4 would be populated.
 ### Inputs
 - **X** (heterogeneous) - **T**:
-Input data tensor from the previous operator; dimensions
+Input data tensor from the previous operator; dimensions are in the form of (N x C x D1 x D2 ... Dn), where N is the batch size, C is the number of channels. Statistics are computed for every channel of C over N and D1 to Dn dimensions. For image data, input dimensions become (N x C x H x W). The op also accepts single dimension input of size N in which case C is assumed to be 1
 - **scale** (heterogeneous) - **T**:
-
+Scale tensor of shape (C).
 - **B** (heterogeneous) - **T**:
-
+Bias tensor of shape (C).
-- **
+- **input_mean** (heterogeneous) - **U**:
-
+running (training) or estimated (testing) mean tensor of shape (C).
-- **
+- **input_var** (heterogeneous) - **U**:
-
+running (training) or estimated (testing) variance tensor of shape (C).
 ### Outputs
-Between 1 and
+Between 1 and 3 outputs.
 - **Y** (heterogeneous) - **T**:
-The output tensor of the same shape as X
+The output tensor of the same shape as X
-- **
+- **running_mean** (optional, heterogeneous) - **U**:
-The running mean after the BatchNormalization operator.
+The running mean after the BatchNormalization operator.
-- **
+- **running_var** (optional, heterogeneous) - **U**:
+The running variance after the BatchNormalization operator. This op uses the population size (N) for calculating variance, and not the sample size N-1.
-The running variance after the BatchNormalization operator. Must be in-place with the input var. Should not be used for testing.
-- **saved_mean** (optional, heterogeneous) - **T**:
-
-Saved mean used during training to speed up gradient computation. Should not be used for testing.
-- **saved_var** (optional, heterogeneous) - **T**:
-
-Saved variance used during training to speed up gradient computation. Should not be used for testing.
 ### Type Constraints
-* **T** in ( tensor(double), tensor(float), tensor(float16) ):
+* **T** in ( tensor(bfloat16), tensor(double), tensor(float), tensor(float16) ):
+* **U** in ( tensor(bfloat16), tensor(double), tensor(float), tensor(float16) ):
+
-Constrain
+Constrain mean and variance types to float tensors. It allows all float type for U.
```