BatchNormalization - 6 vs 14

The next section compares an older version of the same operator with a newer one, after both definitions are converted into markdown text. Additions to the newer version are shown in green with a + prefix; deletions are shown in red with a - prefix. Anything else is unchanged.

BatchNormalization6 → BatchNormalization14 RENAMED
@@ -1 +1 @@
  Carries out batch normalization as described in the paper
  https://arxiv.org/abs/1502.03167. Depending on the mode it is being run,
+ There are five required inputs 'X', 'scale', 'B', 'input_mean' and
+ 'input_var'.
+ Note that 'input_mean' and 'input_var' are expected to be the estimated
+ statistics in inference mode (training_mode=False, default),
+ and the running statistics in training mode (training_mode=True).
- there are multiple cases for the number of outputs, which we list below:
+ There are multiple cases for the number of outputs, which we list below:
- Output case #1: Y, mean, var, saved_mean, saved_var (training mode)
- Output case #2: Y (test mode)
+ Output case #1: Y, running_mean, running_var (training_mode=True)
+ Output case #2: Y (training_mode=False)
+
+ When training_mode=False, extra outputs are invalid.
+ The outputs are updated as follows when training_mode=True:
+
+ running_mean = input_mean * momentum + current_mean * (1 - momentum)
+ running_var = input_var * momentum + current_var * (1 - momentum)
+
+ Y = (X - current_mean) / sqrt(current_var + epsilon) * scale + B
+
+ where:
+
+ current_mean = ReduceMean(X, axis=all_except_channel_index)
+ current_var = ReduceVar(X, axis=all_except_channel_index)
+
+ Notice that ReduceVar refers to the population variance, and it equals
+ sum(sqrd(x_i - x_avg)) / N,
+ where N is the population size (this formula does not use the sample size N - 1).
+
+ When training_mode=False:
+
+ Y = (X - input_mean) / sqrt(input_var + epsilon) * scale + B
+
+ For previous (deprecated) non-spatial cases, implementors are suggested
+ to flatten the input shape to (N x C * D1 * D2 * ... * Dn) before a BatchNormalization Op.
+ This operator has **optional** inputs/outputs. See [ONNX IR](https://github.com/onnx/onnx/blob/main/docs/IR.md) for more details about the representation of optional arguments. An empty string may be used in the place of an actual argument's name to indicate a missing argument. Trailing optional arguments (those not followed by an argument that is present) may also be simply omitted.
  ### Attributes
  * **epsilon - FLOAT** (default is '1e-05'):
- The epsilon value to use to avoid division by zero, default is 1e-5f.
+ The epsilon value to use to avoid division by zero.
-
- * **is_test - INT** (default is '0'):
-
- If set to nonzero, run spatial batch normalization in test mode, default is 0.
  * **momentum - FLOAT** (default is '0.9'):
- Factor used in computing the running mean and variance, e.g., running_mean = running_mean * momentum + mean * (1 - momentum), default is 0.9f.
+ Factor used in computing the running mean and variance, e.g., running_mean = running_mean * momentum + mean * (1 - momentum).
- * **spatial - INT** (default is '1'):
+ * **training_mode - INT** (default is '0'):
- If true, compute the mean and variance across all spatial elements. If false, compute the mean and variance per feature. Default is 1.
+ If set to true, it indicates BatchNormalization is being used for training, and the optional outputs (running_mean and running_var) would be populated.
  ### Inputs
  - **X** (heterogeneous) - **T**:
- Input data tensor from the previous operator; dimensions for image case are (N x C x H x W), where N is the batch size, C is the number of channels, and H and W are the height and the width of the data. For non-image case, the dimensions are in the form of (N x C x D1 x D2 ... Dn), where N is the batch size.
+ Input data tensor from the previous operator; dimensions are in the form of (N x C x D1 x D2 ... Dn), where N is the batch size and C is the number of channels. Statistics are computed for every channel of C over the N and D1 to Dn dimensions. For image data, input dimensions become (N x C x H x W). The op also accepts a single-dimension input of size N, in which case C is assumed to be 1.
  - **scale** (heterogeneous) - **T**:
- The scale as a 1-dimensional tensor of size C to be applied to the output.
+ Scale tensor of shape (C).
  - **B** (heterogeneous) - **T**:
- The bias as a 1-dimensional tensor of size C to be applied to the output.
+ Bias tensor of shape (C).
- - **mean** (heterogeneous) - **T**:
+ - **input_mean** (heterogeneous) - **U**:
- The running mean (training) or the estimated mean (testing) as a 1-dimensional tensor of size C.
+ running (training) or estimated (testing) mean tensor of shape (C).
- - **var** (heterogeneous) - **T**:
+ - **input_var** (heterogeneous) - **U**:
- The running variance (training) or the estimated variance (testing) as a 1-dimensional tensor of size C.
+ running (training) or estimated (testing) variance tensor of shape (C).
  ### Outputs
- Between 1 and 5 outputs.
+ Between 1 and 3 outputs.
  - **Y** (heterogeneous) - **T**:
- The output tensor of the same shape as X.
+ The output tensor of the same shape as X
- - **mean** (optional, heterogeneous) - **T**:
+ - **running_mean** (optional, heterogeneous) - **U**:
- The running mean after the BatchNormalization operator. Must be in-place with the input mean. Should not be used for testing.
+ The running mean after the BatchNormalization operator.
- - **var** (optional, heterogeneous) - **T**:
+ - **running_var** (optional, heterogeneous) - **U**:
+ The running variance after the BatchNormalization operator. This op uses the population size (N) for calculating variance, and not the sample size N-1.
- The running variance after the BatchNormalization operator. Must be in-place with the input var. Should not be used for testing.
- - **saved_mean** (optional, heterogeneous) - **T**:
-
- Saved mean used during training to speed up gradient computation. Should not be used for testing.
- - **saved_var** (optional, heterogeneous) - **T**:
-
- Saved variance used during training to speed up gradient computation. Should not be used for testing.
  ### Type Constraints
- * **T** in ( tensor(double), tensor(float), tensor(float16) ):
+ * **T** in ( tensor(bfloat16), tensor(double), tensor(float), tensor(float16) ):
+ * **U** in ( tensor(bfloat16), tensor(double), tensor(float), tensor(float16) ):
+
- Constrain input and output types to float tensors.
+ Constrain input and output types to float tensors.
+ Constrain mean and variance types to float tensors. All float types are allowed for U.
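
Since the version-14 description above is formula-heavy, the following is a minimal NumPy sketch of those semantics. It only illustrates the text of the diff, not the official ONNX reference implementation; the function name `batch_norm_v14` is made up here, and the sketch assumes `X` has rank of at least 2 with the channel axis at position 1.

```python
import numpy as np

def batch_norm_v14(X, scale, B, input_mean, input_var,
                   epsilon=1e-5, momentum=0.9, training_mode=False):
    # Hypothetical helper mirroring the BatchNormalization-14 text above.
    # Reduce over every axis except the channel axis (axis 1), i.e.
    # ReduceMean/ReduceVar with axis=all_except_channel_index.
    axes = (0,) + tuple(range(2, X.ndim))
    # Reshape the (C,) parameters so they broadcast over N and D1..Dn.
    bshape = (1, -1) + (1,) * (X.ndim - 2)

    if training_mode:
        current_mean = X.mean(axis=axes)
        # Population variance: divide by N, not N - 1 (NumPy default ddof=0).
        current_var = X.var(axis=axes)
        Y = ((X - current_mean.reshape(bshape))
             / np.sqrt(current_var.reshape(bshape) + epsilon)
             * scale.reshape(bshape) + B.reshape(bshape))
        running_mean = input_mean * momentum + current_mean * (1 - momentum)
        running_var = input_var * momentum + current_var * (1 - momentum)
        return Y, running_mean, running_var

    # training_mode=False: normalize with the provided estimated statistics;
    # only Y is produced and the extra outputs are invalid.
    Y = ((X - input_mean.reshape(bshape))
         / np.sqrt(input_var.reshape(bshape) + epsilon)
         * scale.reshape(bshape) + B.reshape(bshape))
    return Y

# Example: an (N=2, C=3, H=4, W=4) image batch in training mode.
X = np.random.randn(2, 3, 4, 4).astype(np.float32)
scale = np.ones(3, dtype=np.float32)
B = np.zeros(3, dtype=np.float32)
mean = np.zeros(3, dtype=np.float32)
var = np.ones(3, dtype=np.float32)
Y, running_mean, running_var = batch_norm_v14(
    X, scale, B, mean, var, training_mode=True)
```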
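As a usage note, a version-14 node with the renamed inputs and outputs above can be declared through the standard `onnx.helper.make_node` API. The tensor names in this sketch are hypothetical:

```python
from onnx import helper

# Training-style node: training_mode=1 requests the optional
# running_mean / running_var outputs (between 1 and 3 outputs total).
bn_train = helper.make_node(
    "BatchNormalization",
    inputs=["X", "scale", "B", "input_mean", "input_var"],
    outputs=["Y", "running_mean", "running_var"],
    epsilon=1e-5,
    momentum=0.9,
    training_mode=1,
)

# Inference-style node: training_mode defaults to 0, so only Y is produced.
bn_infer = helper.make_node(
    "BatchNormalization",
    inputs=["X", "scale", "B", "input_mean", "input_var"],
    outputs=["Y"],
)
```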