BatchNormalization - 7 vs 15

The following section compares an older version of the operator with a newer one, after both definitions have been converted to markdown text. Additions to the newer version are marked with `+` (green), deletions with `-` (red). Anything else is unchanged.

BatchNormalization7 → BatchNormalization15 RENAMED
@@ -1 +1 @@
  Carries out batch normalization as described in the paper
  https://arxiv.org/abs/1502.03167. Depending on the mode it is being run,
+ There are five required inputs 'X', 'scale', 'B', 'input_mean' and
+ 'input_var'.
+ Note that 'input_mean' and 'input_var' are expected to be the estimated
+ statistics in inference mode (training_mode=False, default),
+ and the running statistics in training mode (training_mode=True).
- there are multiple cases for the number of outputs, which we list below:
+ There are multiple cases for the number of outputs, which we list below:
- Output case #1: Y, mean, var, saved_mean, saved_var (training mode)
+ * Output case #1: Y, running_mean, running_var (training_mode=True)
- Output case #2: Y (test mode)
+ * Output case #2: Y (training_mode=False)
+
+ When training_mode=False, extra outputs are invalid.
+ The outputs are updated as follows when training_mode=True:
+
+ running_mean = input_mean * momentum + current_mean * (1 - momentum)
+ running_var = input_var * momentum + current_var * (1 - momentum)
+
+ Y = (X - current_mean) / sqrt(current_var + epsilon) * scale + B
+
+ where:
+
+ current_mean = ReduceMean(X, axis=all_except_channel_index)
+ current_var = ReduceVar(X, axis=all_except_channel_index)
+
+ Notice that ReduceVar refers to the population variance, and it equals
+ sum(sqrd(x_i - x_avg)) / N
+ where N is the population size (this formula does not use sample size N - 1).
+
+ The computation of ReduceMean and ReduceVar uses float to avoid overflow for float16 inputs.
+
+ When training_mode=False:
+
+ Y = (X - input_mean) / sqrt(input_var + epsilon) * scale + B
+
+ For previous (deprecated) non-spatial cases, implementors are suggested
+ to flatten the input shape to (N x C * D1 * D2 * ... * Dn) before a BatchNormalization Op.
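
To make the opset-15 formulas above concrete, the following is a minimal NumPy sketch of both modes. It is an illustration, not the normative ONNX reference implementation; the function name, the broadcasting details, and the assumption that `X` is at least 2-D are our own.

```python
import numpy as np

def batch_norm_v15(X, scale, B, input_mean, input_var,
                   epsilon=1e-5, momentum=0.9, training_mode=False):
    # X: (N, C, D1, ..., Dn); scale, B, input_mean, input_var: (C,).
    # Reshape the per-channel parameters so they broadcast over N and D1..Dn.
    shape = (1, -1) + (1,) * (X.ndim - 2)
    if training_mode:
        # ReduceMean / ReduceVar over every axis except the channel axis.
        axes = tuple(i for i in range(X.ndim) if i != 1)
        current_mean = X.mean(axis=axes)
        current_var = X.var(axis=axes)  # ddof=0: population variance (N, not N - 1)
        running_mean = input_mean * momentum + current_mean * (1 - momentum)
        running_var = input_var * momentum + current_var * (1 - momentum)
        Y = ((X - current_mean.reshape(shape))
             / np.sqrt(current_var.reshape(shape) + epsilon)
             * scale.reshape(shape) + B.reshape(shape))
        return Y, running_mean, running_var
    # training_mode=False: normalize with the estimated statistics only.
    Y = ((X - input_mean.reshape(shape))
         / np.sqrt(input_var.reshape(shape) + epsilon)
         * scale.reshape(shape) + B.reshape(shape))
    return Y
```

For float16 inputs the spec computes the statistics in float; casting `X` with `X.astype(np.float32)` before the reductions would mirror that detail.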
  This operator has **optional** inputs/outputs. See [ONNX IR](https://github.com/onnx/onnx/blob/main/docs/IR.md) for more details about the representation of optional arguments. An empty string may be used in the place of an actual argument's name to indicate a missing argument. Trailing optional arguments (those not followed by an argument that is present) may also be simply omitted.
  ### Attributes
  * **epsilon - FLOAT** (default is '1e-05'):
  The epsilon value to use to avoid division by zero.
  * **momentum - FLOAT** (default is '0.9'):
  Factor used in computing the running mean and variance, e.g. running_mean = running_mean * momentum + mean * (1 - momentum).
- * **spatial - INT** (default is '1'):
+ * **training_mode - INT** (default is '0'):
- If true, compute the mean and variance across per activation. If false, compute the mean and variance across per feature over each mini-batch.
+ If set to true, it indicates BatchNormalization is being used for training, and outputs 1 and 2 are to be computed.
  ### Inputs
  - **X** (heterogeneous) - **T**:
- Input data tensor from the previous operator; dimensions for image case are (N x C x H x W), where N is the batch size, C is the number of channels, and H and W are the height and the width of the data. For non image case, the dimensions are in the form of (N x C x D1 x D2 ... Dn), where N is the batch size.
+ Input data tensor from the previous operator; dimensions are in the form of (N x C x D1 x D2 ... Dn), where N is the batch size and C is the number of channels. Statistics are computed for every channel of C over N and D1 to Dn dimensions. For image data, input dimensions become (N x C x H x W). The op also accepts a single-dimension input of size N, in which case C is assumed to be 1.
- - **scale** (heterogeneous) - **T**:
+ - **scale** (heterogeneous) - **T1**:
- If spatial is true, the dimension of scale is (C). If spatial is false, the dimensions of scale are (C x D1 x ... x Dn)
+ Scale tensor of shape (C).
- - **B** (heterogeneous) - **T**:
+ - **B** (heterogeneous) - **T1**:
- If spatial is true, the dimension of bias is (C). If spatial is false, the dimensions of bias are (C x D1 x ... x Dn)
+ Bias tensor of shape (C).
- - **mean** (heterogeneous) - **T**:
+ - **input_mean** (heterogeneous) - **T2**:
- If spatial is true, the dimension of the running mean (training) or the estimated mean (testing) is (C). If spatial is false, the dimensions of the running mean (training) or the estimated mean (testing) are (C x D1 x ... x Dn).
+ Running (training) or estimated (testing) mean tensor of shape (C).
- - **var** (heterogeneous) - **T**:
+ - **input_var** (heterogeneous) - **T2**:
- If spatial is true, the dimension of the running variance (training) or the estimated variance (testing) is (C). If spatial is false, the dimensions of the running variance (training) or the estimated variance (testing) are (C x D1 x ... x Dn).
+ Running (training) or estimated (testing) variance tensor of shape (C).
  ### Outputs
- Between 1 and 5 outputs.
+ Between 1 and 3 outputs.
  - **Y** (heterogeneous) - **T**:
  The output tensor of the same shape as X.
- - **mean** (optional, heterogeneous) - **T**:
+ - **running_mean** (optional, heterogeneous) - **T2**:
  The running mean after the BatchNormalization operator.
- - **var** (optional, heterogeneous) - **T**:
+ - **running_var** (optional, heterogeneous) - **T2**:
+ The running variance after the BatchNormalization operator. This op uses the population size (N) for calculating variance, and not the sample size N - 1.
- The running variance after the BatchNormalization operator.
- - **saved_mean** (optional, heterogeneous) - **T**:
- Saved mean used during training to speed up gradient computation.
- - **saved_var** (optional, heterogeneous) - **T**:
- Saved variance used during training to speed up gradient computation.
  ### Type Constraints
- * **T** in ( tensor(double), tensor(float), tensor(float16) ):
+ * **T** in ( tensor(bfloat16), tensor(double), tensor(float), tensor(float16) ):
  Constrain input and output types to float tensors.
+ * **T1** in ( tensor(bfloat16), tensor(double), tensor(float), tensor(float16) ):
+ Constrain scale and bias types to float tensors.
+ * **T2** in ( tensor(bfloat16), tensor(double), tensor(float), tensor(float16) ):
+ Constrain mean and variance types to float tensors.
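
Finally, for readers updating graphs from opset 7: a node with the renamed inputs and the new training_mode attribute can be assembled with the standard onnx.helper API. The tensor names below are arbitrary placeholders.

```python
from onnx import helper

node = helper.make_node(
    "BatchNormalization",
    inputs=["X", "scale", "B", "input_mean", "input_var"],
    outputs=["Y"],     # append "running_mean", "running_var" only when training_mode=1
    epsilon=1e-5,
    momentum=0.9,
    training_mode=0,   # 0 = inference (default); extra outputs are invalid here
)
```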