(l-onnx-doc-BatchNormalization)=

# BatchNormalization


(l-onnx-op-batchnormalization-15)=

## BatchNormalization - 15

### Version

- **name**: [BatchNormalization (GitHub)](https://github.com/onnx/onnx/blob/main/docs/Operators.md#BatchNormalization)
- **domain**: `main`
- **since_version**: `15`
- **function**: `False`
- **support_level**: `SupportType.COMMON`
- **shape inference**: `True`

This version of the operator has been available
**since version 15**.

### Summary

Carries out batch normalization as described in the paper
https://arxiv.org/abs/1502.03167. Depending on the mode it is being run,
There are five required inputs 'X', 'scale', 'B', 'input_mean' and
'input_var'.
Note that 'input_mean' and 'input_var' are expected to be the estimated
statistics in inference mode (training_mode=False, default),
and the running statistics in training mode (training_mode=True).
There are multiple cases for the number of outputs, which we list below:

* Output case #1: Y, running_mean, running_var (training_mode=True)
* Output case #2: Y (training_mode=False)

When training_mode=False, extra outputs are invalid.
The outputs are updated as follows when training_mode=True:
```
running_mean = input_mean * momentum + current_mean * (1 - momentum)
running_var = input_var * momentum + current_var * (1 - momentum)

Y = (X - current_mean) / sqrt(current_var + epsilon) * scale + B
```
where:
```
current_mean = ReduceMean(X, axis=all_except_channel_index)
current_var =  ReduceVar(X, axis=all_except_channel_index)
```
Notice that `ReduceVar` refers to the population variance, and it equals to
`sum(sqrd(x_i - x_avg)) / N`
where `N` is the population size (this formula does not use sample size `N - 1`).

The computation of ReduceMean and ReduceVar uses float to avoid overflow for float16 inputs.

When training_mode=False:
```
Y = (X - input_mean) / sqrt(input_var + epsilon) * scale + B
```

For previous (depreciated) non-spatial cases, implementors are suggested
to flatten the input shape to (N x C * D1 * D2 * ... * Dn) before a BatchNormalization Op.
This operator has **optional** inputs/outputs. See [ONNX IR](https://github.com/onnx/onnx/blob/main/docs/IR.md) for more details about the representation of optional arguments. An empty string may be used in the place of an actual argument's name to indicate a missing argument. Trailing optional arguments (those not followed by an argument that is present) may also be simply omitted.

### Attributes

* **epsilon - FLOAT** (default is `'1e-05'`):

  The epsilon value to use to avoid division by zero.

* **momentum - FLOAT** (default is `'0.9'`):

  Factor used in computing the running mean and variance.e.g., running_mean = running_mean * momentum + mean * (1 - momentum).

* **training_mode - INT** (default is `'0'`):

  If set to true, it indicates BatchNormalization is being used for training, and outputs 1 and 2 are to be computed.

### Inputs

- **X** (heterogeneous) - **T**:

  Input data tensor from the previous operator; dimensions are in the form of (N x C x D1 x D2 ... Dn), where N is the batch size, C is the number of channels. Statistics are computed for every channel of C over N and D1 to Dn dimensions. For image data, input dimensions become (N x C x H x W). The op also accepts single dimension input of size N in which case C is assumed to be 1
- **scale** (heterogeneous) - **T1**:

  Scale tensor of shape (C).
- **B** (heterogeneous) - **T1**:

  Bias tensor of shape (C).
- **input_mean** (heterogeneous) - **T2**:

  running (training) or estimated (testing) mean tensor of shape (C).
- **input_var** (heterogeneous) - **T2**:

  running (training) or estimated (testing) variance tensor of shape (C).

### Outputs

Between 1 and 3 outputs.

- **Y** (heterogeneous) - **T**:

  The output tensor of the same shape as X
- **running_mean** (optional, heterogeneous) - **T2**:

  The running mean after the BatchNormalization operator.
- **running_var** (optional, heterogeneous) - **T2**:

  The running variance after the BatchNormalization operator. This op uses the population size (N) for calculating variance, and not the sample size N-1.

### Type Constraints

* **T** in ( `tensor(bfloat16)`, `tensor(double)`, `tensor(float)`, `tensor(float16)` ):

  Constrain input and output types to float tensors.
* **T1** in ( `tensor(bfloat16)`, `tensor(double)`, `tensor(float)`, `tensor(float16)` ):

  Constrain scale and bias types to float tensors.
* **T2** in ( `tensor(bfloat16)`, `tensor(double)`, `tensor(float)`, `tensor(float16)` ):

  Constrain mean and variance types to float tensors.

```{toctree}
text_diff_BatchNormalization_14_15
```

(l-onnx-op-batchnormalization-14)=

## BatchNormalization - 14

### Version

- **name**: [BatchNormalization (GitHub)](https://github.com/onnx/onnx/blob/main/docs/Operators.md#BatchNormalization)
- **domain**: `main`
- **since_version**: `14`
- **function**: `False`
- **support_level**: `SupportType.COMMON`
- **shape inference**: `True`

This version of the operator has been available
**since version 14**.

### Summary

Carries out batch normalization as described in the paper
https://arxiv.org/abs/1502.03167. Depending on the mode it is being run,
There are five required inputs 'X', 'scale', 'B', 'input_mean' and
'input_var'.
Note that 'input_mean' and 'input_var' are expected to be the estimated
statistics in inference mode (training_mode=False, default),
and the running statistics in training mode (training_mode=True).
There are multiple cases for the number of outputs, which we list below:

Output case #1: Y, running_mean, running_var (training_mode=True)
Output case #2: Y (training_mode=False)

When training_mode=False, extra outputs are invalid.
The outputs are updated as follows when training_mode=True:
```
running_mean = input_mean * momentum + current_mean * (1 - momentum)
running_var = input_var * momentum + current_var * (1 - momentum)

Y = (X - current_mean) / sqrt(current_var + epsilon) * scale + B

where:

current_mean = ReduceMean(X, axis=all_except_channel_index)
current_var =  ReduceVar(X, axis=all_except_channel_index)

Notice that ReduceVar refers to the population variance, and it equals to
sum(sqrd(x_i - x_avg)) / N
where N is the population size (this formula does not use sample size N - 1).

```

When training_mode=False:
```
Y = (X - input_mean) / sqrt(input_var + epsilon) * scale + B
```

For previous (depreciated) non-spatial cases, implementors are suggested
to flatten the input shape to (N x C * D1 * D2 * ... * Dn) before a BatchNormalization Op.
This operator has **optional** inputs/outputs. See [ONNX IR](https://github.com/onnx/onnx/blob/main/docs/IR.md) for more details about the representation of optional arguments. An empty string may be used in the place of an actual argument's name to indicate a missing argument. Trailing optional arguments (those not followed by an argument that is present) may also be simply omitted.

### Attributes

* **epsilon - FLOAT** (default is `'1e-05'`):

  The epsilon value to use to avoid division by zero.

* **momentum - FLOAT** (default is `'0.9'`):

  Factor used in computing the running mean and variance.e.g., running_mean = running_mean * momentum + mean * (1 - momentum).

* **training_mode - INT** (default is `'0'`):

  If set to true, it indicates BatchNormalization is being used for training, and outputs 1, 2, 3, and 4 would be populated.

### Inputs

- **X** (heterogeneous) - **T**:

  Input data tensor from the previous operator; dimensions are in the form of (N x C x D1 x D2 ... Dn), where N is the batch size, C is the number of channels. Statistics are computed for every channel of C over N and D1 to Dn dimensions. For image data, input dimensions become (N x C x H x W). The op also accepts single dimension input of size N in which case C is assumed to be 1
- **scale** (heterogeneous) - **T**:

  Scale tensor of shape (C).
- **B** (heterogeneous) - **T**:

  Bias tensor of shape (C).
- **input_mean** (heterogeneous) - **U**:

  running (training) or estimated (testing) mean tensor of shape (C).
- **input_var** (heterogeneous) - **U**:

  running (training) or estimated (testing) variance tensor of shape (C).

### Outputs

Between 1 and 3 outputs.

- **Y** (heterogeneous) - **T**:

  The output tensor of the same shape as X
- **running_mean** (optional, heterogeneous) - **U**:

  The running mean after the BatchNormalization operator.
- **running_var** (optional, heterogeneous) - **U**:

  The running variance after the BatchNormalization operator. This op uses the population size (N) for calculating variance, and not the sample size N-1.

### Type Constraints

* **T** in ( `tensor(bfloat16)`, `tensor(double)`, `tensor(float)`, `tensor(float16)` ):

  Constrain input and output types to float tensors.
* **U** in ( `tensor(bfloat16)`, `tensor(double)`, `tensor(float)`, `tensor(float16)` ):

  Constrain mean and variance types to float tensors. It allows all float type for U.

```{toctree}
text_diff_BatchNormalization_9_15
text_diff_BatchNormalization_9_14
```

(l-onnx-op-batchnormalization-9)=

## BatchNormalization - 9

### Version

- **name**: [BatchNormalization (GitHub)](https://github.com/onnx/onnx/blob/main/docs/Operators.md#BatchNormalization)
- **domain**: `main`
- **since_version**: `9`
- **function**: `False`
- **support_level**: `SupportType.COMMON`
- **shape inference**: `True`

This version of the operator has been available
**since version 9**.

### Summary

Carries out batch normalization as described in the paper
https://arxiv.org/abs/1502.03167. Depending on the mode it is being run,
there are multiple cases for the number of outputs, which we list below:

Output case #1: Y, mean, var, saved_mean, saved_var (training mode)
Output case #2: Y (test mode)

For previous (depreciated) non-spatial cases, implementors are suggested
to flatten the input shape to (N x C*D1*D2 ..*Dn) before a BatchNormalization Op.
This operator has **optional** inputs/outputs. See [ONNX IR](https://github.com/onnx/onnx/blob/main/docs/IR.md) for more details about the representation of optional arguments. An empty string may be used in the place of an actual argument's name to indicate a missing argument. Trailing optional arguments (those not followed by an argument that is present) may also be simply omitted.

### Attributes

* **epsilon - FLOAT** (default is `'1e-05'`):

  The epsilon value to use to avoid division by zero.

* **momentum - FLOAT** (default is `'0.9'`):

  Factor used in computing the running mean and variance.e.g., running_mean = running_mean * momentum + mean * (1 - momentum).

### Inputs

- **X** (heterogeneous) - **T**:

  Input data tensor from the previous operator; dimensions are in the form of (N x C x D1 x D2 ... Dn), where N is the batch size, C is the number of channels. Statistics are computed for every channel of C over N and D1 to Dn dimensions. For image data, input dimensions become (N x C x H x W). The op also accepts single dimension input of size N in which case C is assumed to be 1
- **scale** (heterogeneous) - **T**:

  Scale tensor of shape (C).
- **B** (heterogeneous) - **T**:

  Bias tensor of shape (C).
- **mean** (heterogeneous) - **T**:

  running (training) or estimated (testing) mean tensor of shape (C).
- **var** (heterogeneous) - **T**:

  running (training) or estimated (testing) variance tensor of shape (C).

### Outputs

Between 1 and 5 outputs.

- **Y** (heterogeneous) - **T**:

  The output tensor of the same shape as X
- **mean** (optional, heterogeneous) - **T**:

  The running mean after the BatchNormalization operator.
- **var** (optional, heterogeneous) - **T**:

  The running variance after the BatchNormalization operator.
- **saved_mean** (optional, heterogeneous) - **T**:

  Saved mean used during training to speed up gradient computation.
- **saved_var** (optional, heterogeneous) - **T**:

  Saved variance used during training to speed up gradient computation.

### Type Constraints

* **T** in ( `tensor(double)`, `tensor(float)`, `tensor(float16)` ):

  Constrain input and output types to float tensors.

```{toctree}
text_diff_BatchNormalization_7_15
text_diff_BatchNormalization_7_14
text_diff_BatchNormalization_7_9
```

(l-onnx-op-batchnormalization-7)=

## BatchNormalization - 7

### Version

- **name**: [BatchNormalization (GitHub)](https://github.com/onnx/onnx/blob/main/docs/Operators.md#BatchNormalization)
- **domain**: `main`
- **since_version**: `7`
- **function**: `False`
- **support_level**: `SupportType.COMMON`
- **shape inference**: `True`

This version of the operator has been available
**since version 7**.

### Summary

Carries out batch normalization as described in the paper
https://arxiv.org/abs/1502.03167. Depending on the mode it is being run,
there are multiple cases for the number of outputs, which we list below:

Output case #1: Y, mean, var, saved_mean, saved_var (training mode)
Output case #2: Y (test mode)
    This operator has **optional** inputs/outputs. See [ONNX IR](https://github.com/onnx/onnx/blob/main/docs/IR.md) for more details about the representation of optional arguments. An empty string may be used in the place of an actual argument's name to indicate a missing argument. Trailing optional arguments (those not followed by an argument that is present) may also be simply omitted.

### Attributes

* **epsilon - FLOAT** (default is `'1e-05'`):

  The epsilon value to use to avoid division by zero.

* **momentum - FLOAT** (default is `'0.9'`):

  Factor used in computing the running mean and variance.e.g., running_mean = running_mean * momentum + mean * (1 - momentum).

* **spatial - INT** (default is `'1'`):

  If true, compute the mean and variance across per activation. If false, compute the mean and variance across per feature over each mini-batch.

### Inputs

- **X** (heterogeneous) - **T**:

  Input data tensor from the previous operator; dimensions for image case are (N x C x H x W), where N is the batch size, C is the number of channels, and H and W are the height and the width of the data. For non image case, the dimensions are in the form of (N x C x D1 x D2 ... Dn), where N is the batch size.
- **scale** (heterogeneous) - **T**:

  If spatial is true, the dimension of scale is (C). If spatial is false, the dimensions of scale are (C x D1 x ... x Dn)
- **B** (heterogeneous) - **T**:

  If spatial is true, the dimension of bias is (C). If spatial is false, the dimensions of bias are (C x D1 x ... x Dn)
- **mean** (heterogeneous) - **T**:

  If spatial is true, the dimension of the running mean (training) or the estimated mean (testing) is (C). If spatial is false, the dimensions of the running mean (training) or the estimated mean (testing) are (C x D1 x ... x Dn).
- **var** (heterogeneous) - **T**:

  If spatial is true, the dimension of the running variance(training) or the estimated variance (testing) is (C). If spatial is false, the dimensions of the running variance(training) or the estimated variance (testing) are (C x D1 x ... x Dn).

### Outputs

Between 1 and 5 outputs.

- **Y** (heterogeneous) - **T**:

  The output tensor of the same shape as X
- **mean** (optional, heterogeneous) - **T**:

  The running mean after the BatchNormalization operator.
- **var** (optional, heterogeneous) - **T**:

  The running variance after the BatchNormalization operator.
- **saved_mean** (optional, heterogeneous) - **T**:

  Saved mean used during training to speed up gradient computation.
- **saved_var** (optional, heterogeneous) - **T**:

  Saved variance used during training to speed up gradient computation.

### Type Constraints

* **T** in ( `tensor(double)`, `tensor(float)`, `tensor(float16)` ):

  Constrain input and output types to float tensors.

```{toctree}
text_diff_BatchNormalization_6_15
text_diff_BatchNormalization_6_14
text_diff_BatchNormalization_6_9
text_diff_BatchNormalization_6_7
```

(l-onnx-op-batchnormalization-6)=

## BatchNormalization - 6

### Version

- **name**: [BatchNormalization (GitHub)](https://github.com/onnx/onnx/blob/main/docs/Operators.md#BatchNormalization)
- **domain**: `main`
- **since_version**: `6`
- **function**: `False`
- **support_level**: `SupportType.COMMON`
- **shape inference**: `True`

This version of the operator has been available
**since version 6**.

### Summary

Carries out batch normalization as described in the paper
https://arxiv.org/abs/1502.03167. Depending on the mode it is being run,
there are multiple cases for the number of outputs, which we list below:

Output case #1: Y, mean, var, saved_mean, saved_var (training mode)
Output case #2: Y (test mode)

### Attributes

* **epsilon - FLOAT** (default is `'1e-05'`):

  The epsilon value to use to avoid division by zero, default is 1e-5f.

* **is_test - INT** (default is `'0'`):

  If set to nonzero, run spatial batch normalization in test mode, default is 0.

* **momentum - FLOAT** (default is `'0.9'`):

  Factor used in computing the running mean and variance.e.g., running_mean = running_mean * momentum + mean * (1 - momentum), default is 0.9f.

* **spatial - INT** (default is `'1'`):

  If true, compute the mean and variance across all spatial elements If false, compute the mean and variance across per feature.Default is 1.

### Inputs

- **X** (heterogeneous) - **T**:

  Input data tensor from the previous operator; dimensions for image case are (N x C x H x W), where N is the batch size, C is the number of channels, and H and W are the height and the width of the data. For non image case, the dimensions are in the form of (N x C x D1 x D2 ... Dn), where N is the batch size.
- **scale** (heterogeneous) - **T**:

  The scale as a 1-dimensional tensor of size C to be applied to the output.
- **B** (heterogeneous) - **T**:

  The bias as a 1-dimensional tensor of size C to be applied to the output.
- **mean** (heterogeneous) - **T**:

  The running mean (training) or the estimated mean (testing) as a 1-dimensional tensor of size C.
- **var** (heterogeneous) - **T**:

  The running variance (training) or the estimated variance (testing) as a 1-dimensional tensor of size C.

### Outputs

Between 1 and 5 outputs.

- **Y** (heterogeneous) - **T**:

  The output tensor of the same shape as X.
- **mean** (optional, heterogeneous) - **T**:

  The running mean after the BatchNormalization operator. Must be in-place with the input mean. Should not be used for testing.
- **var** (optional, heterogeneous) - **T**:

  The running variance after the BatchNormalization operator. Must be in-place with the input var. Should not be used for testing.
- **saved_mean** (optional, heterogeneous) - **T**:

  Saved mean used during training to speed up gradient computation. Should not be used for testing.
- **saved_var** (optional, heterogeneous) - **T**:

  Saved variance used during training to speed up gradient computation. Should not be used for testing.

### Type Constraints

* **T** in ( `tensor(double)`, `tensor(float)`, `tensor(float16)` ):

  Constrain input and output types to float tensors.

```{toctree}
text_diff_BatchNormalization_1_15
text_diff_BatchNormalization_1_14
text_diff_BatchNormalization_1_9
text_diff_BatchNormalization_1_7
text_diff_BatchNormalization_1_6
```

(l-onnx-op-batchnormalization-1)=

## BatchNormalization - 1

### Version

- **name**: [BatchNormalization (GitHub)](https://github.com/onnx/onnx/blob/main/docs/Operators.md#BatchNormalization)
- **domain**: `main`
- **since_version**: `1`
- **function**: `False`
- **support_level**: `SupportType.COMMON`
- **shape inference**: `False`

This version of the operator has been available
**since version 1**.

### Summary

Carries out batch normalization as described in the paper
https://arxiv.org/abs/1502.03167. Depending on the mode it is being run,
there are multiple cases for the number of outputs, which we list below:

Output case #1: Y, mean, var, saved_mean, saved_var (training mode)
Output case #2: Y (test mode)

### Attributes

* **consumed_inputs - INTS** (required) :

  legacy optimization attribute.

* **epsilon - FLOAT** (default is `'1e-05'`):

  The epsilon value to use to avoid division by zero, default is 1e-5f.

* **is_test - INT** (default is `'0'`):

  If set to nonzero, run spatial batch normalization in test mode, default is 0.

* **momentum - FLOAT** (default is `'0.9'`):

  Factor used in computing the running mean and variance.e.g., running_mean = running_mean * momentum + mean * (1 - momentum), default is 0.9f.

* **spatial - INT** (default is `'1'`):

  If true, compute the mean and variance across all spatial elements If false, compute the mean and variance across per feature.Default is 1.

### Inputs

- **X** (heterogeneous) - **T**:

  The input 4-dimensional tensor of shape NCHW.
- **scale** (heterogeneous) - **T**:

  The scale as a 1-dimensional tensor of size C to be applied to the output.
- **B** (heterogeneous) - **T**:

  The bias as a 1-dimensional tensor of size C to be applied to the output.
- **mean** (heterogeneous) - **T**:

  The running mean (training) or the estimated mean (testing) as a 1-dimensional tensor of size C.
- **var** (heterogeneous) - **T**:

  The running variance (training) or the estimated variance (testing) as a 1-dimensional tensor of size C.

### Outputs

Between 1 and 5 outputs.

- **Y** (heterogeneous) - **T**:

  The output 4-dimensional tensor of the same shape as X.
- **mean** (optional, heterogeneous) - **T**:

  The running mean after the BatchNormalization operator. Must be in-place with the input mean. Should not be used for testing.
- **var** (optional, heterogeneous) - **T**:

  The running variance after the BatchNormalization operator. Must be in-place with the input var. Should not be used for testing.
- **saved_mean** (optional, heterogeneous) - **T**:

  Saved mean used during training to speed up gradient computation. Should not be used for testing.
- **saved_var** (optional, heterogeneous) - **T**:

  Saved variance used during training to speed up gradient computation. Should not be used for testing.

### Type Constraints

* **T** in ( `tensor(double)`, `tensor(float)`, `tensor(float16)` ):

  Constrain input and output types to float tensors.