(onnxdetailfloat4)=
# Float stored in 4 bits
## Papers
4 bit floating point formats have emerged as a solution to the
rising cost and deployment challenges of large language models.
The S1E2M1 format has been part of the [Open Compute Project (OCP)](https://www.opencompute.org/documents/ocpmicroscalingformatsmxv10specfinalpdf)
standard.
As a result, a new data type was introduced in `onnx==1.18.0`
to support a limited set of operators to enable computation
with float4.
 `FLOAT4E2M1`: 1 bit for the sign, 2 bits for the exponents, and 1 bit for the mantissa.
No nan or infinities.
## E2M1
$S$ stands for the sign. $10_2$ describe a number base 2.
```{evalrst}
.. listtable:: Float4 type
:widths: 10 10
:headerrows: 1
* 
 E2M1
*  Exponent bias
 1
*  Infinities

*  NaN

*  Zeros
 :math:`S.00.0_2`
*  Max
 :math:`S.11.1_2`
*  Min
 :math:`S.00.1_2 = 2^{1}`
```
Let's denote the bit representation as $S.b_2 b_1 b_0$.
The float value is defined by the following expressions:
```{evalrst}
.. listtable:: Float4 type values
:widths: 10 10
:headerrows: 1
* 
 E2M1
*  exponent :math:`\neq` 0
 :math:`(1)^S 2^{\sum_{i=1}^2 b_i 2^{i1}  1} \left( 1 + b_0 2^{1} \right)`
*  exponent :math:`=` 0
 :math:`(1)^S b_0 2^{1}`
```
The following table lists all the representable values by float4 E2M1, ignoring the sign bit:
```{evalrst}
.. listtable:: Float4 type values
:widths: 10 10
:headerrows: 1
*  bits (ignoring sign bit)
 E2M1
*  000
 0
*  001
 0.5
*  010
 1
*  011
 1.5
*  100
 2
*  101
 3
*  110
 4
*  111
 6
```
## Cast
Upcasting from float4 to float32, float16, bfloat16, and float8 is exact.
The behavior for downcasting to float 4 is summarized below
 x  E2M1 
    
 6<=x<=6  E2M1 converted value of x. Round to nearest even. 
 x=+/0  +/0 
 x>6  6 
 x<6  6 
 +Inf  6 
 Inf  6 
 NaN  6 
## Packing and Unpacking
Float4 is stored as 2x4bit in a single byte.
The first element is stored in the 4 LSB and the second element is stored in the 4 MSB,
i.e. for elements `x` and `y` that are consecutive elements in the array:
```
pack(x,y): y << 4  x & 0x0F
unpack(z): x = z & 0x0F, y = z >> 4
```
In case the total number of elements is odd, padding of 4 bits will be appended.
The storage size of a 4 bit tensor of size `N` is `ceil(N/2)`.