Float stored in 4 bits¶

Papers¶

4 bit floating point formats have emerged as a solution to the rising cost and deployment challenges of large language models. The S1E2M1 format has been part of the Open Compute Project (OCP) standard.

As a result, a new data type was introduced in onnx==1.18.0 to support a limited set of operators to enable computation with float4.

FLOAT4E2M1: 1 bit for the sign, 2 bits for the exponents, and 1 bit for the mantissa. No nan or infinities.

E2M1¶

\(S\) stands for the sign. \(10_2\) describe a number base 2.

Float4 type¶
	E2M1
Exponent bias	1
Infinities
NaN
Zeros	\(S.00.0_2\)
Max	\(S.11.1_2\)
Min	\(S.00.1_2 = 2^{-1}\)

Let’s denote the bit representation as \(S.b_2 b_1 b_0\). The float value is defined by the following expressions:

Float4 type values¶
	E2M1
exponent \(\neq\) 0	\((-1)^S 2^{\sum_{i=1}^2 b_i 2^{i-1} - 1} \left( 1 + b_0 2^{-1} \right)\)
exponent \(=\) 0	\((-1)^S b_0 2^{-1}\)

The following table lists all the representable values by float4 E2M1, ignoring the sign bit:

Float4 type values¶
bits (ignoring sign bit)	E2M1
000	0
001	0.5
010	1
011	1.5
100	2
101	3
110	4
111	6

Cast¶

Upcasting from float4 to float32, float16, bfloat16, and float8 is exact. The behavior for downcasting to float 4 is summarized below

x	E2M1
-6<=x<=6	E2M1 converted value of x. Round to nearest even.
x=+/-0	+/-0
x>6	6
x<-6	-6
+Inf	6
-Inf	-6
NaN	6

Packing and Unpacking¶

Float4 is stored as 2x4bit in a single byte. The first element is stored in the 4 LSB and the second element is stored in the 4 MSB, i.e. for elements x and y that are consecutive elements in the array:

pack(x,y): y << 4 | x & 0x0F
unpack(z): x = z & 0x0F, y = z >> 4

In case the total number of elements is odd, padding of 4 bits will be appended. The storage size of a 4 bit tensor of size N is ceil(N/2).