Issues when switching to float

Most models in scikit-learn do computation with double, not float. Most models in deep learning use float because that’s the most common situation with GPU. ONNX was initially created to facilitate the deployment of deep learning models and that explains why many converters assume the converted models should use float. That assumption does not usually harm the predictions, the conversion to float introduce small discrepencies compare to double predictions. That assumption is usually true if the prediction function is continuous, y = f(x), then dy = f'(x) dx. We can determine an upper bound to the discrepencies : \Delta(y) \leqslant \sup_x \left\Vert f'(x)\right\Vert dx. dx is the discrepency introduced by a float conversion, dx = x - numpy.float32(x).

However, that’s not the case for every model. A decision tree trained for a regression is not a continuous function. Therefore, even a small dx may introduce a huge discrepency. Let’s look into an example which always produces discrepencies and some ways to overcome this situation.

More into the issue

The below example is built to fail. It contains integer features with different order of magnitude rounded to integer. A decision tree compares features to thresholds. In most cases, float and double comparison gives the same result. We denote [x]_{f32} the conversion (or cast) numpy.float32(x).

x \leqslant y = [x]_{f32} \leqslant [y]_{f32}

However, the probability that both comparisons give different results is not null. The following graph shows the discord areas.

from mlprodict.sklapi import OnnxPipeline
from skl2onnx.sklapi import CastTransformer, CastRegressor
from skl2onnx import to_onnx
from mlprodict.onnx_conv import to_onnx as to_onnx_extended
from mlprodict.onnxrt import OnnxInference
from onnxruntime import InferenceSession
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.datasets import make_regression
import numpy
import matplotlib.pyplot as plt


def area_mismatch_rule(N, delta, factor, rule=None):
    if rule is None:
        def rule(t): return numpy.float32(t)
    xst = []
    yst = []
    xsf = []
    ysf = []
    for x in range(-N, N):
        for y in range(-N, N):
            dx = (1. + x * delta) * factor
            dy = (1. + y * delta) * factor
            c1 = 1 if numpy.float64(dx) <= numpy.float64(dy) else 0
            c2 = 1 if numpy.float32(dx) <= rule(dy) else 0
            key = abs(c1 - c2)
            if key == 1:
                xsf.append(dx)
                ysf.append(dy)
            else:
                xst.append(dx)
                yst.append(dy)
    return xst, yst, xsf, ysf


delta = 36e-10
factor = 1
xst, yst, xsf, ysf = area_mismatch_rule(100, delta, factor)


fig, ax = plt.subplots(1, 1, figsize=(5, 5))
ax.plot(xst, yst, '.', label="agree")
ax.plot(xsf, ysf, '.', label="disagree")
ax.set_title("Region where x <= y and (float)x <= (float)y agree")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.plot([min(xst), max(xst)], [min(yst), max(yst)], 'k--')
ax.legend()
Region where x <= y and (float)x <= (float)y agree

Out:

<matplotlib.legend.Legend object at 0x7fd6f67ca048>

The pipeline and the data

We can now build an example where the learned decision tree does many comparisons in this discord area. This is done by rounding features to integers, a frequent case happening when dealing with categorical features.

X, y = make_regression(10000, 10)
X_train, X_test, y_train, y_test = train_test_split(X, y)

Xi_train, yi_train = X_train.copy(), y_train.copy()
Xi_test, yi_test = X_test.copy(), y_test.copy()
for i in range(X.shape[1]):
    Xi_train[:, i] = (Xi_train[:, i] * 2 ** i).astype(numpy.int64)
    Xi_test[:, i] = (Xi_test[:, i] * 2 ** i).astype(numpy.int64)

max_depth = 10

model = Pipeline([
    ('scaler', StandardScaler()),
    ('dt', DecisionTreeRegressor(max_depth=max_depth))
])

model.fit(Xi_train, yi_train)

Out:

Pipeline(steps=[('scaler', StandardScaler()),
                ('dt', DecisionTreeRegressor(max_depth=10))])

The discrepencies

Let’s reuse the function implemented in the first example Comparison and look into the conversion.

def diff(p1, p2):
    p1 = p1.ravel()
    p2 = p2.ravel()
    d = numpy.abs(p2 - p1)
    return d.max(), (d / numpy.abs(p1)).max()


onx = to_onnx(model, Xi_train[:1].astype(numpy.float32))

sess = InferenceSession(onx.SerializeToString())

X32 = Xi_test.astype(numpy.float32)

skl = model.predict(X32)
ort = sess.run(None, {'X': X32})[0]

print(diff(skl, ort))

Out:

(127.97818359360296, 3.4285463128770672)

The discrepencies are significant. The ONNX model keeps float at every step.

In scikit-learn:

CastTransformer

We could try to use double everywhere. Unfortunately, ONNX ML Operators only allows float coefficients for the operator TreeEnsembleRegressor. We may want to compromise by casting the output of the normalizer into float in the scikit-learn pipeline.

model2 = Pipeline([
    ('scaler', StandardScaler()),
    ('cast', CastTransformer()),
    ('dt', DecisionTreeRegressor(max_depth=max_depth))
])

model2.fit(Xi_train, yi_train)

Out:

Pipeline(steps=[('scaler', StandardScaler()), ('cast', CastTransformer()),
                ('dt', DecisionTreeRegressor(max_depth=10))])

The discrepencies.

onx2 = to_onnx(model2, Xi_train[:1].astype(numpy.float32))

sess2 = InferenceSession(onx2.SerializeToString())

skl2 = model2.predict(X32)
ort2 = sess2.run(None, {'X': X32})[0]

print(diff(skl2, ort2))

Out:

(127.97818359360296, 3.4285463128770672)

That still fails because the normalizer in scikit-learn and in ONNX use different types. The cast still happens and the dx is still here. To remove it, we need to use double in ONNX normalizer.

model3 = Pipeline([
    ('cast64', CastTransformer(dtype=numpy.float64)),
    ('scaler', StandardScaler()),
    ('cast', CastTransformer()),
    ('dt', DecisionTreeRegressor(max_depth=max_depth))
])

model3.fit(Xi_train, yi_train)
onx3 = to_onnx(model3, Xi_train[:1].astype(numpy.float32),
               options={StandardScaler: {'div': 'div_cast'}})

sess3 = InferenceSession(onx3.SerializeToString())

skl3 = model3.predict(X32)
ort3 = sess3.run(None, {'X': X32})[0]

print(diff(skl3, ort3))

Out:

(2.8443492055885145e-05, 5.629382317213401e-08)

It works. That also means that it is difficult to change the computation type when a pipeline includes a discontinuous function. It is better to keep the same types all along before using a decision tree.

Sledgehammer

The idea here is to always train the next step based on ONNX outputs. That way, every step of the pipeline is trained based on ONNX output.

  • Trains the first step.

  • Converts the step into ONNX

  • Computes ONNX outputs.

  • Trains the second step on these outputs.

  • Converts the second step into ONNX.

  • Merges it with the first step.

  • Computes ONNX outputs of the merged two first steps.

It is implemented in class OnnxPipeline.

model_onx = OnnxPipeline([
    ('scaler', StandardScaler()),
    ('dt', DecisionTreeRegressor(max_depth=max_depth))
])

model_onx.fit(Xi_train, yi_train)

Out:

OnnxPipeline(steps=[('scaler',
                     OnnxTransformer(onnx_bytes=b'\x08\x06\x12\x08skl2onnx\x1a\x051.7.1"\x07ai.onnx(\x002\x00:\xf6\x01\n\xa6\x01\n\x01X\x12\x08variable\x1a\x06Scaler"\x06Scaler*=\n\x06offset=o\x12\x03<=e\xcf\x8b;=\xfd@\xcc\xbc=\x9e\x14\xd3==\xe6\x89z\xbe=w-!\xbf=\x054\x91\xbe=_Nv>=G\x03(@=\x87`/\xc0\xa0\x01\x06*<\n\x05scale=\x02\xb2\xb9?=\...c\x8c>=\xa8\x1f\x07>=\x96\xdd\x82==Nf\xff<=\xdc\x96\x7f<=i\xcf\x00<=2\xa1\x80;=\x17:\x00;\xa0\x01\x06:\nai.onnx.ml\x12\x1emlprodict_ONNX(StandardScaler)Z\x11\n\x01X\x12\x0c\n\n\x08\x01\x12\x06\n\x00\n\x02\x08\nb\x18\n\x08variable\x12\x0c\n\n\x08\x01\x12\x06\n\x00\n\x02\x08\nB\x0e\n\nai.onnx.ml\x10\x01')),
                    ('dt', DecisionTreeRegressor(max_depth=10))])

The conversion.

onx4 = to_onnx(model_onx, Xi_train[:1].astype(numpy.float32))

sess4 = InferenceSession(onx4.SerializeToString())

skl4 = model_onx.predict(X32)
ort4 = sess4.run(None, {'X': X32})[0]

print(diff(skl4, ort4))

Out:

(2.335943236175808e-05, 5.629382317213401e-08)

It works too in a more simple way.

No discrepencies at all?

Is it possible to get no error at all? There is one major obstacle: scikit-learn stores the predicted values in every leave with double (_tree.pyx - _get_value_ndarray), ONNX defines the the predicted values as floats: TreeEnsembleRegressor. What can we do to solve it? What if we could extend ONNX specifications to support double instead of floats.

tree = DecisionTreeRegressor(max_depth=max_depth)
tree.fit(Xi_train, yi_train)

model_onx = to_onnx_extended(tree, Xi_train[:1].astype(numpy.float64),
                             rewrite_ops=True)

oinf5 = OnnxInference(model_onx, runtime='python_compiled')
print(oinf5)

Out:

OnnxInference(...)
    def compiled_run(dict_inputs):
        # inputs
        X = dict_inputs['X']
        (variable, ) = n0_treeensembleregressordouble(X)
        return {
            'variable': variable,
        }

Let’s measure the discrepencies.

X64 = Xi_test.astype(numpy.float64)
skl5 = tree.predict(X64)
ort5 = oinf5.run({'X': X64})['variable']

Perfect, no discrepencies at all.

print(diff(skl5, ort5))

Out:

(0.0, 0.0)

CastRegressor

The previous example demonstrated the type difference for the predicted values explains the small differences between scikit-learn and onnxruntime. But it does not with the current ONNX. Another option is to cast the the predictions into floats in the scikit-learn pipeline.

ctree = CastRegressor(DecisionTreeRegressor(max_depth=max_depth))
ctree.fit(Xi_train, yi_train)

onx6 = to_onnx(ctree, Xi_train[:1].astype(numpy.float32))

sess6 = InferenceSession(onx6.SerializeToString())

skl6 = ctree.predict(X32)
ort6 = sess6.run(None, {'X': X32})[0]

print(diff(skl6, ort6))

Out:

(0.0, 0.0)

Success!

Total running time of the script: ( 0 minutes 1.227 seconds)

Gallery generated by Sphinx-Gallery