Issues when switching to float¶

Most models in scikit-learn do computation with double, not float. Most models in deep learning use float because that’s the most common situation with GPU. ONNX was initially created to facilitate the deployment of deep learning models and that explains why many converters assume the converted models should use float. That assumption does not usually harm the predictions, the conversion to float introduce small discrepencies compare to double predictions. That assumption is usually true if the prediction function is continuous, \(y = f(x)\), then \(dy = f'(x) dx\). We can determine an upper bound to the discrepencies : \(\Delta(y) \leqslant \sup_x \left\Vert f'(x)\right\Vert dx\). dx is the discrepency introduced by a float conversion, dx = x - numpy.float32(x).

However, that’s not the case for every model. A decision tree trained for a regression is not a continuous function. Therefore, even a small dx may introduce a huge discrepency. Let’s look into an example which always produces discrepencies and some ways to overcome this situation.

More into the issue¶

The below example is built to fail. It contains integer features with different order of magnitude rounded to integer. A decision tree compares features to thresholds. In most cases, float and double comparison gives the same result. We denote \([x]_{f32}\) the conversion (or cast) numpy.float32(x).

\[x \leqslant y = [x]_{f32} \leqslant [y]_{f32}\]

However, the probability that both comparisons give different results is not null. The following graph shows the discord areas.

from skl2onnx.sklapi import CastTransformer
from skl2onnx import to_onnx
from onnxruntime import InferenceSession
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.datasets import make_regression
import numpy
import matplotlib.pyplot as plt


def area_mismatch_rule(N, delta, factor, rule=None):
    if rule is None:

        def rule(t):
            return numpy.float32(t)

    xst = []
    yst = []
    xsf = []
    ysf = []
    for x in range(-N, N):
        for y in range(-N, N):
            dx = (1.0 + x * delta) * factor
            dy = (1.0 + y * delta) * factor
            c1 = 1 if numpy.float64(dx) <= numpy.float64(dy) else 0
            c2 = 1 if numpy.float32(dx) <= rule(dy) else 0
            key = abs(c1 - c2)
            if key == 1:
                xsf.append(dx)
                ysf.append(dy)
            else:
                xst.append(dx)
                yst.append(dy)
    return xst, yst, xsf, ysf


delta = 36e-10
factor = 1
xst, yst, xsf, ysf = area_mismatch_rule(100, delta, factor)


fig, ax = plt.subplots(1, 1, figsize=(5, 5))
ax.plot(xst, yst, ".", label="agree")
ax.plot(xsf, ysf, ".", label="disagree")
ax.set_title("Region where x <= y and (float)x <= (float)y agree")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.plot([min(xst), max(xst)], [min(yst), max(yst)], "k--")
ax.legend()

Region where x <= y and (float)x <= (float)y agree

<matplotlib.legend.Legend object at 0x7d300011ce00>

The pipeline and the data¶

We can now build an example where the learned decision tree does many comparisons in this discord area. This is done by rounding features to integers, a frequent case happening when dealing with categorical features.

X, y = make_regression(10000, 10)
X_train, X_test, y_train, y_test = train_test_split(X, y)

Xi_train, yi_train = X_train.copy(), y_train.copy()
Xi_test, yi_test = X_test.copy(), y_test.copy()
for i in range(X.shape[1]):
    Xi_train[:, i] = (Xi_train[:, i] * 2**i).astype(numpy.int64)
    Xi_test[:, i] = (Xi_test[:, i] * 2**i).astype(numpy.int64)

max_depth = 10

model = Pipeline(
    [("scaler", StandardScaler()), ("dt", DecisionTreeRegressor(max_depth=max_depth))]
)

model.fit(Xi_train, yi_train)

Pipeline(steps=[('scaler', StandardScaler()),
                ('dt', DecisionTreeRegressor(max_depth=10))])

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

The discrepencies¶

Let’s reuse the function implemented in the first example Comparison and look into the conversion.

def diff(p1, p2):
    p1 = p1.ravel()
    p2 = p2.ravel()
    d = numpy.abs(p2 - p1)
    return d.max(), (d / numpy.abs(p1)).max()


onx = to_onnx(model, Xi_train[:1].astype(numpy.float32), target_opset=15)

sess = InferenceSession(onx.SerializeToString(), providers=["CPUExecutionProvider"])

X32 = Xi_test.astype(numpy.float32)

skl = model.predict(X32)
ort = sess.run(None, {"X": X32})[0]

print(diff(skl, ort))

(np.float64(150.75409036808895), np.float64(0.9914355799148548))

The discrepencies are significant. The ONNX model keeps float at every step.

In scikit-learn:

CastTransformer¶

We could try to use double everywhere. Unfortunately, ONNX ML Operators only allows float coefficients for the operator TreeEnsembleRegressor. We may want to compromise by casting the output of the normalizer into float in the scikit-learn pipeline.

model2 = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("cast", CastTransformer()),
        ("dt", DecisionTreeRegressor(max_depth=max_depth)),
    ]
)

model2.fit(Xi_train, yi_train)

Pipeline(steps=[('scaler', StandardScaler()), ('cast', CastTransformer()),
                ('dt', DecisionTreeRegressor(max_depth=10))])

The discrepencies.

onx2 = to_onnx(model2, Xi_train[:1].astype(numpy.float32), target_opset=15)

sess2 = InferenceSession(onx2.SerializeToString(), providers=["CPUExecutionProvider"])

skl2 = model2.predict(X32)
ort2 = sess2.run(None, {"X": X32})[0]

print(diff(skl2, ort2))

(np.float64(150.75409036808895), np.float64(0.9914355799148548))

That still fails because the normalizer in scikit-learn and in ONNX use different types. The cast still happens and the dx is still here. To remove it, we need to use double in ONNX normalizer.

model3 = Pipeline(
    [
        ("cast64", CastTransformer(dtype=numpy.float64)),
        ("scaler", StandardScaler()),
        ("cast", CastTransformer()),
        ("dt", DecisionTreeRegressor(max_depth=max_depth)),
    ]
)

model3.fit(Xi_train, yi_train)
onx3 = to_onnx(
    model3,
    Xi_train[:1].astype(numpy.float32),
    options={StandardScaler: {"div": "div_cast"}},
    target_opset=15,
)

sess3 = InferenceSession(onx3.SerializeToString(), providers=["CPUExecutionProvider"])

skl3 = model3.predict(X32)
ort3 = sess3.run(None, {"X": X32})[0]

print(diff(skl3, ort3))

(np.float64(3.0233519396460906e-05), np.float64(5.855332980488693e-08))

It works. That also means that it is difficult to change the computation type when a pipeline includes a discontinuous function. It is better to keep the same types all along before using a decision tree.

Total running time of the script: (0 minutes 0.486 seconds)

Gallery generated by Sphinx-Gallery

	steps steps: list of tuples List of (name of step, estimator) tuples that are to be chained in sequential order. To be compatible with the scikit-learn API, all steps must define `fit`. All non-last steps must also define `transform`. See :ref:`Combining Estimators ` for more details.	[('scaler', ...), ('dt', ...)]
	transform_input transform_input: list of str, default=None The names of the :term:`metadata` parameters that should be transformed by the pipeline before passing it to the step consuming it. This enables transforming some input arguments to ``fit`` (other than ``X``) to be transformed by the steps of the pipeline up to the step which requires them. Requirement is defined via :ref:`metadata routing `. For instance, this can be used to pass a validation set through the pipeline. You can only set this if metadata routing is enabled, which you can enable using ``sklearn.set_config(enable_metadata_routing=True)``. .. versionadded:: 1.6	None
	memory memory: str or object with the joblib.Memory interface, default=None Used to cache the fitted transformers of the pipeline. The last step will never be cached, even if it is a transformer. By default, no caching is performed. If a string is given, it is the path to the caching directory. Enabling caching triggers a clone of the transformers before fitting. Therefore, the transformer instance given to the pipeline cannot be inspected directly. Use the attribute ``named_steps`` or ``steps`` to inspect estimators within the pipeline. Caching the transformers is advantageous when fitting is time consuming. See :ref:`sphx_glr_auto_examples_neighbors_plot_caching_nearest_neighbors.py` for an example on how to enable caching.	None
	verbose verbose: bool, default=False If True, the time elapsed while fitting each step will be printed as it is completed.	False

	copy copy: bool, default=True If False, try to avoid a copy and do inplace scaling instead. This is not guaranteed to always work inplace; e.g. if the data is not a NumPy array or scipy.sparse CSR matrix, a copy may still be returned.	True
	with_mean with_mean: bool, default=True If True, center the data before scaling. This does not work (and will raise an exception) when attempted on sparse matrices, because centering them entails building a dense matrix which in common use cases is likely to be too large to fit in memory.	True
	with_std with_std: bool, default=True If True, scale the data to unit variance (or equivalently, unit standard deviation).	True

	criterion criterion: {"squared_error", "friedman_mse", "absolute_error", "poisson"}, default="squared_error" The function to measure the quality of a split. Supported criteria are "squared_error" for the mean squared error, which is equal to variance reduction as feature selection criterion and minimizes the L2 loss using the mean of each terminal node, "friedman_mse", which uses mean squared error with Friedman's improvement score for potential splits, "absolute_error" for the mean absolute error, which minimizes the L1 loss using the median of each terminal node, and "poisson" which uses reduction in the half mean Poisson deviance to find splits. .. versionadded:: 0.18 Mean Absolute Error (MAE) criterion. .. versionadded:: 0.24 Poisson deviance criterion.	'squared_error'
	splitter splitter: {"best", "random"}, default="best" The strategy used to choose the split at each node. Supported strategies are "best" to choose the best split and "random" to choose the best random split.	'best'
	max_depth max_depth: int, default=None The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples. For an example of how ``max_depth`` influences the model, see :ref:`sphx_glr_auto_examples_tree_plot_tree_regression.py`.	10
	min_samples_split min_samples_split: int or float, default=2 The minimum number of samples required to split an internal node: - If int, then consider `min_samples_split` as the minimum number. - If float, then `min_samples_split` is a fraction and `ceil(min_samples_split * n_samples)` are the minimum number of samples for each split. .. versionchanged:: 0.18 Added float values for fractions.	2
	min_samples_leaf min_samples_leaf: int or float, default=1 The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least ``min_samples_leaf`` training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression. - If int, then consider `min_samples_leaf` as the minimum number. - If float, then `min_samples_leaf` is a fraction and `ceil(min_samples_leaf * n_samples)` are the minimum number of samples for each node. .. versionchanged:: 0.18 Added float values for fractions.	1
	min_weight_fraction_leaf min_weight_fraction_leaf: float, default=0.0 The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.	0.0
	max_features max_features: int, float or {"sqrt", "log2"}, default=None The number of features to consider when looking for the best split: - If int, then consider `max_features` features at each split. - If float, then `max_features` is a fraction and `max(1, int(max_features * n_features_in_))` features are considered at each split. - If "sqrt", then `max_features=sqrt(n_features)`. - If "log2", then `max_features=log2(n_features)`. - If None, then `max_features=n_features`. Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than ``max_features`` features.	None
	random_state random_state: int, RandomState instance or None, default=None Controls the randomness of the estimator. The features are always randomly permuted at each split, even if ``splitter`` is set to ``"best"``. When ``max_features < n_features``, the algorithm will select ``max_features`` at random at each split before finding the best split among them. But the best found split may vary across different runs, even if ``max_features=n_features``. That is the case, if the improvement of the criterion is identical for several splits and one split has to be selected at random. To obtain a deterministic behaviour during fitting, ``random_state`` has to be fixed to an integer. See :term:`Glossary ` for details.	None
	max_leaf_nodes max_leaf_nodes: int, default=None Grow a tree with ``max_leaf_nodes`` in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.	None
	min_impurity_decrease min_impurity_decrease: float, default=0.0 A node will be split if this split induces a decrease of the impurity greater than or equal to this value. The weighted impurity decrease equation is the following:: N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity) where ``N`` is the total number of samples, ``N_t`` is the number of samples at the current node, ``N_t_L`` is the number of samples in the left child, and ``N_t_R`` is the number of samples in the right child. ``N``, ``N_t``, ``N_t_R`` and ``N_t_L`` all refer to the weighted sum, if ``sample_weight`` is passed. .. versionadded:: 0.19	0.0
	ccp_alpha ccp_alpha: non-negative float, default=0.0 Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ``ccp_alpha`` will be chosen. By default, no pruning is performed. See :ref:`minimal_cost_complexity_pruning` for details. See :ref:`sphx_glr_auto_examples_tree_plot_cost_complexity_pruning.py` for an example of such pruning. .. versionadded:: 0.22	0.0
	monotonic_cst monotonic_cst: array-like of int of shape (n_features), default=None Indicates the monotonicity constraint to enforce on each feature. - 1: monotonic increase - 0: no constraint - -1: monotonic decrease If monotonic_cst is None, no constraints are applied. Monotonicity constraints are not supported for: - multioutput regressions (i.e. when `n_outputs_ > 1`), - regressions trained on data with missing values. Read more in the :ref:`User Guide `. .. versionadded:: 1.4	None

	steps steps: list of tuples List of (name of step, estimator) tuples that are to be chained in sequential order. To be compatible with the scikit-learn API, all steps must define `fit`. All non-last steps must also define `transform`. See :ref:`Combining Estimators ` for more details.	[('scaler', ...), ('cast', ...), ...]
	transform_input transform_input: list of str, default=None The names of the :term:`metadata` parameters that should be transformed by the pipeline before passing it to the step consuming it. This enables transforming some input arguments to ``fit`` (other than ``X``) to be transformed by the steps of the pipeline up to the step which requires them. Requirement is defined via :ref:`metadata routing `. For instance, this can be used to pass a validation set through the pipeline. You can only set this if metadata routing is enabled, which you can enable using ``sklearn.set_config(enable_metadata_routing=True)``. .. versionadded:: 1.6	None
	memory memory: str or object with the joblib.Memory interface, default=None Used to cache the fitted transformers of the pipeline. The last step will never be cached, even if it is a transformer. By default, no caching is performed. If a string is given, it is the path to the caching directory. Enabling caching triggers a clone of the transformers before fitting. Therefore, the transformer instance given to the pipeline cannot be inspected directly. Use the attribute ``named_steps`` or ``steps`` to inspect estimators within the pipeline. Caching the transformers is advantageous when fitting is time consuming. See :ref:`sphx_glr_auto_examples_neighbors_plot_caching_nearest_neighbors.py` for an example on how to enable caching.	None
	verbose verbose: bool, default=False If True, the time elapsed while fitting each step will be printed as it is completed.	False

	copy copy: bool, default=True If False, try to avoid a copy and do inplace scaling instead. This is not guaranteed to always work inplace; e.g. if the data is not a NumPy array or scipy.sparse CSR matrix, a copy may still be returned.	True
	with_mean with_mean: bool, default=True If True, center the data before scaling. This does not work (and will raise an exception) when attempted on sparse matrices, because centering them entails building a dense matrix which in common use cases is likely to be too large to fit in memory.	True
	with_std with_std: bool, default=True If True, scale the data to unit variance (or equivalently, unit standard deviation).	True