.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_tutorial/plot_gexternal_lightgbm_reg.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end ` to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_tutorial_plot_gexternal_lightgbm_reg.py:

.. _example-lightgbm-reg:

Convert a pipeline with a LightGBM regressor
============================================

.. index:: LightGBM

The discrepancies observed when using float with the TreeEnsemble operator
(see :ref:`l-example-discrepencies-float-double`) explain why the converter
for *LGBMRegressor* may introduce significant discrepancies even when it is
used with float tensors. The *lightgbm* library is implemented with doubles.
A tree-ensemble regressor with multiple trees computes its prediction by
adding the predictions of every tree. After conversion to ONNX, this
summation becomes :math:`\left[\sum\right]_{i=1}^F float(T_i(x))`, where *F*
is the number of trees in the forest, :math:`T_i(x)` the output of tree *i*
and :math:`\left[\sum\right]` a float addition. The discrepancy can be
expressed as
:math:`D(x) = |\left[\sum\right]_{i=1}^F float(T_i(x)) - \sum_{i=1}^F T_i(x)|`.
It grows with the number of trees in the forest. To reduce its impact, an
option was added to split the node *TreeEnsembleRegressor* into multiple
ones and to perform the final summation with doubles. If the node is split
into *a* nodes of :math:`F/a` trees each, the discrepancy becomes
:math:`D'(x) = |\sum_{k=1}^a \left[\sum\right]_{i=1}^{F/a} float(T_{(k-1)F/a + i}(x)) - \sum_{i=1}^F T_i(x)|`.

Train a LGBMRegressor
+++++++++++++++++++++

.. GENERATED FROM PYTHON SOURCE LINES 38-66
.. code-block:: Python

    import packaging.version as pv
    import warnings
    import timeit
    import numpy
    from pandas import DataFrame
    import matplotlib.pyplot as plt
    from tqdm import tqdm
    from lightgbm import LGBMRegressor
    from onnxruntime import InferenceSession
    from skl2onnx import to_onnx, update_registered_converter
    from skl2onnx.common.shape_calculator import (
        calculate_linear_regressor_output_shapes,
    )  # noqa
    from onnxmltools import __version__ as oml_version
    from onnxmltools.convert.lightgbm.operator_converters.LightGbm import (
        convert_lightgbm,
    )  # noqa

    N = 1000
    X = numpy.random.randn(N, 20)
    y = numpy.random.randn(N) + numpy.random.randn(N) * 100 * numpy.random.randint(
        0, 1, 1000
    )

    reg = LGBMRegressor(n_estimators=1000)
    reg.fit(X, y)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    [LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000276 seconds. You can set `force_col_wise=true` to remove the overhead.
    [LightGBM] [Info] Total Bins 5100
    [LightGBM] [Info] Number of data points in the train set: 1000, number of used features: 20
    [LightGBM] [Info] Start training from score 0.001127
.. code-block:: none

    LGBMRegressor(n_estimators=1000)
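The summation effect described in the introduction can be reproduced without any model at all. The helpers below, `float32_sum` and `split_sum`, are illustrative stand-ins for a single *TreeEnsembleRegressor* node and for the split variant; they are not part of skl2onnx, and the constant 0.1 merely plays the role of a tree output (it is not exactly representable in binary, so every float addition rounds):

```python
import numpy

# Stand-ins for F = 1000 tree outputs.
tree_outputs = numpy.full(1000, 0.1, dtype=numpy.float64)
# Double-precision reference, the way lightgbm accumulates.
expected = tree_outputs.sum()


def float32_sum(values):
    """Accumulate everything in float32, like a single TreeEnsembleRegressor node."""
    acc = numpy.float32(0.0)
    for v in values:
        acc = numpy.float32(acc + numpy.float32(v))
    return float(acc)


def split_sum(values, split):
    """Sum chunks of `split` values in float32, then add the chunks in double."""
    return sum(
        float32_sum(values[i : i + split]) for i in range(0, len(values), split)
    )


err_single = abs(float32_sum(tree_outputs) - expected)
err_split = abs(split_sum(tree_outputs, 100) - expected)
print(err_single, err_split)
```

On this synthetic input the chunked accumulation is noticeably closer to the double-precision reference, which is exactly the mechanism the *split* option exploits.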


.. GENERATED FROM PYTHON SOURCE LINES 67-78

Register the converter for LGBMRegressor
++++++++++++++++++++++++++++++++++++++++

The converter is implemented in :epkg:`onnxmltools`:
`onnxmltools...LightGbm.py `_, and the shape calculator:
`onnxmltools...Regressor.py `_.

.. GENERATED FROM PYTHON SOURCE LINES 78-102

.. code-block:: Python

    def skl2onnx_convert_lightgbm(scope, operator, container):
        options = scope.get_options(operator.raw_operator)
        if "split" in options:
            if pv.Version(oml_version) < pv.Version("1.9.2"):
                warnings.warn(
                    "Option split was released in version 1.9.2 but %s is "
                    "installed. It will be ignored." % oml_version
                )
            operator.split = options["split"]
        else:
            operator.split = None
        convert_lightgbm(scope, operator, container)


    update_registered_converter(
        LGBMRegressor,
        "LightGbmLGBMRegressor",
        calculate_linear_regressor_output_shapes,
        skl2onnx_convert_lightgbm,
        options={"split": None},
    )

.. GENERATED FROM PYTHON SOURCE LINES 103-109

Convert
+++++++

We convert the same model following two scenarios: a single
*TreeEnsembleRegressor* node, or several. The *split* option sets the
number of trees per *TreeEnsembleRegressor* node.

.. GENERATED FROM PYTHON SOURCE LINES 109-120

.. code-block:: Python

    model_onnx = to_onnx(
        reg, X[:1].astype(numpy.float32), target_opset={"": 14, "ai.onnx.ml": 2}
    )

    model_onnx_split = to_onnx(
        reg,
        X[:1].astype(numpy.float32),
        target_opset={"": 14, "ai.onnx.ml": 2},
        options={"split": 100},
    )

.. GENERATED FROM PYTHON SOURCE LINES 121-123

Discrepancies
+++++++++++++

.. GENERATED FROM PYTHON SOURCE LINES 123-142
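A discrepancy is measured here on the flattened outputs with two aggregates: the total and the maximum absolute difference. On made-up vectors (the values below are illustrative, not actual model outputs, and are chosen to be exactly representable in binary):

```python
import numpy

got = numpy.array([1.0, 2.0, 3.0])        # hypothetical ONNX output
expected = numpy.array([1.0, 2.5, 2.75])  # hypothetical lightgbm prediction

diff = numpy.abs(got - expected)
# Total discrepancy (as `disp` below) and worst-case discrepancy (as `disc`).
print(diff.sum(), diff.max())  # 0.75 0.5
```

The same two aggregates are computed on the real model in this section and the next one.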
.. code-block:: Python

    sess = InferenceSession(
        model_onnx.SerializeToString(), providers=["CPUExecutionProvider"]
    )
    sess_split = InferenceSession(
        model_onnx_split.SerializeToString(), providers=["CPUExecutionProvider"]
    )

    X32 = X.astype(numpy.float32)
    expected = reg.predict(X32)
    got = sess.run(None, {"X": X32})[0].ravel()
    got_split = sess_split.run(None, {"X": X32})[0].ravel()

    disp = numpy.abs(got - expected).sum()
    disp_split = numpy.abs(got_split - expected).sum()

    print("sum of discrepancies 1 node", disp)
    print("sum of discrepancies split node", disp_split, "ratio:", disp / disp_split)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    sum of discrepancies 1 node 0.00020644343950206685
    sum of discrepancies split node 4.144931004458315e-05 ratio: 4.980624268052108

.. GENERATED FROM PYTHON SOURCE LINES 143-145

The sum of the discrepancies was reduced by a factor of 4 to 5.
The maximum discrepancy improves even more.

.. GENERATED FROM PYTHON SOURCE LINES 145-152

.. code-block:: Python

    disc = numpy.abs(got - expected).max()
    disc_split = numpy.abs(got_split - expected).max()

    print("max discrepancies 1 node", disc)
    print("max discrepancies split node", disc_split, "ratio:", disc / disc_split)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    max discrepancies 1 node 1.985479140209634e-06
    max discrepancies split node 2.6622454418756547e-07 ratio: 7.457911689805682

.. GENERATED FROM PYTHON SOURCE LINES 153-157

Processing time
+++++++++++++++

Inference with the split model is slightly slower, but not by much.

.. GENERATED FROM PYTHON SOURCE LINES 157-167

.. code-block:: Python

    print(
        "processing time no split",
        timeit.timeit(lambda: sess.run(None, {"X": X32})[0], number=150),
    )
    print(
        "processing time split",
        timeit.timeit(lambda: sess_split.run(None, {"X": X32})[0], number=150),
    )

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    processing time no split 2.342391199999838
    processing time split 2.7244762999998784
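The figures reported by :func:`timeit.timeit` are totals over ``number=150`` runs, so the per-inference cost is the total divided by 150. A tiny helper (not part of the tutorial, just arithmetic on the timings above) makes the comparison concrete:

```python
def per_call_ms(total_seconds, number=150):
    """Average per-call latency in milliseconds from a timeit total."""
    return total_seconds / number * 1000.0


# Totals reported above: ~2.342 s without split, ~2.724 s with split.
no_split_ms = per_call_ms(2.342391199999838)
split_ms = per_call_ms(2.7244762999998784)
print(round(no_split_ms, 2), round(split_ms, 2))  # 15.62 18.16
```

That is roughly a 2.5 ms overhead per call in exchange for the accuracy gain, on this machine and model size; actual numbers will vary.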
.. GENERATED FROM PYTHON SOURCE LINES 168-173

Split influence
+++++++++++++++

Let's see how the discrepancies evolve with the parameter *split*.

.. GENERATED FROM PYTHON SOURCE LINES 173-193

.. code-block:: Python

    res = []
    for i in tqdm(list(range(20, 170, 20)) + [200, 300, 400, 500]):
        model_onnx_split = to_onnx(
            reg,
            X[:1].astype(numpy.float32),
            target_opset={"": 14, "ai.onnx.ml": 2},
            options={"split": i},
        )
        sess_split = InferenceSession(
            model_onnx_split.SerializeToString(), providers=["CPUExecutionProvider"]
        )
        got_split = sess_split.run(None, {"X": X32})[0].ravel()
        disc_split = numpy.abs(got_split - expected).max()
        res.append(dict(split=i, disc=disc_split))

    df = DataFrame(res).set_index("split")
    df["baseline"] = disc
    print(df)

.. container:: sphx-glr-download sphx-glr-download-python

    :download:`Download Python source code: plot_gexternal_lightgbm_reg.py `

.. only:: html

    .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_