
.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_tutorial/plot_gbegin_dataframe.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_tutorial_plot_gbegin_dataframe.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_tutorial_plot_gbegin_dataframe.py:


Dataframe as an input
=====================

.. index:: dataframe

A pipeline usually ingests data as a matrix. A dataframe can be
converted into a matrix if all of its columns share the same type.
But data held in a dataframe usually have multiple types:
float, integer, or string for categories. ONNX also supports
that case.

A dataset with categories
+++++++++++++++++++++++++

.. GENERATED FROM PYTHON SOURCE LINES 18-55

.. code-block:: Python


    import numpy
    import pprint
    from onnxruntime import InferenceSession
    from pandas import DataFrame
    from sklearn.pipeline import Pipeline
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.ensemble import RandomForestClassifier
    from skl2onnx import to_onnx
    from skl2onnx.algebra.type_helper import guess_initial_types


    data = DataFrame(
        [
            dict(CAT1="a", CAT2="c", num1=0.5, num2=0.6, y=0),
            dict(CAT1="b", CAT2="d", num1=0.4, num2=0.8, y=1),
            dict(CAT1="a", CAT2="d", num1=0.5, num2=0.56, y=0),
            dict(CAT1="a", CAT2="d", num1=0.55, num2=0.56, y=1),
            dict(CAT1="a", CAT2="c", num1=0.35, num2=0.86, y=0),
            dict(CAT1="a", CAT2="c", num1=0.5, num2=0.68, y=1),
        ]
    )

    cat_cols = ["CAT1", "CAT2"]
    train_data = data.drop("y", axis=1)


    categorical_transformer = Pipeline(
        [("onehot", OneHotEncoder(sparse_output=False, handle_unknown="ignore"))]
    )
    preprocessor = ColumnTransformer(
        transformers=[("cat", categorical_transformer, cat_cols)], remainder="passthrough"
    )
    pipe = Pipeline([("preprocess", preprocessor), ("rf", RandomForestClassifier())])
    pipe.fit(train_data, data["y"])


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Pipeline(steps=[('preprocess',
                     ColumnTransformer(remainder='passthrough',
                                       transformers=[('cat',
                                                      Pipeline(steps=[('onehot',
                                                                       OneHotEncoder(handle_unknown='ignore',
                                                                                     sparse_output=False))]),
                                                      ['CAT1', 'CAT2'])])),
                    ('rf', RandomForestClassifier())])

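As a small illustration of why a dataframe with categories does not
reduce to a single typed matrix, the sketch below builds a two-column
frame and looks at the dtype obtained when it is converted to a NumPy
array. The values are illustrative and only mirror the dataset above;
the sketch assumes *pandas* and *numpy* are installed and is not part
of the original example.

```python
import numpy
from pandas import DataFrame

# Illustrative two-column frame: one categorical column, one numerical column.
df = DataFrame({"CAT1": ["a", "b", "a"], "num1": [0.5, 0.4, 0.55]})

# Converting the whole frame forces a common dtype: with strings and floats
# mixed, NumPy falls back to generic Python objects.
mixed = df.to_numpy()
print(mixed.dtype)

# Selecting only the numerical column keeps a proper numeric matrix.
numeric = df[["num1"]].to_numpy()
print(numeric.dtype, numeric.shape)
```

Such an object array has no single element type, which is one way to
see why the columns are better described one by one, each with its own
name and type.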
.. GENERATED FROM PYTHON SOURCE LINES 56-60

Conversion to ONNX
++++++++++++++++++

Function *to_onnx* is called with the first row of the dataframe
as a sample, from which the input types are inferred.

.. GENERATED FROM PYTHON SOURCE LINES 60-64

.. code-block:: Python


    onx = to_onnx(pipe, train_data[:1], options={RandomForestClassifier: {"zipmap": False}})


.. GENERATED FROM PYTHON SOURCE LINES 65-69

Prediction with ONNX
++++++++++++++++++++

*onnxruntime* does not support dataframes: as the error message
below shows, it expects a dictionary mapping input names to values.

.. GENERATED FROM PYTHON SOURCE LINES 69-105

.. code-block:: Python


    sess = InferenceSession(onx.SerializeToString(), providers=["CPUExecutionProvider"])
    try:
        sess.run(None, train_data)
    except Exception as e:
        print(e)

    # Unhide conversion logic with a dataframe
    # ++++++++++++++++++++++++++++++++++++++++
    #
    # A dataframe can be seen as a set of columns with
    # different types. That's what ONNX should see:
    # a list of inputs, the input name is the column name,
    # the input type is the column type.


    def guess_schema_from_data(X):
        # One (name, type) pair per column.
        init = guess_initial_types(X)
        unique = set()
        for _, col in init:
            # Keep one input per column unless every column is a 2D tensor
            # with an unknown first dimension and the same tensor type.
            if len(col.shape) != 2:
                return init
            if col.shape[0] is not None:
                return init
            if len(unique) > 0 and col.__class__ not in unique:
                return init
            unique.add(col.__class__)
        # All columns share one type: merge them into a single input "X".
        unique = list(unique)
        return [("X", unique[0]([None, sum(_[1].shape[1] for _ in init)]))]


    init = guess_schema_from_data(train_data)

    pprint.pprint(init)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    run(): incompatible function arguments. The following argument types are supported:
        1. (self: onnxruntime.capi.onnxruntime_pybind11_state.InferenceSession, arg0: list[str], arg1: dict[str, object], arg2: onnxruntime.capi.onnxruntime_pybind11_state.RunOptions) -> list

    Invoked with: , ['label', 'probabilities'],   CAT1 CAT2  num1  num2
    0    a    c  0.50  0.60
    1    b    d  0.40  0.80
    2    a    d  0.50  0.56
    3    a    d  0.55  0.56
    4    a    c  0.35  0.86
    5    a    c  0.50  0.68, None
    [('CAT1', StringTensorType(shape=[None, 1])),
     ('CAT2', StringTensorType(shape=[None, 1])),
     ('num1', DoubleTensorType(shape=[None, 1])),
     ('num2', DoubleTensorType(shape=[None, 1]))]


.. GENERATED FROM PYTHON SOURCE LINES 106-107

The numerical columns are seen as doubles. Let's use float (float32) instead.

.. GENERATED FROM PYTHON SOURCE LINES 107-117

.. code-block:: Python



    for c in train_data.columns:
        if c not in cat_cols:
            train_data[c] = train_data[c].astype(numpy.float32)

    init = guess_schema_from_data(train_data)
    pprint.pprint(init)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    [('CAT1', StringTensorType(shape=[None, 1])),
     ('CAT2', StringTensorType(shape=[None, 1])),
     ('num1', FloatTensorType(shape=[None, 1])),
     ('num2', FloatTensorType(shape=[None, 1]))]


.. GENERATED FROM PYTHON SOURCE LINES 118-119

Let's convert with *skl2onnx* only, passing this schema as the initial types.

.. GENERATED FROM PYTHON SOURCE LINES 119-124

.. code-block:: Python


    onx2 = to_onnx(
        pipe, initial_types=init, options={RandomForestClassifier: {"zipmap": False}}
    )


.. GENERATED FROM PYTHON SOURCE LINES 125-129

Let's run it with onnxruntime.
We need to convert the dataframe into a dictionary
where column names become keys, and column values become
values, each reshaped into a single-column 2D array.

.. GENERATED FROM PYTHON SOURCE LINES 129-133

.. code-block:: Python


    inputs = {c: train_data[c].values.reshape((-1, 1)) for c in train_data.columns}
    pprint.pprint(inputs)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    {'CAT1': array([['a'],
           ['b'],
           ['a'],
           ['a'],
           ['a'],
           ['a']], dtype=object),
     'CAT2': array([['c'],
           ['d'],
           ['d'],
           ['d'],
           ['c'],
           ['c']], dtype=object),
     'num1': array([[0.5 ],
           [0.4 ],
           [0.5 ],
           [0.55],
           [0.35],
           [0.5 ]], dtype=float32),
     'num2': array([[0.6 ],
           [0.8 ],
           [0.56],
           [0.56],
           [0.86],
           [0.68]], dtype=float32)}


.. GENERATED FROM PYTHON SOURCE LINES 134-135

Inference.

.. GENERATED FROM PYTHON SOURCE LINES 135-143

.. code-block:: Python


    sess2 = InferenceSession(onx2.SerializeToString(), providers=["CPUExecutionProvider"])

    got2 = sess2.run(None, inputs)

    print(pipe.predict(train_data))
    print(got2[0])


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    [0 1 0 1 0 1]
    [0 1 0 1 0 1]


.. GENERATED FROM PYTHON SOURCE LINES 144-145

And probabilities.

.. GENERATED FROM PYTHON SOURCE LINES 145-148

.. code-block:: Python


    print(pipe.predict_proba(train_data))
    print(got2[1])


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    [[0.84 0.16]
     [0.32 0.68]
     [0.68 0.32]
     [0.17 0.83]
     [0.77 0.23]
     [0.36 0.64]]
    [[0.84000003 0.16      ]
     [0.32000035 0.67999965]
     [0.68000007 0.31999996]
     [0.1700005  0.8299995 ]
     [0.77       0.23000003]
     [0.3600003  0.6399997 ]]


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 0.300 seconds)


.. _sphx_glr_download_auto_tutorial_plot_gbegin_dataframe.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_gbegin_dataframe.ipynb <plot_gbegin_dataframe.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_gbegin_dataframe.py <plot_gbegin_dataframe.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: plot_gbegin_dataframe.zip <plot_gbegin_dataframe.zip>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    Gallery generated by Sphinx-Gallery