.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_tutorial/plot_gbegin_dataframe.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_tutorial_plot_gbegin_dataframe.py:


Dataframe as an input
=====================

.. index:: dataframe

A pipeline usually ingests data as a matrix. A dataframe can be converted
into a matrix if all of its columns share the same type. But a dataframe
usually holds columns of several types: float, integer, or string for
categories. ONNX also supports that case.

A dataset with categories
+++++++++++++++++++++++++

.. GENERATED FROM PYTHON SOURCE LINES 18-54

.. code-block:: Python


    import numpy
    import pprint
    from onnxruntime import InferenceSession
    from pandas import DataFrame
    from sklearn.pipeline import Pipeline
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.ensemble import RandomForestClassifier
    from skl2onnx import to_onnx
    from skl2onnx.algebra.type_helper import guess_initial_types


    data = DataFrame(
        [
            dict(CAT1="a", CAT2="c", num1=0.5, num2=0.6, y=0),
            dict(CAT1="b", CAT2="d", num1=0.4, num2=0.8, y=1),
            dict(CAT1="a", CAT2="d", num1=0.5, num2=0.56, y=0),
            dict(CAT1="a", CAT2="d", num1=0.55, num2=0.56, y=1),
            dict(CAT1="a", CAT2="c", num1=0.35, num2=0.86, y=0),
            dict(CAT1="a", CAT2="c", num1=0.5, num2=0.68, y=1),
        ]
    )

    cat_cols = ["CAT1", "CAT2"]
    train_data = data.drop("y", axis=1)


    # One-hot encode the categorical columns, pass the numeric ones through.
    categorical_transformer = Pipeline(
        [("onehot", OneHotEncoder(sparse_output=False, handle_unknown="ignore"))]
    )
    preprocessor = ColumnTransformer(
        transformers=[("cat", categorical_transformer, cat_cols)], remainder="passthrough"
    )
    pipe = Pipeline([("preprocess", preprocessor), ("rf", RandomForestClassifier())])
    pipe.fit(train_data, data["y"])


.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Pipeline(steps=[('preprocess',
                     ColumnTransformer(remainder='passthrough',
                                       transformers=[('cat',
                                                      Pipeline(steps=[('onehot',
                                                                       OneHotEncoder(handle_unknown='ignore',
                                                                                     sparse_output=False))]),
                                                      ['CAT1', 'CAT2'])])),
                    ('rf', RandomForestClassifier())])
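The reason a dataframe with categories cannot be handed over as one matrix can be checked with pandas alone. The following is a minimal sketch, assuming a two-row excerpt of the dataset above; the variable names ``df``, ``mixed`` and ``numeric`` are illustrative and not part of the original example.

```python
import numpy
from pandas import DataFrame

# Two rows with the same column layout as the tutorial dataset (labels omitted).
df = DataFrame(
    [
        dict(CAT1="a", CAT2="c", num1=0.5, num2=0.6),
        dict(CAT1="b", CAT2="d", num1=0.4, num2=0.8),
    ]
)

# Each column carries its own dtype.
print(df.dtypes)

# Converting all columns at once falls back to a generic object array,
# which is not a typed tensor.
mixed = df.to_numpy()
print(mixed.dtype)

# Restricting to the numeric columns yields a homogeneous float64 matrix.
numeric = df[["num1", "num2"]].to_numpy()
print(numeric.dtype, numeric.shape)
```

Only the numeric part maps to a single typed matrix, which is why the converted model is better described as one input per column.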


.. GENERATED FROM PYTHON SOURCE LINES 55-59

Conversion to ONNX
++++++++++++++++++

Function *to_onnx* accepts a dataframe: the call below converts the
pipeline from a one-row slice of the training data.

.. GENERATED FROM PYTHON SOURCE LINES 59-63

.. code-block:: Python


    onx = to_onnx(pipe, train_data[:1], options={RandomForestClassifier: {"zipmap": False}})


.. GENERATED FROM PYTHON SOURCE LINES 64-68

Prediction with ONNX
++++++++++++++++++++

*onnxruntime* does not support dataframes.

.. GENERATED FROM PYTHON SOURCE LINES 68-104

.. code-block:: Python


    sess = InferenceSession(onx.SerializeToString(), providers=["CPUExecutionProvider"])

    # Passing the dataframe directly fails: run() expects a dictionary.
    try:
        sess.run(None, train_data)
    except Exception as e:
        print(e)

    # Unhide conversion logic with a dataframe
    # ++++++++++++++++++++++++++++++++++++++++
    #
    # A dataframe can be seen as a set of columns with
    # different types. That's what ONNX should see:
    # a list of inputs, the input name is the column name,
    # the input type is the column type.


    def guess_schema_from_data(X):
        init = guess_initial_types(X)
        unique = set()
        for _, col in init:
            # Keep one input per column unless every column is a 2D tensor
            # with an unknown first dimension and the same tensor type.
            if len(col.shape) != 2:
                return init
            if col.shape[0] is not None:
                return init
            if len(unique) > 0 and col.__class__ not in unique:
                return init
            unique.add(col.__class__)
        # All columns share one type: merge them into a single input "X".
        unique = list(unique)
        return [("X", unique[0]([None, sum(_[1].shape[1] for _ in init)]))]


    init = guess_schema_from_data(train_data)

    pprint.pprint(init)


.. rst-class:: sphx-glr-script-out

.. code-block:: none

    run(): incompatible function arguments. The following argument types are supported:
        1. (self: onnxruntime.capi.onnxruntime_pybind11_state.InferenceSession, arg0: List[str], arg1: Dict[str, object], arg2: onnxruntime.capi.onnxruntime_pybind11_state.RunOptions) -> List[object]

    Invoked with: , ['label', 'probabilities'],   CAT1 CAT2  num1  num2
    0    a    c  0.50  0.60
    1    b    d  0.40  0.80
    2    a    d  0.50  0.56
    3    a    d  0.55  0.56
    4    a    c  0.35  0.86
    5    a    c  0.50  0.68, None
    [('CAT1', StringTensorType(shape=[None, 1])),
     ('CAT2', StringTensorType(shape=[None, 1])),
     ('num1', DoubleTensorType(shape=[None, 1])),
     ('num2', DoubleTensorType(shape=[None, 1]))]


.. GENERATED FROM PYTHON SOURCE LINES 105-106

The numeric columns are inferred as double. Let's use float (32-bit) instead.

.. GENERATED FROM PYTHON SOURCE LINES 106-116

.. code-block:: Python


    for c in train_data.columns:
        if c not in cat_cols:
            train_data[c] = train_data[c].astype(numpy.float32)


    init = guess_schema_from_data(train_data)
    pprint.pprint(init)


.. rst-class:: sphx-glr-script-out

.. code-block:: none

    [('CAT1', StringTensorType(shape=[None, 1])),
     ('CAT2', StringTensorType(shape=[None, 1])),
     ('num1', FloatTensorType(shape=[None, 1])),
     ('num2', FloatTensorType(shape=[None, 1]))]


.. GENERATED FROM PYTHON SOURCE LINES 117-118

Let's convert with *skl2onnx* only.

.. GENERATED FROM PYTHON SOURCE LINES 118-123

.. code-block:: Python


    onx2 = to_onnx(
        pipe, initial_types=init, options={RandomForestClassifier: {"zipmap": False}}
    )


.. GENERATED FROM PYTHON SOURCE LINES 124-128

Let's run it with onnxruntime.
We need to convert the dataframe into a dictionary
where column names become keys, and column
values become values.

.. GENERATED FROM PYTHON SOURCE LINES 128-132

.. code-block:: Python


    inputs = {c: train_data[c].values.reshape((-1, 1)) for c in train_data.columns}
    pprint.pprint(inputs)


.. rst-class:: sphx-glr-script-out

.. code-block:: none

    {'CAT1': array([['a'],
           ['b'],
           ['a'],
           ['a'],
           ['a'],
           ['a']], dtype=object),
     'CAT2': array([['c'],
           ['d'],
           ['d'],
           ['d'],
           ['c'],
           ['c']], dtype=object),
     'num1': array([[0.5 ],
           [0.4 ],
           [0.5 ],
           [0.55],
           [0.35],
           [0.5 ]], dtype=float32),
     'num2': array([[0.6 ],
           [0.8 ],
           [0.56],
           [0.56],
           [0.86],
           [0.68]], dtype=float32)}


.. GENERATED FROM PYTHON SOURCE LINES 133-134

Inference.

.. GENERATED FROM PYTHON SOURCE LINES 134-142

.. code-block:: Python


    sess2 = InferenceSession(onx2.SerializeToString(), providers=["CPUExecutionProvider"])

    got2 = sess2.run(None, inputs)

    print(pipe.predict(train_data))
    print(got2[0])


.. rst-class:: sphx-glr-script-out

.. code-block:: none

    [0 1 0 1 0 1]
    [0 1 0 1 0 1]


.. GENERATED FROM PYTHON SOURCE LINES 143-144

And the probabilities, which agree up to float32 rounding.

.. GENERATED FROM PYTHON SOURCE LINES 144-147

.. code-block:: Python


    print(pipe.predict_proba(train_data))
    print(got2[1])


.. rst-class:: sphx-glr-script-out

.. code-block:: none

    [[0.82 0.18]
     [0.26 0.74]
     [0.76 0.24]
     [0.37 0.63]
     [0.75 0.25]
     [0.29 0.71]]
    [[0.82       0.18      ]
     [0.2600004  0.7399996 ]
     [0.76       0.24000004]
     [0.3700003  0.6299997 ]
     [0.75       0.25000003]
     [0.29000038 0.7099996 ]]


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 0.412 seconds)
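The two preparation steps used above (casting numeric columns to float32, then reshaping every column to an ``(n, 1)`` array keyed by its name) can be wrapped in one helper. This is a minimal sketch rather than part of *skl2onnx* or *onnxruntime*: the function name ``dataframe_to_inputs`` and the variables ``demo`` and ``feeds`` are hypothetical, and it assumes every non-categorical column is numeric.

```python
import numpy
from pandas import DataFrame


def dataframe_to_inputs(df, cat_cols):
    """Builds a feed dictionary with one (n, 1) array per dataframe column.

    Hypothetical helper: categorical columns are kept as object arrays of
    strings, every other column is assumed numeric and cast to float32.
    """
    inputs = {}
    for name in df.columns:
        values = df[name].to_numpy()
        if name in cat_cols:
            values = values.astype(object)
        else:
            values = values.astype(numpy.float32)
        # The schema declares each input with shape [None, 1].
        inputs[name] = values.reshape((-1, 1))
    return inputs


demo = DataFrame({"CAT1": ["a", "b"], "num1": [0.5, 0.4]})
feeds = dataframe_to_inputs(demo, ["CAT1"])
for name, value in feeds.items():
    print(name, value.dtype, value.shape)
```

A dictionary built this way has the same layout as ``inputs`` in the example, so it could be passed to ``sess2.run(None, feeds)`` provided its keys match the input names declared at conversion time.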