.. GENERATED FROM PYTHON SOURCE LINES 103-109

Define the inputs of the ONNX graph
+++++++++++++++++++++++++++++++++++

*sklearn-onnx* does not know which features were used to train the
model, but it needs to know the name and type of each input.
We simply reuse the dataframe column definitions.

.. GENERATED FROM PYTHON SOURCE LINES 109-111

.. code-block:: Python

    print(X_train.dtypes)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    pclass         int64
    name          object
    sex           object
    age          float64
    sibsp          int64
    parch          int64
    ticket        object
    fare         float64
    cabin         object
    embarked      object
    boat          object
    body         float64
    home.dest     object
    dtype: object

.. GENERATED FROM PYTHON SOURCE LINES 112-113

The schema after conversion:

.. GENERATED FROM PYTHON SOURCE LINES 113-134

.. code-block:: Python

    def convert_dataframe_schema(df, drop=None):
        inputs = []
        for k, v in zip(df.columns, df.dtypes):
            if drop is not None and k in drop:
                continue
            if v == "int64":
                t = Int64TensorType([None, 1])
            elif v == "float64":
                t = FloatTensorType([None, 1])
            else:
                t = StringTensorType([None, 1])
            inputs.append((k, t))
        return inputs


    inputs = convert_dataframe_schema(X_train)

    pprint.pprint(inputs)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    [('pclass', Int64TensorType(shape=[None, 1])),
     ('name', StringTensorType(shape=[None, 1])),
     ('sex', StringTensorType(shape=[None, 1])),
     ('age', FloatTensorType(shape=[None, 1])),
     ('sibsp', Int64TensorType(shape=[None, 1])),
     ('parch', Int64TensorType(shape=[None, 1])),
     ('ticket', StringTensorType(shape=[None, 1])),
     ('fare', FloatTensorType(shape=[None, 1])),
     ('cabin', StringTensorType(shape=[None, 1])),
     ('embarked', StringTensorType(shape=[None, 1])),
     ('boat', StringTensorType(shape=[None, 1])),
     ('body', FloatTensorType(shape=[None, 1])),
     ('home.dest', StringTensorType(shape=[None, 1]))]

.. GENERATED FROM PYTHON SOURCE LINES 135-138

Merging single columns into vectors inside the graph is not
the most efficient way to compute the prediction.
It could be done before converting the pipeline into a graph.

..
GENERATED FROM PYTHON SOURCE LINES 140-142

Convert the pipeline into ONNX
++++++++++++++++++++++++++++++

.. GENERATED FROM PYTHON SOURCE LINES 142-148

.. code-block:: Python

    try:
        model_onnx = convert_sklearn(clf, "pipeline_titanic", inputs, target_opset=12)
    except Exception as e:
        print(e)

.. GENERATED FROM PYTHON SOURCE LINES 149-152

*scikit-learn* does implicit conversions when it can.
*sklearn-onnx* does not. The ONNX version of *OneHotEncoder*
must be applied to columns of the same type, so *pclass* is
cast to a string like the other categorical features.

.. GENERATED FROM PYTHON SOURCE LINES 152-166

.. code-block:: Python

    X_train["pclass"] = X_train["pclass"].astype(str)
    X_test["pclass"] = X_test["pclass"].astype(str)
    white_list = numeric_features + categorical_features
    to_drop = [c for c in X_train.columns if c not in white_list]
    inputs = convert_dataframe_schema(X_train, to_drop)

    model_onnx = convert_sklearn(clf, "pipeline_titanic", inputs, target_opset=12)


    # And save.
    with open("pipeline_titanic.onnx", "wb") as f:
        f.write(model_onnx.SerializeToString())

.. GENERATED FROM PYTHON SOURCE LINES 167-173

Compare the predictions
+++++++++++++++++++++++

As the final step, we need to ensure the converted model
produces the same predictions, labels and probabilities.
Let's start with *scikit-learn*.

.. GENERATED FROM PYTHON SOURCE LINES 173-177

.. code-block:: Python

    print("predict", clf.predict(X_test[:5]))
    print("predict_proba", clf.predict_proba(X_test[:1]))

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    predict [0 0 1 0 0]
    predict_proba [[0.60224126 0.39775874]]

.. GENERATED FROM PYTHON SOURCE LINES 178-187

Predictions with *onnxruntime*.
We need to remove the dropped columns and to change
the double vectors into float vectors, because the converted
graph declares these inputs as single-precision floats
(*FloatTensorType*). *onnxruntime* does not accept a
*dataframe*: inputs must be given as a dictionary mapping
each input name to an array.
One last detail: every column was declared as a one-column
matrix rather than a vector, which explains the final
*reshape*.

..
GENERATED FROM PYTHON SOURCE LINES 187-195

.. code-block:: Python

    X_test2 = X_test.drop(to_drop, axis=1)
    inputs = {c: X_test2[c].values for c in X_test2.columns}
    for c in numeric_features:
        inputs[c] = inputs[c].astype(np.float32)
    for k in inputs:
        inputs[k] = inputs[k].reshape((inputs[k].shape[0], 1))

.. GENERATED FROM PYTHON SOURCE LINES 196-197

We are ready to run *onnxruntime*.

.. GENERATED FROM PYTHON SOURCE LINES 197-204

.. code-block:: Python

    sess = rt.InferenceSession("pipeline_titanic.onnx", providers=["CPUExecutionProvider"])
    pred_onx = sess.run(None, inputs)
    print("predict", pred_onx[0][:5])
    print("predict_proba", pred_onx[1][:1])

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    predict [0 0 1 0 0]
    predict_proba [{0: 0.7899309396743774, 1: 0.21006903052330017}]

.. GENERATED FROM PYTHON SOURCE LINES 205-212

Compute intermediate outputs
++++++++++++++++++++++++++++

Unfortunately, there is currently no way to ask
*onnxruntime* for the output of intermediate nodes.
We need to modify the *ONNX* graph before it is given to
*onnxruntime*. Let's first list the intermediate outputs.

.. GENERATED FROM PYTHON SOURCE LINES 212-217

.. code-block:: Python

    model_onnx = load_onnx_model("pipeline_titanic.onnx")
    for out in enumerate_model_node_outputs(model_onnx):
        print(out)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    merged_columns
    embarkedout
    sexout
    pclassout
    concat_result
    variable
    variable2
    variable1
    transformed_column
    label
    probabilities
    output_label
    output_probability

.. GENERATED FROM PYTHON SOURCE LINES 218-224

It is not easy to tell which is which, as the *ONNX* graph
has more operators than the original *scikit-learn* pipeline.
The graph at :ref:`l-plot-complex-pipeline-graph`
helps us find the outputs of the numerical
and textual pipelines: *variable1* and *variable2*.
Let's look into the numerical pipeline first.

.. GENERATED FROM PYTHON SOURCE LINES 224-228

..
code-block:: Python

    num_onnx = select_model_inputs_outputs(model_onnx, "variable1")
    save_onnx_model(num_onnx, "pipeline_titanic_numerical.onnx")

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    b'\x08\x07\x12\x08skl2onnx\x1a\x061.17.0"\x07ai.onnx(\x002\x00:\xcd\x03\n:\n\x03age\n\x04fare\x12\x0emerged_columns\x1a\x06Concat"\x06Concat*\x0b\n\x04axis\x18\x01\xa0\x01\x02:\x00\n}\n\x0emerged_columns\x12\x08variable\x1a\x07Imputer"\x07Imputer*#\n\x14imputed_value_floats=\x00\x00\xe2A=\xcdLgA\xa0\x01\x06*\x1e\n\x14replaced_value_float\x15\x00\x00\xc0\x7f\xa0\x01\x01:\nai.onnx.ml\n^\n\x08variable\x12\tvariable1\x1a\x06Scaler"\x06Scaler*\x15\n\x06offset=\xe05\xedA=\'\xcb\nB\xa0\x01\x06*\x14\n\x05scale=\'l\x9f==\xdd,\x96<\xa0\x01\x06:\nai.onnx.ml\x12\x10pipeline_titanic*\x1f\x08\x02\x10\x07:\x0b\xff\xff\xff\xff\xff\xff\xff\xff\xff\x01\tB\x0cshape_tensorZ\x16\n\x06pclass\x12\x0c\n\n\x08\x08\x12\x06\n\x00\n\x02\x08\x01Z\x13\n\x03sex\x12\x0c\n\n\x08\x08\x12\x06\n\x00\n\x02\x08\x01Z\x13\n\x03age\x12\x0c\n\n\x08\x01\x12\x06\n\x00\n\x02\x08\x01Z\x14\n\x04fare\x12\x0c\n\n\x08\x01\x12\x06\n\x00\n\x02\x08\x01Z\x18\n\x08embarked\x12\x0c\n\n\x08\x08\x12\x06\n\x00\n\x02\x08\x01b\x0b\n\tvariable1B\x0e\n\nai.onnx.ml\x10\x01B\x04\n\x00\x10\x0b'

.. GENERATED FROM PYTHON SOURCE LINES 229-230

Let's compute the numerical features.

.. GENERATED FROM PYTHON SOURCE LINES 230-237

.. code-block:: Python

    sess = rt.InferenceSession(
        "pipeline_titanic_numerical.onnx", providers=["CPUExecutionProvider"]
    )
    numX = sess.run(None, inputs)
    print("numerical features", numX[0][:1])

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    numerical features [[-0.7512866 -0.50364053]]

.. GENERATED FROM PYTHON SOURCE LINES 238-239

We do the same for the textual features.

.. GENERATED FROM PYTHON SOURCE LINES 239-249

..
code-block:: Python print(model_onnx) text_onnx = select_model_inputs_outputs(model_onnx, "variable2") save_onnx_model(text_onnx, "pipeline_titanic_textual.onnx") sess = rt.InferenceSession( "pipeline_titanic_textual.onnx", providers=["CPUExecutionProvider"] ) numT = sess.run(None, inputs) print("textual features", numT[0][:1]) .. rst-class:: sphx-glr-script-out .. code-block:: none ir_version: 7 opset_import { domain: "ai.onnx.ml" version: 1 } opset_import { domain: "" version: 11 } producer_name: "skl2onnx" producer_version: "1.17.0" domain: "ai.onnx" model_version: 0 doc_string: "" graph { node { input: "age" input: "fare" output: "merged_columns" name: "Concat" op_type: "Concat" domain: "" attribute { name: "axis" type: INT i: 1 } } node { input: "embarked" output: "embarkedout" name: "OneHotEncoder" op_type: "OneHotEncoder" domain: "ai.onnx.ml" attribute { name: "cats_strings" type: STRINGS strings: "C" strings: "Q" strings: "S" strings: "missing" } attribute { name: "zeros" type: INT i: 1 } } node { input: "sex" output: "sexout" name: "OneHotEncoder1" op_type: "OneHotEncoder" domain: "ai.onnx.ml" attribute { name: "cats_strings" type: STRINGS strings: "female" strings: "male" } attribute { name: "zeros" type: INT i: 1 } } node { input: "pclass" output: "pclassout" name: "OneHotEncoder2" op_type: "OneHotEncoder" domain: "ai.onnx.ml" attribute { name: "cats_strings" type: STRINGS strings: "1" strings: "2" strings: "3" } attribute { name: "zeros" type: INT i: 1 } } node { input: "embarkedout" input: "sexout" input: "pclassout" output: "concat_result" name: "Concat1" op_type: "Concat" domain: "" attribute { name: "axis" type: INT i: -1 } } node { input: "merged_columns" output: "variable" name: "Imputer" op_type: "Imputer" domain: "ai.onnx.ml" attribute { name: "imputed_value_floats" type: FLOATS floats: 28.25 floats: 14.4562502 } attribute { name: "replaced_value_float" type: FLOAT f: nan } } node { input: "concat_result" input: "shape_tensor" output: 
"variable2" name: "Reshape" op_type: "Reshape" domain: "" } node { input: "variable" output: "variable1" name: "Scaler" op_type: "Scaler" domain: "ai.onnx.ml" attribute { name: "offset" type: FLOATS floats: 29.6513062 floats: 34.698391 } attribute { name: "scale" type: FLOATS floats: 0.077843 floats: 0.0183319394 } } node { input: "variable1" input: "variable2" output: "transformed_column" name: "Concat2" op_type: "Concat" domain: "" attribute { name: "axis" type: INT i: 1 } } node { input: "transformed_column" output: "label" output: "probabilities" name: "LinearClassifier" op_type: "LinearClassifier" domain: "ai.onnx.ml" attribute { name: "classlabels_ints" type: INTS ints: 0 ints: 1 } attribute { name: "coefficients" type: FLOATS floats: 0.411349356 floats: -0.0257858913 floats: -0.341414243 floats: 0.0805286616 floats: 0.334271878 floats: -0.121588431 floats: -1.24841082 floats: 1.20020878 floats: -0.920275748 floats: -0.037623141 floats: 0.909696758 floats: -0.411349356 floats: 0.0257858913 floats: 0.341414243 floats: -0.0805286616 floats: -0.334271878 floats: 0.121588431 floats: 1.24841082 floats: -1.20020878 floats: 0.920275748 floats: 0.037623141 floats: -0.909696758 } attribute { name: "intercepts" type: FLOATS floats: -0.147927582 floats: 0.147927582 } attribute { name: "multi_class" type: INT i: 0 } attribute { name: "post_transform" type: STRING s: "LOGISTIC" } } node { input: "label" output: "output_label" name: "Cast" op_type: "Cast" domain: "" attribute { name: "to" type: INT i: 7 } } node { input: "probabilities" output: "output_probability" name: "ZipMap" op_type: "ZipMap" domain: "ai.onnx.ml" attribute { name: "classlabels_int64s" type: INTS ints: 0 ints: 1 } } name: "pipeline_titanic" initializer { dims: 2 data_type: 7 int64_data: -1 int64_data: 9 name: "shape_tensor" } input { name: "pclass" type { tensor_type { elem_type: 8 shape { dim { } dim { dim_value: 1 } } } } } input { name: "sex" type { tensor_type { elem_type: 8 shape { dim { } dim { 
dim_value: 1 } } } } } input { name: "age" type { tensor_type { elem_type: 1 shape { dim { } dim { dim_value: 1 } } } } } input { name: "fare" type { tensor_type { elem_type: 1 shape { dim { } dim { dim_value: 1 } } } } } input { name: "embarked" type { tensor_type { elem_type: 8 shape { dim { } dim { dim_value: 1 } } } } } output { name: "output_label" type { tensor_type { elem_type: 7 shape { dim { } } } } } output { name: "output_probability" type { sequence_type { elem_type { map_type { key_type: 7 value_type { tensor_type { elem_type: 1 } } } } } } } } textual features [[1. 0. 0. 0. 0. 1. 0. 0. 1.]] .. GENERATED FROM PYTHON SOURCE LINES 250-254 Display the sub-ONNX graph ++++++++++++++++++++++++++ Finally, let's see both subgraphs. First, numerical pipeline. .. GENERATED FROM PYTHON SOURCE LINES 254-272 .. code-block:: Python pydot_graph = GetPydotGraph( num_onnx.graph, name=num_onnx.graph.name, rankdir="TB", node_producer=GetOpNodeProducer( "docstring", color="yellow", fillcolor="yellow", style="filled" ), ) pydot_graph.write_dot("pipeline_titanic_num.dot") os.system("dot -O -Gdpi=300 -Tpng pipeline_titanic_num.dot") image = plt.imread("pipeline_titanic_num.dot.png") fig, ax = plt.subplots(figsize=(40, 20)) ax.imshow(image) ax.axis("off") .. image-sg:: /auto_examples/images/sphx_glr_plot_intermediate_outputs_001.png :alt: plot intermediate outputs :srcset: /auto_examples/images/sphx_glr_plot_intermediate_outputs_001.png :class: sphx-glr-single-img .. rst-class:: sphx-glr-script-out .. code-block:: none (-0.5, 1229.5, 2558.5, -0.5) .. GENERATED FROM PYTHON SOURCE LINES 273-274 Then textual pipeline. .. GENERATED FROM PYTHON SOURCE LINES 274-292 .. 
code-block:: Python

    pydot_graph = GetPydotGraph(
        text_onnx.graph,
        name=text_onnx.graph.name,
        rankdir="TB",
        node_producer=GetOpNodeProducer(
            "docstring", color="yellow", fillcolor="yellow", style="filled"
        ),
    )
    pydot_graph.write_dot("pipeline_titanic_text.dot")

    os.system("dot -O -Gdpi=300 -Tpng pipeline_titanic_text.dot")

    image = plt.imread("pipeline_titanic_text.dot.png")
    fig, ax = plt.subplots(figsize=(40, 20))
    ax.imshow(image)
    ax.axis("off")

.. image-sg:: /auto_examples/images/sphx_glr_plot_intermediate_outputs_002.png
   :alt: plot intermediate outputs
   :srcset: /auto_examples/images/sphx_glr_plot_intermediate_outputs_002.png
   :class: sphx-glr-single-img

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    (-0.5, 5630.5, 2735.5, -0.5)

.. GENERATED FROM PYTHON SOURCE LINES 293-294

**Versions used for this example**

.. GENERATED FROM PYTHON SOURCE LINES 294-300

.. code-block:: Python

    print("numpy:", numpy.__version__)
    print("scikit-learn:", sklearn.__version__)
    print("onnx: ", onnx.__version__)
    print("onnxruntime: ", rt.__version__)
    print("skl2onnx: ", skl2onnx.__version__)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    numpy: 1.26.4
    scikit-learn: 1.6.dev0
    onnx:  1.17.0
    onnxruntime:  1.18.0+cu118
    skl2onnx:  1.17.0

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 4.738 seconds)