Dataframe as an input#

A pipeline usually ingests data as a matrix. It may be converted in a matrix if all the data share the same type. But data held in a dataframe have usually multiple types, float, integer or string for categories. ONNX also supports that case.

A dataset with categories#

from mlinsights.plotting import pipeline2dot
import numpy
import pprint
from mlprodict.onnx_conv import guess_schema_from_data
from onnxruntime import InferenceSession
from pyquickhelper.helpgen.graphviz_helper import plot_graphviz
from mlprodict.onnxrt import OnnxInference
from mlprodict.onnx_conv import to_onnx as to_onnx_ext
from skl2onnx import to_onnx
from pandas import DataFrame
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier


data = DataFrame([
    dict(CAT1='a', CAT2='c', num1=0.5, num2=0.6, y=0),
    dict(CAT1='b', CAT2='d', num1=0.4, num2=0.8, y=1),
    dict(CAT1='a', CAT2='d', num1=0.5, num2=0.56, y=0),
    dict(CAT1='a', CAT2='d', num1=0.55, num2=0.56, y=1),
    dict(CAT1='a', CAT2='c', num1=0.35, num2=0.86, y=0),
    dict(CAT1='a', CAT2='c', num1=0.5, num2=0.68, y=1),
])

cat_cols = ['CAT1', 'CAT2']
train_data = data.drop('y', axis=1)


categorical_transformer = Pipeline([
    ('onehot', OneHotEncoder(sparse=False, handle_unknown='ignore'))])
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, cat_cols)],
    remainder='passthrough')
pipe = Pipeline([('preprocess', preprocessor),
                 ('rf', RandomForestClassifier())])
pipe.fit(train_data, data['y'])

Out:

Pipeline(steps=[('preprocess',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('cat',
                                                  Pipeline(steps=[('onehot',
                                                                   OneHotEncoder(handle_unknown='ignore',
                                                                                 sparse=False))]),
                                                  ['CAT1', 'CAT2'])])),
                ('rf', RandomForestClassifier())])

Display.

dot = pipeline2dot(pipe, train_data)
ax = plot_graphviz(dot)
ax.get_xaxis().set_visible(False)
ax.get_yaxis().set_visible(False)
plot gbegin dataframe

Conversion to ONNX#

Function to_onnx does not handle dataframes.

try:
    onx = to_onnx(pipe, train_data[:1])
except NotImplementedError as e:
    print(e)

But it possible to use an extended one.

onx = to_onnx_ext(
    pipe, train_data[:1],
    options={RandomForestClassifier: {'zipmap': False}})

Graph#

oinf = OnnxInference(onx)
ax = plot_graphviz(oinf.to_dot())
ax.get_xaxis().set_visible(False)
ax.get_yaxis().set_visible(False)
plot gbegin dataframe

Prediction with ONNX#

onnxruntime does not support dataframes.

sess = InferenceSession(onx.SerializeToString())
try:
    sess.run(None, train_data)
except Exception as e:
    print(e)

Out:

run(): incompatible function arguments. The following argument types are supported:
    1. (self: onnxruntime.capi.onnxruntime_pybind11_state.InferenceSession, arg0: List[str], arg1: Dict[str, object], arg2: onnxruntime.capi.onnxruntime_pybind11_state.RunOptions) -> List[object]

Invoked with: <onnxruntime.capi.onnxruntime_pybind11_state.InferenceSession object at 0x0000021DD4C37DF0>, ['label', 'probabilities'],   CAT1 CAT2  num1  num2
0    a    c  0.50  0.60
1    b    d  0.40  0.80
2    a    d  0.50  0.56
3    a    d  0.55  0.56
4    a    c  0.35  0.86
5    a    c  0.50  0.68, None

Let’s use a shortcut

oinf = OnnxInference(onx)
got = oinf.run(train_data)
print(pipe.predict(train_data))
print(got['label'])

Out:

[0 1 0 1 0 1]
[0 1 0 1 0 1]

And probilities.

print(pipe.predict_proba(train_data))
print(got['probabilities'])

Out:

[[0.89 0.11]
 [0.43 0.57]
 [0.78 0.22]
 [0.36 0.64]
 [0.74 0.26]
 [0.29 0.71]]
[[0.89       0.10999998]
 [0.43000007 0.56999993]
 [0.78000003 0.21999998]
 [0.36       0.64      ]
 [0.74       0.26      ]
 [0.29000002 0.71      ]]

It looks ok. Let’s dig into the details to directly use onnxruntime.

Unhide conversion logic with a dataframe#

A dataframe can be seen as a set of columns with different types. That’s what ONNX should see: a list of inputs, the input name is the column name, the input type is the column type.

init = guess_schema_from_data(train_data)

pprint.pprint(init)

Out:

[('CAT1', StringTensorType(shape=[None, 1])),
 ('CAT2', StringTensorType(shape=[None, 1])),
 ('num1', DoubleTensorType(shape=[None, 1])),
 ('num2', DoubleTensorType(shape=[None, 1]))]

Let’s use float instead.

for c in train_data.columns:
    if c not in cat_cols:
        train_data[c] = train_data[c].astype(numpy.float32)

init = guess_schema_from_data(train_data)
pprint.pprint(init)

Out:

[('CAT1', StringTensorType(shape=[None, 1])),
 ('CAT2', StringTensorType(shape=[None, 1])),
 ('num1', FloatTensorType(shape=[None, 1])),
 ('num2', FloatTensorType(shape=[None, 1]))]

Let’s convert with skl2onnx only.

onx2 = to_onnx(
    pipe, initial_types=init,
    options={RandomForestClassifier: {'zipmap': False}})

Let’s run it with onnxruntime. We need to convert the dataframe into a dictionary where column names become keys, and column values become values.

inputs = {c: train_data[c].values.reshape((-1, 1))
          for c in train_data.columns}
pprint.pprint(inputs)

Out:

{'CAT1': array([['a'],
       ['b'],
       ['a'],
       ['a'],
       ['a'],
       ['a']], dtype=object),
 'CAT2': array([['c'],
       ['d'],
       ['d'],
       ['d'],
       ['c'],
       ['c']], dtype=object),
 'num1': array([[0.5 ],
       [0.4 ],
       [0.5 ],
       [0.55],
       [0.35],
       [0.5 ]], dtype=float32),
 'num2': array([[0.6 ],
       [0.8 ],
       [0.56],
       [0.56],
       [0.86],
       [0.68]], dtype=float32)}

Inference.

sess2 = InferenceSession(onx2.SerializeToString())

got2 = sess2.run(None, inputs)

print(pipe.predict(train_data))
print(got2[0])

Out:

[0 1 0 1 0 1]
[0 1 0 1 0 1]

And probilities.

print(pipe.predict_proba(train_data))
print(got2[1])

Out:

[[0.89 0.11]
 [0.43 0.57]
 [0.78 0.22]
 [0.36 0.64]
 [0.74 0.26]
 [0.29 0.71]]
[[0.89       0.10999998]
 [0.43000025 0.56999975]
 [0.78       0.22000003]
 [0.3600003  0.6399997 ]
 [0.74       0.26000002]
 [0.29000038 0.7099996 ]]

Total running time of the script: ( 0 minutes 1.270 seconds)

Gallery generated by Sphinx-Gallery