.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_tutorial/plot_transformer_discrepancy.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_tutorial_plot_transformer_discrepancy.py:

.. _example-transform-discrepancy:

Dealing with discrepancies (tf-idf)
===================================

.. index:: tf-idf

``TfidfVectorizer`` is one transform for which the converted ONNX model may
produce results that differ from the original model's. The larger the
vocabulary, the more likely a discrepancy becomes. This example proposes an
equivalent model with no discrepancies.

Imports, setups
+++++++++++++++

All imports, followed by two helpers: ``print_sparse_matrix`` renders a matrix
as text, and ``diff`` returns the mean absolute difference between two matrices
of the same shape.

.. GENERATED FROM PYTHON SOURCE LINES 22-60

.. code-block:: Python

    import pprint
    import numpy
    from sklearn.pipeline import Pipeline
    from sklearn.compose import ColumnTransformer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from onnxruntime import InferenceSession
    from skl2onnx import to_onnx


    def print_sparse_matrix(m):
        # Render a matrix as text: "." for non-positive entries,
        # "A" to "Z" for positive entries scaled between the min and the max.
        nonan = numpy.nan_to_num(m)
        mi, ma = nonan.min(), nonan.max()
        if mi == ma:
            # Avoid a division by zero when all values are equal.
            ma += 1
        mat = numpy.empty(m.shape, dtype=numpy.str_)
        mat[:, :] = "."
        if hasattr(m, "todense"):
            dense = m.todense()
        else:
            dense = m
        for i in range(m.shape[0]):
            for j in range(m.shape[1]):
                if dense[i, j] > 0:
                    c = int((dense[i, j] - mi) / (ma - mi) * 25)
                    mat[i, j] = chr(ord("A") + c)
        return "\n".join("".join(line) for line in mat)


    def diff(a, b):
        # Mean absolute difference between two matrices of the same shape.
        if a.shape != b.shape:
            raise ValueError(
                f"Cannot compare matrices with different shapes {a.shape} != {b.shape}."
            )
        d = numpy.abs(a - b).sum() / a.size
        return d

.. GENERATED FROM PYTHON SOURCE LINES 61-65

Artificial datasets
+++++++++++++++++++

A small array of strings: plain sentences, code-like text, an identifier,
URLs, and a timestamp.

..
   GENERATED FROM PYTHON SOURCE LINES 65-82

.. code-block:: Python

    strings = numpy.array(
        [
            "This a sentence.",
            "This a sentence with more characters $^*&'(-...",
            """var = ClassName(var2, user=mail@anywhere.com, pwd"""
            """=")_~-('&]@^\\`|[{#")""",
            "c79857654",
            "https://complex-url.com/;76543u3456?g=hhh&h=23",
            "01-03-05T11:12:13",
            "https://complex-url.com/;dd76543u3456?g=ddhhh&h=23",
        ]
    ).reshape((-1, 1))

    pprint.pprint(strings)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    array([['This a sentence.'],
           ["This a sentence with more characters $^*&'(-..."],
           ['var = ClassName(var2, user=mail@anywhere.com, pwd=")_~-(\'&]@^\\`|[{#")'],
           ['c79857654'],
           ['https://complex-url.com/;76543u3456?g=hhh&h=23'],
           ['01-03-05T11:12:13'],
           ['https://complex-url.com/;dd76543u3456?g=ddhhh&h=23']],
          dtype='
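The sample strings mix punctuation, URLs, and digits, so it can help to see how
they split into tokens. The sketch below is not part of the original example:
it assumes scikit-learn's default ``token_pattern`` (``(?u)\b\w\w+\b``) and
default lowercasing, and approximates that step with the standard ``re``
module. The helper name ``tokenize`` is hypothetical.

.. code-block:: Python

    import re

    # Assumption: mirrors scikit-learn's default token_pattern for text vectorizers.
    TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


    def tokenize(text):
        # Hypothetical helper: lowercase, then keep runs of two or more word characters.
        return TOKEN_PATTERN.findall(text.lower())


    for s in ["This a sentence.", "c79857654", "01-03-05T11:12:13"]:
        print(s, "->", tokenize(s))

Under these assumptions, single-character tokens such as ``a`` are dropped and
punctuation acts as a separator, so a timestamp breaks into several short
tokens while an identifier like ``c79857654`` stays whole.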