
.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_tutorial/plot_ngrams.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_tutorial_plot_ngrams.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_tutorial_plot_ngrams.py:

.. _example-ngrams:

Tricky issue when converting CountVectorizer or TfidfVectorizer
===============================================================

This issue is described at `scikit-learn/issues/13733
<https://github.com/scikit-learn/scikit-learn/issues/13733>`_.
If a CountVectorizer or a TfidfVectorizer produces a token that contains
a space, skl2onnx cannot tell whether it is a bi-gram or a unigram
with a space.

A simple example impossible to convert
++++++++++++++++++++++++++++++++++++++

.. GENERATED FROM PYTHON SOURCE LINES 17-41

.. code-block:: default

    import pprint
    import numpy
    from numpy.testing import assert_almost_equal
    from onnxruntime import InferenceSession
    from sklearn.feature_extraction.text import TfidfVectorizer
    from skl2onnx import to_onnx
    from skl2onnx.sklapi import TraceableTfidfVectorizer
    import skl2onnx.sklapi.register  # noqa

    corpus = numpy.array(
        [
            "This is the first document.",
            "This document is the second document.",
            "Is this the first document?",
            "",
        ]
    ).reshape((4,))

    # The character class includes a space, so a single token may contain spaces.
    pattern = r"\b[a-z ]{1,10}\b"
    mod1 = TfidfVectorizer(ngram_range=(1, 2), token_pattern=pattern)
    mod1.fit(corpus)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    TfidfVectorizer(ngram_range=(1, 2), token_pattern='\\b[a-z ]{1,10}\\b')

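The spaces come from the token pattern itself. The sketch below applies the same pattern with the standard ``re`` module to one sentence of the corpus. It assumes the vectorizer lowercases the text and then calls ``findall`` with the pattern, which is the default behaviour of the scikit-learn text vectorizers. It is an illustration only, not the library's code.

```python
import re

# Same token pattern as in the example: lowercase letters and spaces,
# between 1 and 10 characters, delimited by word boundaries.
pattern = r"\b[a-z ]{1,10}\b"

# Assumption: mimic the default preprocessing (lowercasing) by hand.
text = "This is the first document.".lower()

tokens = re.findall(pattern, text)
print(tokens)  # ['this is ', 'the first ', 'document']

# Unigrams that contain a space look like bi-grams once they are
# stored as plain strings in the vocabulary.
with_space = [t for t in tokens if " " in t]
print(with_space)  # ['this is ', 'the first ']
```

Two of the three unigrams already contain a space, so a vocabulary entry stored as a plain string gives no hint about where one token ends and the next begins.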
.. GENERATED FROM PYTHON SOURCE LINES 42-44

Unigrams and bi-grams are stored in the following dictionary,
which maps each of them to its column index.

.. GENERATED FROM PYTHON SOURCE LINES 44-48

.. code-block:: default

    pprint.pprint(mod1.vocabulary_)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    {'document': 0,
     'document ': 1,
     'document is the ': 2,
     'is the ': 3,
     'is the second ': 4,
     'is this ': 5,
     'is this the first ': 6,
     'second ': 7,
     'second document': 8,
     'the first ': 9,
     'the first document': 10,
     'this ': 11,
     'this document ': 12,
     'this is ': 13,
     'this is the first ': 14}

.. GENERATED FROM PYTHON SOURCE LINES 49-50

Conversion. It fails because skl2onnx cannot split some bi-grams back
into tokens that exist in the vocabulary.

.. GENERATED FROM PYTHON SOURCE LINES 50-57

.. code-block:: default

    try:
        to_onnx(mod1, corpus)
    except RuntimeError as e:
        print(e)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    There were ambiguities between n-grams and tokens. 2 errors occurred. You can fix it by using class TraceableTfidfVectorizer.
    You can learn more at https://github.com/scikit-learn/scikit-learn/issues/13733.
    Unable to split n-grams 'is this the first ' into tokens ('is', 'this', 'the', 'first ') existing in the vocabulary. Token 'is' does not exist in the vocabulary..
    Unable to split n-grams 'this is the first ' into tokens ('this', 'is', 'the', 'first ') existing in the vocabulary. Token 'this' does not exist in the vocabulary..

.. GENERATED FROM PYTHON SOURCE LINES 58-65

TraceableTfidfVectorizer
++++++++++++++++++++++++

Class :class:`TraceableTfidfVectorizer` is equivalent to
:class:`sklearn.feature_extraction.text.TfidfVectorizer`
but stores the unigrams and bi-grams of the vocabulary as tuples
instead of concatenating the pieces into a single string.

.. GENERATED FROM PYTHON SOURCE LINES 65-72

.. code-block:: default

    mod2 = TraceableTfidfVectorizer(ngram_range=(1, 2), token_pattern=pattern)
    mod2.fit(corpus)

    pprint.pprint(mod2.vocabulary_)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    {('document',): 0,
     ('document ',): 1,
     ('document ', 'is the '): 2,
     ('is the ',): 3,
     ('is the ', 'second '): 4,
     ('is this ',): 5,
     ('is this ', 'the first '): 6,
     ('second ',): 7,
     ('second ', 'document'): 8,
     ('the first ',): 9,
     ('the first ', 'document'): 10,
     ('this ',): 11,
     ('this ', 'document '): 12,
     ('this is ',): 13,
     ('this is ', 'the first '): 14}

.. GENERATED FROM PYTHON SOURCE LINES 73-74

Let's check that it produces the same results.

.. GENERATED FROM PYTHON SOURCE LINES 74-77

.. code-block:: default

    assert_almost_equal(mod1.transform(corpus).todense(), mod2.transform(corpus).todense())

.. GENERATED FROM PYTHON SOURCE LINES 78-82

Conversion. The line ``import skl2onnx.sklapi.register``
registers the converters associated with these new classes.
By default, only converters for scikit-learn are declared.

.. GENERATED FROM PYTHON SOURCE LINES 82-87

.. code-block:: default

    onx = to_onnx(mod2, corpus)
    sess = InferenceSession(onx.SerializeToString(), providers=["CPUExecutionProvider"])
    got = sess.run(None, {"X": corpus})

.. GENERATED FROM PYTHON SOURCE LINES 88-89

Let's check if there are discrepancies...

.. GENERATED FROM PYTHON SOURCE LINES 89-91

.. code-block:: default

    assert_almost_equal(mod2.transform(corpus).todense(), got[0])

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 0.037 seconds)

.. _sphx_glr_download_auto_tutorial_plot_ngrams.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_ngrams.py <plot_ngrams.py>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_ngrams.ipynb <plot_ngrams.ipynb>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_