Tricky issue when converting CountVectorizer or TfidfVectorizer#

This issue is described at scikit-learn/issues/13733. If a CountVectorizer or a TfidfVectorizer produces a token containing a space, skl2onnx cannot tell whether it is a bi-gram or a unigram with a space.

A simple example impossible to convert#

import pprint
import numpy
from numpy.testing import assert_almost_equal
from onnxruntime import InferenceSession
from sklearn.feature_extraction.text import TfidfVectorizer
from skl2onnx import to_onnx
from skl2onnx.sklapi import TraceableTfidfVectorizer
import skl2onnx.sklapi.register  # noqa

corpus = numpy.array(
    [
        "This is the first document.",
        "This document is the second document.",
        "Is this the first document?",
        "",
    ]
).reshape((4,))

pattern = r"\b[a-z ]{1,10}\b"
mod1 = TfidfVectorizer(ngram_range=(1, 2), token_pattern=pattern)
mod1.fit(corpus)
TfidfVectorizer(ngram_range=(1, 2), token_pattern='\\b[a-z ]{1,10}\\b')
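The pattern allows a space inside a token. As a quick check (this snippet is only illustrative and not part of the original example), scikit-learn lowercases each document and applies the token pattern with re.findall, which already yields tokens ending with a space:

import re
print(re.findall(pattern, corpus[0].lower()))
# ['this is ', 'the first ', 'document']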


Unigrams and bi-grams are stored in the following container, which maps each of them to its column index.

pprint.pprint(mod1.vocabulary_)
{'document': 0,
 'document ': 1,
 'document  is the ': 2,
 'is the ': 3,
 'is the  second ': 4,
 'is this ': 5,
 'is this  the first ': 6,
 'second ': 7,
 'second  document': 8,
 'the first ': 9,
 'the first  document': 10,
 'this ': 11,
 'this  document ': 12,
 'this is ': 13,
 'this is  the first ': 14}
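The ambiguity shows up in the bi-gram entries: the unigrams already contain spaces, so the concatenated key cannot be split back unambiguously. A minimal sketch (illustrative only):

entry = "is this  the first "  # a bi-gram stored as a plain string
print(entry.split(" "))        # ['is', 'this', '', 'the', 'first', '']
# None of these pieces ('is', 'this', ...) is a token of the vocabulary;
# the real unigrams were 'is this ' and 'the first ', which themselves
# contain spaces, so the converter cannot recover them from the string.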

Conversion.

try:
    to_onnx(mod1, corpus)
except RuntimeError as e:
    print(e)
There were ambiguities between n-grams and tokens. 2 errors occurred. You can fix it by using class TraceableTfidfVectorizer.
You can learn more at https://github.com/scikit-learn/scikit-learn/issues/13733.
Unable to split n-grams 'is this  the first ' into tokens ('is', 'this', 'the', 'first ') existing in the vocabulary. Token 'is' does not exist in the vocabulary..
Unable to split n-grams 'this is  the first ' into tokens ('this', 'is', 'the', 'first ') existing in the vocabulary. Token 'this' does not exist in the vocabulary..

TraceableTfidfVectorizer#

Class TraceableTfidfVectorizer is equivalent to sklearn.feature_extraction.text.TfidfVectorizer but stores the unigrams and bi-grams of the vocabulary as tuples instead of concatenating every piece into a single string.

mod2 = TraceableTfidfVectorizer(ngram_range=(1, 2), token_pattern=pattern)
mod2.fit(corpus)

pprint.pprint(mod2.vocabulary_)
{('document',): 0,
 ('document ',): 1,
 ('document ', 'is the '): 2,
 ('is the ',): 3,
 ('is the ', 'second '): 4,
 ('is this ',): 5,
 ('is this ', 'the first '): 6,
 ('second ',): 7,
 ('second ', 'document'): 8,
 ('the first ',): 9,
 ('the first ', 'document'): 10,
 ('this ',): 11,
 ('this ', 'document '): 12,
 ('this is ',): 13,
 ('this is ', 'the first '): 14}
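With tuples, each bi-gram entry lists the exact unigrams it is made of, so there is nothing left to guess. A small check (illustrative only):

bigram = ("is this ", "the first ")
assert bigram in mod2.vocabulary_
# every piece of the bi-gram is itself a unigram of the vocabulary
assert all((token,) in mod2.vocabulary_ for token in bigram)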

Let’s check it produces the same results.

assert_almost_equal(mod1.transform(corpus).todense(), mod2.transform(corpus).todense())

Conversion. The line import skl2onnx.sklapi.register was added to register the converters associated with these new classes. By default, only converters for scikit-learn are declared.

onx = to_onnx(mod2, corpus)
sess = InferenceSession(onx.SerializeToString(), providers=["CPUExecutionProvider"])
got = sess.run(None, {"X": corpus})
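The ONNX graph returns a dense matrix with one row per document and one column per vocabulary entry (the expected shape below is inferred from the example above, not from the original output):

print(got[0].shape)  # expected: (4, 15)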

Let’s check if there are discrepancies…

assert_almost_equal(mod2.transform(corpus).todense(), got[0])
