Error generating a PMML pipeline for an SKLearn text classification pipeline

Time: 2020-09-25 19:58:16

Tags: python scikit-learn pipeline pmml

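I am trying to generate a PMML file for an SKLearn pipeline using the sklearn2pmml library in Python. The pipeline contains only a CountVectorizer and an SVC model. It is very simple, yet I cannot export it as a PMML file.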

SKLearn pipeline:

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=None)),
                ('model',
                 SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None,
                     coef0=0.0, decision_function_shape='ovr', degree=3,
                     gamma='auto', kernel='linear', max_iter=-1,
                     probability=True, random_state=None, shrinking=True,
                     tol=0.001, verbose=False))],
         verbose=False)

Script:

from sklearn2pmml import make_pmml_pipeline, sklearn2pmml

pmml_pipe = make_pmml_pipeline(sklearn_pipeline, 'text', 'label')
sklearn2pmml(pmml_pipe, 'outputs/pipeline.pmml')

Error:

Standard output is empty
Standard error:
Sep 25, 2020 3:51:45 PM org.jpmml.sklearn.Main run
INFO: Parsing PKL..
Sep 25, 2020 3:51:45 PM org.jpmml.sklearn.Main run
INFO: Parsed PKL in 217 ms.
Sep 25, 2020 3:51:45 PM org.jpmml.sklearn.Main run
INFO: Converting..
Sep 25, 2020 3:51:45 PM org.jpmml.sklearn.Main run
SEVERE: Failed to convert
java.lang.IllegalArgumentException: Attribute 'sklearn.feature_extraction.text.CountVectorizer.tokenizer' has a missing (None/null) value
    at org.jpmml.python.PythonObject.get(PythonObject.java:72)
    at sklearn.feature_extraction.text.CountVectorizer.getTokenizer(CountVectorizer.java:242)
    at sklearn.feature_extraction.text.CountVectorizer.encodeDefineFunction(CountVectorizer.java:147)
    at sklearn.feature_extraction.text.CountVectorizer.encodeFeatures(CountVectorizer.java:115)
    at sklearn.Transformer.encode(Transformer.java:60)
    at sklearn.Composite.encodeFeatures(Composite.java:119)
    at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:215)
    at org.jpmml.sklearn.Main.run(Main.java:233)
    at org.jpmml.sklearn.Main.main(Main.java:151)

Exception in thread "main" java.lang.IllegalArgumentException: Attribute 'sklearn.feature_extraction.text.CountVectorizer.tokenizer' has a missing (None/null) value
    at org.jpmml.python.PythonObject.get(PythonObject.java:72)
    at sklearn.feature_extraction.text.CountVectorizer.getTokenizer(CountVectorizer.java:242)
    at sklearn.feature_extraction.text.CountVectorizer.encodeDefineFunction(CountVectorizer.java:147)
    at sklearn.feature_extraction.text.CountVectorizer.encodeFeatures(CountVectorizer.java:115)
    at sklearn.Transformer.encode(Transformer.java:60)
    at sklearn.Composite.encodeFeatures(Composite.java:119)
    at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:215)
    at org.jpmml.sklearn.Main.run(Main.java:233)
    at org.jpmml.sklearn.Main.main(Main.java:151)

The JPMML-SkLearn conversion application has failed. The Java executable should have printed more information about the failure into its standard output and/or standard error streams

I am not sure what I am doing wrong. I am looking for a solution.

1 answer:

Answer 0 (score: 0)

You need to use a PMML-compatible text tokenizer.

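Right now you are tokenizing sentences with a free-form regular expression (CountVectorizer(tokenizer = None, token_pattern = ...)), which is why the converter reports the tokenizer attribute as missing. You need to switch to the sklearn2pmml.feature_extraction.text.Splitter tokenizer implementation (CountVectorizer(tokenizer = Splitter(), token_pattern = None)).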

A working example from the SkLearn2PMML / JPMML-SkLearn integration test suite: https://github.com/jpmml/jpmml-sklearn/blob/1.6.4/src/test/resources/main.py#L537
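As a rough sketch of that change applied to the pipeline from the question: the helper names build_pipeline and export_to_pmml, and the texts, labels and path arguments, are made up for illustration and are not part of sklearn2pmml. The active_fields and target_fields keywords are assumed to match the installed sklearn2pmml version, and all other estimator parameters are left at their defaults except the two SVC settings shown in the question.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC


def build_pipeline(tokenizer):
    """Same two steps as in the question, but with an explicit tokenizer.

    token_pattern is set to None because the regex is no longer used
    once a tokenizer callable is supplied.
    """
    return Pipeline([
        ("vect", CountVectorizer(tokenizer=tokenizer, token_pattern=None)),
        ("model", SVC(kernel="linear", probability=True)),
    ])


def export_to_pmml(texts, labels, path):
    """Fit the pipeline on (texts, labels) and write it to a PMML file."""
    # Imported here so the pipeline definition above can be inspected
    # without sklearn2pmml (and the Java runtime it calls) installed.
    from sklearn2pmml import make_pmml_pipeline, sklearn2pmml
    from sklearn2pmml.feature_extraction.text import Splitter

    pipeline = build_pipeline(Splitter())  # PMML-compatible tokenizer
    pipeline.fit(texts, labels)

    # Field names are illustrative; they label the PMML input and target.
    pmml_pipe = make_pmml_pipeline(
        pipeline, active_fields=["text"], target_fields=["label"]
    )
    sklearn2pmml(pmml_pipe, path)
```

Calling export_to_pmml(texts, labels, 'outputs/pipeline.pmml') with your own training data should then get past the tokenizer check. Note that a splitting tokenizer may produce different tokens than the default token_pattern (for example, it can keep single-character words), so the model is best refitted rather than reused.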