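I am trying to generate a PMML file for an SKLearn pipeline using the sklearn2pmml library in Python. The pipeline contains only a CountVectorizer and an SVC model. It is very simple, yet I cannot export it as a PMML file.
SKLearn pipeline: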
Pipeline(memory=None,
steps=[('vect',
CountVectorizer(analyzer='word', binary=False,
decode_error='strict',
dtype=<class 'numpy.int64'>, encoding='utf-8',
input='content', lowercase=True, max_df=1.0,
max_features=None, min_df=1,
ngram_range=(1, 1), preprocessor=None,
stop_words=None, strip_accents=None,
token_pattern='(?u)\\b\\w\\w+\\b',
tokenizer=None, vocabulary=None)),
('model',
SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None,
coef0=0.0, decision_function_shape='ovr', degree=3,
gamma='auto', kernel='linear', max_iter=-1,
probability=True, random_state=None, shrinking=True,
tol=0.001, verbose=False))],
verbose=False)
Script:
from sklearn2pmml import make_pmml_pipeline, sklearn2pmml
pmml_pipe = make_pmml_pipeline(sklearn_pipeline, 'text', 'label')
sklearn2pmml(pmml_pipe, 'outputs/pipeline.pmml')
Error:
Standard output is empty
Standard error:
Sep 25, 2020 3:51:45 PM org.jpmml.sklearn.Main run
INFO: Parsing PKL..
Sep 25, 2020 3:51:45 PM org.jpmml.sklearn.Main run
INFO: Parsed PKL in 217 ms.
Sep 25, 2020 3:51:45 PM org.jpmml.sklearn.Main run
INFO: Converting..
Sep 25, 2020 3:51:45 PM org.jpmml.sklearn.Main run
SEVERE: Failed to convert
java.lang.IllegalArgumentException: Attribute 'sklearn.feature_extraction.text.CountVectorizer.tokenizer' has a missing (None/null) value
at org.jpmml.python.PythonObject.get(PythonObject.java:72)
at sklearn.feature_extraction.text.CountVectorizer.getTokenizer(CountVectorizer.java:242)
at sklearn.feature_extraction.text.CountVectorizer.encodeDefineFunction(CountVectorizer.java:147)
at sklearn.feature_extraction.text.CountVectorizer.encodeFeatures(CountVectorizer.java:115)
at sklearn.Transformer.encode(Transformer.java:60)
at sklearn.Composite.encodeFeatures(Composite.java:119)
at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:215)
at org.jpmml.sklearn.Main.run(Main.java:233)
at org.jpmml.sklearn.Main.main(Main.java:151)
Exception in thread "main" java.lang.IllegalArgumentException: Attribute 'sklearn.feature_extraction.text.CountVectorizer.tokenizer' has a missing (None/null) value
at org.jpmml.python.PythonObject.get(PythonObject.java:72)
at sklearn.feature_extraction.text.CountVectorizer.getTokenizer(CountVectorizer.java:242)
at sklearn.feature_extraction.text.CountVectorizer.encodeDefineFunction(CountVectorizer.java:147)
at sklearn.feature_extraction.text.CountVectorizer.encodeFeatures(CountVectorizer.java:115)
at sklearn.Transformer.encode(Transformer.java:60)
at sklearn.Composite.encodeFeatures(Composite.java:119)
at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:215)
at org.jpmml.sklearn.Main.run(Main.java:233)
at org.jpmml.sklearn.Main.main(Main.java:151)
The JPMML-SkLearn conversion application has failed. The Java executable should have printed more information about the failure into its standard output and/or standard error streams
I am not sure what I am doing wrong and am looking for a solution.
Answer 0: (score: 0)
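You need to use a PMML-compatible text tokenizer.
Right now you are tokenizing sentences with a free-form regular expression (CountVectorizer(tokenizer = None, token_pattern = ...)), which the converter cannot translate to PMML; that is why it reports the tokenizer attribute as missing. You need to switch to the sklearn2pmml.feature_extraction.text.Splitter tokenizer implementation (CountVectorizer(tokenizer = Splitter(), token_pattern = None)) and refit the pipeline before exporting it.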
A working example from the SkLearn2PMML / JPMML-SkLearn integration test suite: https://github.com/jpmml/jpmml-sklearn/blob/1.6.4/src/test/resources/main.py#L537
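Switching tokenizers can change the vocabulary the vectorizer learns, so it may be worth comparing the two approaches on a few sample inputs before retraining. Below is a minimal sketch in plain Python. `regex_tokens` applies the default token_pattern shown in the question's pipeline. `whitespace_tokens` is a hypothetical stand-in that assumes a simple split on whitespace; it is not Splitter's actual implementation, which may handle punctuation differently depending on the sklearn2pmml version.

```python
import re

# Default CountVectorizer token pattern, copied from the pipeline in the question:
# words of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def regex_tokens(text):
    """Lowercase the text, then return every match of the default token pattern."""
    return TOKEN_PATTERN.findall(text.lower())


def whitespace_tokens(text):
    """Hypothetical stand-in for a separator-based tokenizer (assumption:
    tokens are separated by runs of whitespace)."""
    return re.split(r"\s+", text.lower().strip())


sample = "I love PMML, don't you?"
print("regex     :", regex_tokens(sample))
print("whitespace:", whitespace_tokens(sample))
```

In this sketch the regex drops one-character tokens and punctuation, while the whitespace split keeps both, so a model trained with one tokenizer should not be expected to behave identically with the other.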