How to extract the vocabulary from a Pipeline

Date: 2017-10-12 17:27:05

Tags: python apache-spark pyspark apache-spark-mllib

I can extract the vocabulary from a CountVectorizerModel like this:

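from pyspark.ml.feature import CountVectorizer, StopWordsRemover

# Drop stop words from the tokenized "words" column
fl = StopWordsRemover(inputCol="words", outputCol="filtered")
df = fl.transform(df)
# Fitting the CountVectorizer returns a CountVectorizerModel
cv = CountVectorizer(inputCol="filtered", outputCol="rawFeatures")
model = cv.fit(df)

print(model.vocabulary)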

The code above prints the vocabulary as a list, where each term's position in the list is its index (id).
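Since a term's id is just its position in the list, an id-to-term lookup can be built with plain Python. This is only a sketch: `index_vocabulary` is a hypothetical helper name, and the sample list stands in for `model.vocabulary`.

```python
def index_vocabulary(vocabulary):
    """Map each term id (its position in the list) to the term itself."""
    return dict(enumerate(vocabulary))


# Stand-in for model.vocabulary, which is a plain list of strings
sample_vocabulary = ["spark", "python", "pipeline"]
id_to_term = index_vocabulary(sample_vocabulary)
```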

Now I have turned the code above into a pipeline:

from pyspark.ml import Pipeline
from pyspark.ml.feature import CountVectorizer, StopWordsRemover

rm_stop_words = StopWordsRemover(inputCol="words", outputCol="filtered")
count_freq = CountVectorizer(inputCol=rm_stop_words.getOutputCol(), outputCol="rawFeatures")

pipeline = Pipeline(stages=[rm_stop_words, count_freq])
model = pipeline.fit(dfm)
df = model.transform(dfm)

print(model.vocabulary) # This won't work as it's not CountVectorizerModel

This raises the following error:

print(len(model.vocabulary))
     

AttributeError: 'PipelineModel' object has no attribute 'vocabulary'

So how do I extract a model attribute from the pipeline?

1 Answer:

Answer 0: (score: 3)

As with any other stage attribute, first extract the stages from the fitted PipelineModel:

stages = model.stages

Find the one(s) you are interested in:

from pyspark.ml.feature import CountVectorizerModel

vectorizers = [s for s in stages if isinstance(s, CountVectorizerModel)]

And get the desired field:

[v.vocabulary for v in vectorizers]
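The three steps above can be folded into one small helper. This is a minimal sketch, not part of the PySpark API: `collect_stage_attribute` is a hypothetical name, and it only assumes that the fitted model exposes its fitted stages as a list in `stages`, as `PipelineModel` does.

```python
def collect_stage_attribute(pipeline_model, stage_type, attribute):
    """Return `attribute` from every stage of `pipeline_model` that is a `stage_type`.

    Relies only on `pipeline_model.stages`, the list of fitted stages.
    """
    return [
        getattr(stage, attribute)
        for stage in pipeline_model.stages
        if isinstance(stage, stage_type)
    ]


# Usage with the fitted pipeline from the question (requires PySpark):
#   from pyspark.ml.feature import CountVectorizerModel
#   vocabularies = collect_stage_attribute(model, CountVectorizerModel, "vocabulary")
#   vocabulary = vocabularies[0]  # the pipeline has a single CountVectorizer stage
```

Because the CountVectorizer is the last stage of this particular pipeline, `model.stages[-1].vocabulary` should give the same list. Filtering by type is more robust if the order of stages changes later.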