CountVectorizer: reusing the same vocabulary a second time

Date: 2018-02-22 09:53:05

Tags: python apache-spark pyspark

Here is my dataset:

anger,happy food food
anger,dog food food
disgust,food happy food
disgust,food dog food
neutral,food food happy
neutral,food food dog

Second, here is my code, where I use the CountVectorizer class to build a bag of words:

import pandas as pd

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.ml.feature import CountVectorizer

classValues = {'anger':     '0',
               'disgust':   '1',
               'fear':      '2',
               'happiness': '3',
               'sadness':   '4',
               'surprise':  '5',
               'neutral':   '6'}

def getClass(line):
    # The class label is the emotion name before the comma.
    parts = line.split(',')
    return float(classValues[parts[0]])

def getTags(line):
    # The tags are the space-separated words after the comma.
    parts = line.split(',')
    return parts[1].split(" ")

conf = SparkConf()
conf.setAppName("NaiveBayes")
conf.set('spark.driver.memory', '6g')
conf.set('spark.executor.memory', '6g')
conf.set('spark.cores.max', 156)

sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

data = sc.textFile('dataset.txt')

classes = data.map(getClass).collect()
tags = data.map(getTags).collect()

d = {
    'tags': tags,
    'classes': classes
}

df = sqlContext.createDataFrame(pd.DataFrame(data=d))
cv = CountVectorizer(inputCol="tags", outputCol="vectors")
model = cv.fit(df)
model.transform(df).show(truncate=False)

vocabulary = sorted(map(str, model.vocabulary))
print(vocabulary)

As you can see, model.transform(df).show(truncate=False) and print(vocabulary) both work perfectly:

+-------+-------------------+-------------------+
|classes|tags               |vectors            |
+-------+-------------------+-------------------+
|0.0    |[happy, food, food]|(3,[0,1],[2.0,1.0])|
|0.0    |[dog, food, food]  |(3,[0,2],[2.0,1.0])|
|1.0    |[food, happy, food]|(3,[0,1],[2.0,1.0])|
|1.0    |[food, dog, food]  |(3,[0,2],[2.0,1.0])|
|6.0    |[food, food, happy]|(3,[0,1],[2.0,1.0])|
|6.0    |[food, food, dog]  |(3,[0,2],[2.0,1.0])|
+-------+-------------------+-------------------+
['dog', 'food', 'happy']

Now, if I want to vectorize new elements a second time, using the same vocabulary, how can I do that in Python?

For example,

anger, happy dog food

would become

|0.0    |[happy, dog, food]|(3,[0,1,2],[1.0,1.0,1.0])|

I have read in the documentation that a CountVectorizerModel exists, which should make it possible to load an existing vocabulary, but I could not find anything documented about how to do this.

This is very important to me, because when I need to classify a new element, I need the same vector order so that I can reuse the same classifier model.

I tried something like this:

CountVectorizerModel(vocabulary)

but it does not work.
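
For what it's worth, newer PySpark releases (2.4 and later) do expose a classmethod that builds a model directly from a fixed vocabulary list. A minimal sketch, assuming a Spark 2.4+ environment:

from pyspark.ml.feature import CountVectorizerModel

# Build a model from a fixed vocabulary, so every new element is
# encoded with the same index order (requires Spark 2.4+).
model_from_vocab = CountVectorizerModel.from_vocabulary(
    ["food", "happy", "dog"],  # vocabulary, in the desired index order
    inputCol="tags",
    outputCol="vectors",
)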

编辑1

I am currently using Spark 1.6.1.

1 Answer:

Answer 0 (score: 2)

As of Spark 2.0, this is available in pyspark: the model can be persisted and loaded just like any other spark-ml model.

OK, let's create a model first:

from pyspark.sql import SparkSession
from pyspark.ml.feature import CountVectorizer, CountVectorizerModel

spark = SparkSession.builder.getOrCreate()

# Input data: each row is a bag of words with an ID.
df = spark.createDataFrame([
    (0, "a b c".split(" ")),
    (1, "a b b c a".split(" "))
], ["id", "words"])

# Fit a CountVectorizer on the corpus to obtain a CountVectorizerModel.
cv = CountVectorizer(inputCol="words", outputCol="features", vocabSize=3, minDF=2.0)

model = cv.fit(df)

result = model.transform(df)
result.show(truncate=False)
# +---+---------------+-------------------------+
# |id |words          |features                 |
# +---+---------------+-------------------------+
# |0  |[a, b, c]      |(3,[0,1,2],[1.0,1.0,1.0])|
# |1  |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])|
# +---+---------------+-------------------------+

Then persist it:

model.save("/tmp/count_vec_model")

Now you can load it back and use it:

same_model = CountVectorizerModel.load("/tmp/count_vec_model")
same_model.transform(df).show(truncate=False)
# +---+---------------+-------------------------+
# |id |words          |features                 |
# +---+---------------+-------------------------+
# |0  |[a, b, c]      |(3,[0,1,2],[1.0,1.0,1.0])|
# |1  |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])|
# +---+---------------+-------------------------+
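
To connect this back to the question: the loaded model encodes previously unseen rows against the same saved vocabulary, which keeps the vector indices stable for a downstream classifier. A small usage sketch, reusing spark and same_model from above:

# New, unseen data: only tokens present in the saved vocabulary
# ("a", "b", "c") are counted; out-of-vocabulary tokens are dropped.
new_df = spark.createDataFrame([
    (2, "a c c happy".split(" "))
], ["id", "words"])

same_model.transform(new_df).show(truncate=False)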

For more details, see the documentation on Saving and loading spark-ml models/pipelines.

The model-creation code example can be found in the official documentation.