Here is my dataset:
anger,happy food food
anger,dog food food
disgust,food happy food
disgust,food dog food
neutral,food food happy
neutral,food food dog
Next, here is my code; I use the CountVectorizer class to vectorize each row's sequence of words.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.ml.feature import CountVectorizer
import pandas as pd

# Map each emotion label to a numeric class.
classValues = {'anger': '0',
               'disgust': '1',
               'fear': '2',
               'happiness': '3',
               'sadness': '4',
               'surprise': '5',
               'neutral': '6'}

def getClass(line):
    # The label is the first comma-separated field.
    parts = line.split(',')
    return float(classValues[parts[0]])

def getTags(line):
    # The tags are the space-separated words after the comma.
    parts = line.split(',')
    return parts[1].split(" ")

conf = SparkConf()
conf.setAppName("NaiveBaye")
conf.set('spark.driver.memory', '6g')
conf.set('spark.executor.memory', '6g')
conf.set('spark.cores.max', 156)
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

data = sc.textFile('dataset.txt')
classes = data.map(getClass).collect()
tags = data.map(getTags).collect()

d = {
    'tags': tags,
    'classes': classes
}
df = sqlContext.createDataFrame(pd.DataFrame(data=d))

cv = CountVectorizer(inputCol="tags", outputCol="vectors")
model = cv.fit(df)
model.transform(df).show(truncate=False)

vocabulary = sorted(map(str, model.vocabulary))
print vocabulary
As you can see here, both model.transform(df).show(truncate=False) and print vocabulary work perfectly:
+-------+-------------------+-------------------+
|classes|tags |vectors |
+-------+-------------------+-------------------+
|0.0 |[happy, food, food]|(3,[0,1],[2.0,1.0])|
|0.0 |[dog, food, food] |(3,[0,2],[2.0,1.0])|
|1.0 |[food, happy, food]|(3,[0,1],[2.0,1.0])|
|1.0 |[food, dog, food] |(3,[0,2],[2.0,1.0])|
|6.0 |[food, food, happy]|(3,[0,1],[2.0,1.0])|
|6.0 |[food, food, dog] |(3,[0,2],[2.0,1.0])|
+-------+-------------------+-------------------+
['dog', 'food', 'happy']
Now, how can I vectorize new elements a second time, in Python, using the same vocabulary?
For example,
anger,happy dog food
would become:
|0.0 |[happy, dog, food]|(3,[0,1,2],[1.0,1.0,1.0])|
I have read in the documentation that a CountVectorizerModel exists, which should allow loading an existing vocabulary, but I could not find any example of this.
This is very important for me, because to classify a new element I need the vectors in the same order, so that I can use the same model with my classifier.
I tried something like this:
CountVectorizerModel(vocabulary)
but it does not work.
I am currently using Spark 1.6.1.
Answer 0 (score: 2)
As of Spark 2.0, this is available in pyspark: a CountVectorizerModel can be persisted and loaded just like any other spark-ml model.
OK, let's first create a model:
from pyspark.sql import SparkSession
from pyspark.ml.feature import CountVectorizer, CountVectorizerModel

spark = SparkSession.builder.appName("CountVectorizerExample").getOrCreate()

# Input data: each row is a bag of words with an ID.
df = spark.createDataFrame([
    (0, "a b c".split(" ")),
    (1, "a b b c a".split(" "))
], ["id", "words"])

# Fit a CountVectorizerModel from the corpus.
cv = CountVectorizer(inputCol="words", outputCol="features", vocabSize=3, minDF=2.0)
model = cv.fit(df)

result = model.transform(df)
result.show(truncate=False)
# +---+---------------+-------------------------+
# |id |words |features |
# +---+---------------+-------------------------+
# |0 |[a, b, c] |(3,[0,1,2],[1.0,1.0,1.0])|
# |1 |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])|
# +---+---------------+-------------------------+
Then persist it:
model.save("/tmp/count_vec_model")
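Note that save fails if the path already exists; the standard spark-ml writer API also lets you overwrite it, for example:

# Overwrite an existing path instead of failing on re-runs.
model.write().overwrite().save("/tmp/count_vec_model")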
Now you can load it back and use it:
same_model = CountVectorizerModel.load("/tmp/count_vec_model")
same_model.transform(df).show(truncate=False)
# +---+---------------+-------------------------+
# |id |words |features |
# +---+---------------+-------------------------+
# |0 |[a, b, c] |(3,[0,1,2],[1.0,1.0,1.0])|
# |1 |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])|
# +---+---------------+-------------------------+
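In Spark 2.4 and later there is an even more direct route: pyspark's CountVectorizerModel.from_vocabulary builds a model from an existing vocabulary list, which is essentially what the question attempted with CountVectorizerModel(vocabulary). A minimal sketch, reusing the df defined above:

from pyspark.ml.feature import CountVectorizerModel

# Spark 2.4+: build the model from a fixed vocabulary instead of fitting.
model_from_vocab = CountVectorizerModel.from_vocabulary(
    ["a", "b", "c"], inputCol="words", outputCol="features")
model_from_vocab.transform(df).show(truncate=False)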
For more details, see the documentation on Saving and loading spark-ml models/pipelines.
The model-creation code example can be found in the official documentation.
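Since the question targets Spark 1.6.1, where neither model persistence nor from_vocabulary is available in pyspark, one fallback is to rebuild the sparse vectors by hand from the saved vocabulary. A minimal sketch, assuming vocabulary holds model.vocabulary from the first run in its original order (the vectorize helper is hypothetical):

from collections import Counter
from pyspark.mllib.linalg import Vectors

# model.vocabulary from the first fit, in its original (frequency) order.
vocabulary = ['food', 'happy', 'dog']
index_of = {word: i for i, word in enumerate(vocabulary)}

def vectorize(tags):
    # Count in-vocabulary words only; out-of-vocabulary words are
    # dropped, mirroring CountVectorizer's transform behaviour.
    counts = Counter(index_of[w] for w in tags if w in index_of)
    return Vectors.sparse(len(vocabulary), sorted(counts.items()))

print vectorize("happy dog food".split(" "))
# (3,[0,1,2],[1.0,1.0,1.0])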