I have a set of documents, each of which belongs to a specific page. I have already computed the TF-IDF score for each document, but what I want is the average TF-IDF score per page, taken over that page's documents.
The desired output is an N (pages) × M (vocabulary) matrix. How would I do this in Spark / PySpark?
from pyspark.ml.feature import CountVectorizer, IDF, Tokenizer, StopWordsRemover
from pyspark.ml import Pipeline

# Tokenize, drop stop words, count terms, then apply IDF weighting
tokenizer = Tokenizer(inputCol="message", outputCol="tokens")
remover = StopWordsRemover(inputCol=tokenizer.getOutputCol(), outputCol="filtered")
countVec = CountVectorizer(inputCol=remover.getOutputCol(), outputCol="features", binary=True)
idf = IDF(inputCol=countVec.getOutputCol(), outputCol="idffeatures")

pipeline = Pipeline(stages=[tokenizer, remover, countVec, idf])
model = pipeline.fit(sample_results)
prediction = model.transform(sample_results)
The output of the pipeline looks like the following, with one row per document.
(466,[10,19,24,37,46,61,62,63,66,67,68,86,89,105,107,129,168,217,219,289,310,325,377,381,396,398,411,420,423],[1.6486586255873816,1.6486586255873816,1.8718021769015913,1.8718021769015913,2.159484249353372,2.159484249353372,2.159484249353372,2.159484249353372,2.159484249353372,2.159484249353372,2.159484249353372,2.159484249353372,2.159484249353372,2.159484249353372,2.159484249353372,2.5649493574615367,2.5649493574615367,2.5649493574615367,2.5649493574615367,2.5649493574615367,2.5649493574615367,2.5649493574615367,2.5649493574615367,2.5649493574615367,2.5649493574615367,2.5649493574615367,2.5649493574615367,2.5649493574615367,2.5649493574615367])
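Each entry is a SparseVector: the vocabulary size (466 here), the indices of the terms that occur in the document, and their IDF-weighted values. As a quick sanity check you can densify a single row, a minimal sketch assuming the page_name and idffeatures columns from the pipeline above:

# Inspect one document's TF-IDF vector as a dense numpy array (sketch)
row = prediction.select("page_name", "idffeatures").first()
dense = row["idffeatures"].toArray()  # length-M array, M = vocabulary size
print(row["page_name"], dense.shape)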
Answer (score: 0)
I came up with the following answer. It works, but I'm not sure it's the most efficient. I based it on this post.
import numpy as np
from scipy.sparse import csr_matrix, vstack

def as_matrix(vec):
    # Convert a Spark SparseVector into a 1 x vocab_size scipy CSR row
    data, indices = vec.values, vec.indices
    shape = 1, vec.size
    return csr_matrix((data, indices, np.array([0, vec.values.size])), shape)

def as_array(m):
    # Stack a page's document rows and take the column-wise mean
    v = vstack(m).mean(axis=0)
    return v
# Key each document's TF-IDF vector by its page, then average per page
mats = prediction.rdd.map(lambda x: (x['page_name'], as_matrix(x['idffeatures'])))
final = mats.groupByKey().mapValues(as_array).cache()
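If the groupByKey step turns out to be the bottleneck, one alternative is to let Spark average the vector column directly with pyspark.ml.stat.Summarizer, which avoids shipping scipy matrices through Python. A sketch, assuming Spark 2.4+ and the same column names:

from pyspark.ml.stat import Summarizer

# Mean TF-IDF vector per page, computed on the JVM side (sketch)
page_means = (prediction
              .groupBy("page_name")
              .agg(Summarizer.mean(prediction["idffeatures"]).alias("mean_tfidf")))
page_means.show(truncate=False)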
I stack final into an 86 × 10000 numpy matrix. Everything runs, but it is a bit slow.
# Collect the (page, mean vector) pairs to the driver before stacking
collected = final.collect()
labels = [l[0] for l in collected]
tf_matrix = np.vstack([r[1] for r in collected])
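To label the columns as well, the fitted CountVectorizerModel exposes the vocabulary in the same index order as the vectors. A sketch; stages[2] assumes the pipeline order above:

import pandas as pd

# Rows are pages, columns are vocabulary terms (sketch)
vocab = model.stages[2].vocabulary
page_term_df = pd.DataFrame(tf_matrix, index=labels, columns=vocab)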