Sum of TF-IDF SparseVector values per document in Spark using Python

Asked: 2016-02-26 16:09:30

Tags: python apache-spark tf-idf apache-spark-mllib

I computed TF-IDF for 3 sample text documents using PySpark's HashingTF and IDF, and got the following SparseVector results:

(1048576, [558379], [1.43841036226])
(1048576, [181911, 558379, 959994], [0.287682072452, 0.287682072452, 0.287682072452])
(1048576, [181911, 959994], [0.287682072452, 0.287682072452])
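(For context, here is a minimal sketch of the kind of pipeline that produces such vectors. The tokenized sample documents are hypothetical and `sc` is assumed to be an existing SparkContext; the mllib HashingTF default of 2^20 features matches the vector size of 1048576 above.)

from pyspark.mllib.feature import HashingTF, IDF

# Hypothetical pre-tokenized documents
documents = sc.parallelize([
    ["spark"],
    ["spark", "hadoop", "python"],
    ["hadoop", "python"]])

hashingTF = HashingTF()      # default numFeatures is 2^20 = 1048576
tf = hashingTF.transform(documents)
tf.cache()                   # recommended: fit and transform each pass over the data
idf = IDF().fit(tf)
tfidf = idf.transform(tf)    # RDD of SparseVectors like those shown above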

How can I compute the sum of the TF-IDF values of all terms in each document? E.g. (0.287682072452 + 0.287682072452) for the 3rd document.

1 Answer:

Answer 0 (score: 2)

The output of IDF is just a PySpark SparseVector. When exposed to Python, its values attribute is a standard NumPy array, so all you need is a sum call:

from pyspark.mllib.linalg import SparseVector

# A single document's TF-IDF vector: (size, indices, values)
v = SparseVector(1048576, [181911, 959994], [0.287682072452, 0.287682072452])
v.values.sum()  # v.values is a NumPy array
## 0.57536414490400001

Or over an RDD of such vectors:

rdd = sc.parallelize([
    SparseVector(1048576, [558379], [1.43841036226]),
    SparseVector(1048576, [181911, 558379, 959994],
        [0.287682072452, 0.287682072452, 0.287682072452]),
    SparseVector(1048576, [181911, 959994], [0.287682072452, 0.287682072452])])

# Returns an RDD with one per-document sum (lazy; nothing runs yet)
rdd.map(lambda v: v.values.sum())
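
(To actually materialize the sums you need an action such as collect; and if document identity matters, a pattern like zipWithIndex can pair each sum with its position. A sketch, assuming the RDD order matches the original documents:)

# Per-document sums, keyed by document position
sums = rdd.map(lambda v: v.values.sum()).zipWithIndex() \
    .map(lambda pair: (pair[1], pair[0])).collect()
## roughly [(0, 1.43841036226), (1, 0.863046217356), (2, 0.575364144904)]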