Getting the most important words out of Spark's TF-IDF algorithm

Date: 2016-12-12 17:09:42

Tags: apache-spark apache-spark-mllib tf-idf

Hello, I'm fairly new to Spark and its data structures. I'm running Spark's TF-IDF example code, and the results are now stored in a DataFrame, as follows:

>>> rescaledData.show()
+-----+--------------------+--------------------+--------------------+--------------------+
|label|            sentence|               words|         rawFeatures|            features|
+-----+--------------------+--------------------+--------------------+--------------------+
|    0|Hi I heard about ...|[hi, i, heard, ab...|(20,[0,5,9,17],[1...|(20,[0,5,9,17],[0...|
|    0|I wish Java could...|[i, wish, java, c...|(20,[2,7,9,13,15]...|(20,[2,7,9,13,15]...|
|    1|Logistic regressi...|[logistic, regres...|(20,[4,6,13,15,18...|(20,[4,6,13,15,18...|
+-----+--------------------+--------------------+--------------------+--------------------+

>>> rescaledData.select("features").rdd.collect()
[Row(features=SparseVector(20, {0: 0.6931, 5: 0.6931, 9: 0.2877, 17: 1.3863})), Row(features=SparseVector(20, {2: 0.6931, 7: 0.6931, 9: 0.863, 13: 0.2877, 15: 0.2877})), Row(features=SparseVector(20, {4: 0.6931, 6: 0.6931, 13: 0.2877, 15: 0.2877, 18: 0.6931}))]
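Once the rows are collected, the highest-weighted feature per sentence can be found without a running Spark session. A minimal sketch, using plain dicts copied from the `collect()` output above (a `SparseVector` exposes the same index-to-value mapping):

```python
# The three sparse vectors from rescaledData, copied as {index: tf-idf} dicts.
rows = [
    {0: 0.6931, 5: 0.6931, 9: 0.2877, 17: 1.3863},
    {2: 0.6931, 7: 0.6931, 9: 0.863, 13: 0.2877, 15: 0.2877},
    {4: 0.6931, 6: 0.6931, 13: 0.2877, 15: 0.2877, 18: 0.6931},
]

def top_feature(vec):
    """Return the (index, weight) pair with the largest tf-idf value."""
    return max(vec.items(), key=lambda kv: kv[1])

for i, vec in enumerate(rows):
    idx, weight = top_feature(vec)
    print(f"sentence {i}: feature index {idx}, tf-idf {weight}")
```

On a real DataFrame the same argmax could be wrapped in a UDF applied to the `features` column, rather than collecting everything to the driver.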

Is it possible to find the "most important" word in each sentence of my dataset (the word with the highest tf-idf value)? For example, in my second sentence, the token with the highest value (0.863) is token number 9 -> "java". How can I compute this?
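One complication, not stated in the post: `HashingTF` maps words to indices with a hash function, so an index cannot be inverted back to a word directly (index 9 may even hold several colliding tokens). A common workaround is `CountVectorizer`, whose fitted model exposes a `vocabulary` list mapping each index back to its word. Without a Spark session, the idea can be sketched in plain Python using Spark's IDF formula, `log((N + 1) / (df + 1))`, with raw term counts as TF:

```python
import math
from collections import Counter

# The three example sentences from the question, tokenized by lowercasing
# and splitting on whitespace (roughly what Tokenizer does).
docs = [
    "Hi I heard about Spark".lower().split(),
    "I wish Java could use case classes".lower().split(),
    "Logistic regression models are neat".lower().split(),
]

n_docs = len(docs)
# Document frequency: in how many sentences each word appears.
df = Counter(w for doc in docs for w in set(doc))

def tfidf(doc):
    """Per-word tf-idf for one sentence, using Spark's smoothed IDF."""
    tf = Counter(doc)
    return {w: tf[w] * math.log((n_docs + 1) / (df[w] + 1)) for w in tf}

for i, doc in enumerate(docs):
    weights = tfidf(doc)
    best = max(weights, key=weights.get)
    print(f"sentence {i}: top word {best!r} (tf-idf {weights[best]:.4f})")
```

Note that in the second sentence every word except "i" appears in exactly one document and therefore ties at the top weight, so "java" is *a* highest-scoring word rather than the unique one; the same caveat applies to the hashed features in the question.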

0 Answers:

There are no answers yet.