Hello, I'm fairly new to Spark and its data structures. I'm running Spark's tf-idf example code, and I'm currently storing the results in a DataFrame, like this:
>>> rescaledData.show()
+-----+--------------------+--------------------+--------------------+--------------------+
|label| sentence| words| rawFeatures| features|
+-----+--------------------+--------------------+--------------------+--------------------+
| 0|Hi I heard about ...|[hi, i, heard, ab...|(20,[0,5,9,17],[1...|(20,[0,5,9,17],[0...|
| 0|I wish Java could...|[i, wish, java, c...|(20,[2,7,9,13,15]...|(20,[2,7,9,13,15]...|
| 1|Logistic regressi...|[logistic, regres...|(20,[4,6,13,15,18...|(20,[4,6,13,15,18...|
+-----+--------------------+--------------------+--------------------+--------------------+
>>> rescaledData.select("features").rdd.collect()
[Row(features=SparseVector(20, {0: 0.6931, 5: 0.6931, 9: 0.2877, 17: 1.3863})), Row(features=SparseVector(20, {2: 0.6931, 7: 0.6931, 9: 0.863, 13: 0.2877, 15: 0.2877})), Row(features=SparseVector(20, {4: 0.6931, 6: 0.6931, 13: 0.2877, 15: 0.2877, 18: 0.6931}))]
Is it possible to find the "most important" word (the one with the highest tf-idf value) for each sentence in my dataset? For example, in my second sentence the token with the highest value (0.863) is token number 9 -> "java". How can I compute this?
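To frame what I'm after, here is a minimal sketch of the per-row computation I have in mind, assuming the features column holds pyspark.ml.linalg.SparseVector values as shown above (the names max_feature and top_schema are just placeholders I made up):

>>> from pyspark.sql.functions import udf
>>> from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType
>>>
>>> # Result type for the UDF: the position and value of the largest tf-idf entry
>>> top_schema = StructType([
...     StructField("index", IntegerType()),
...     StructField("value", DoubleType()),
... ])
>>>
>>> def max_feature(v):
...     # v.indices / v.values hold the stored positions and tf-idf weights
...     if len(v.values) == 0:
...         return (None, None)
...     i = int(v.values.argmax())
...     return (int(v.indices[i]), float(v.values[i]))
...
>>> max_feature_udf = udf(max_feature, top_schema)
>>> rescaledData.withColumn("top", max_feature_udf("features")) \
...     .select("sentence", "top.index", "top.value").show()

For the second row this should give index 9 with value 0.863. The part I'm unsure about is mapping that index back to the actual word, since the hashing done by HashingTF is not invertible; I understand that using CountVectorizer instead would expose a vocabulary list for that lookup, but maybe there is a more direct way.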