Question

I am computing TF and IDF with PySpark's HashingTF and IDF using the following code:

from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.feature import IDF

sc = SparkContext()

# Load documents (one per line).
documents = sc.textFile("random.txt").map(lambda line: line.split(" "))

hashingTF = HashingTF()
tf = hashingTF.transform(documents)
tf.cache()

idf = IDF(minDocFreq=2).fit(tf)
tfidf = idf.transform(tf)

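The problem: I can print tfidf to the screen with the collect() method, but how do I access specific data inside it, or save the whole tfidf vector space to an external file or a DataFrame?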

Answer 1

HashingTF and IDF return an RDD in which each element is a pyspark.mllib.linalg.Vector (org.apache.spark.mllib.linalg.Vector in Scala). This means:

You can access individual elements with simple indexing. For example:

documents = sc.textFile("README.md").map(lambda line: line.split(" "))
tf = HashingTF().transform(documents)
idf = IDF().fit(tf)
tfidf = idf.transform(tf)

v = tfidf.first()
v
## SparseVector(1048576, {261052: 0.0, 362890: 0.0, 816618: 1.9253})

type(v)
## pyspark.mllib.linalg.SparseVector

v[0]
## 0.0

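The vectors can be saved directly to a text file. Vectors provide a meaningful string representation and a parse method that can be used to restore the original structure.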
```
from pyspark.mllib.linalg import Vectors

tfidf.saveAsTextFile("/tmp/tfidf")
sc.textFile("/tmp/tfidf/").map(Vectors.parse)
```
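
If the saved files need to be read outside Spark, the string representation is simple enough to parse by hand. This is a minimal sketch, assuming every line holds a sparse vector in the (size,[indices],[values]) form shown in the output above; dense vectors and non-finite values would need separate handling. parse_sparse_line is a hypothetical helper, not part of PySpark.

```python
import ast


def parse_sparse_line(line):
    """Parse '(size,[indices],[values])' into (size, {index: value})."""
    # The saved text happens to be a valid Python literal: a tuple of
    # an int and two lists, so ast.literal_eval can read it safely.
    size, indices, values = ast.literal_eval(line.strip())
    return size, dict(zip(indices, values))


# Sample line in the saved format; in practice, loop over the part-* files.
line = "(1048576,[261052,362890,816618],[0.0,0.0,1.9253])"
size, entries = parse_sparse_line(line)

print(size)
print(entries)
```

Inside Spark, Vectors.parse remains the more direct choice, since it returns a proper vector object.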

They can be put in a DataFrame:

df = tfidf.map(lambda v: (v, )).toDF(["features"])

df.printSchema()
## root
## |-- features: vector (nullable = true)

df.show(1, False)
## +-------------------------------------------------------------+
## |features                                                     |
## +-------------------------------------------------------------+
## |(1048576,[261052,362890,816618],[0.0,0.0,1.9252908618525775])|
## +-------------------------------------------------------------+
## only showing top 1 row

HashingTF is irreversible, so it cannot be used to extract information about specific tokens. See "How to get word details from TF Vector RDD in Spark ML Lib?"
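
To illustrate why the mapping cannot be inverted, here is a small pure-Python sketch of the hashing trick. It uses zlib.crc32 as a stand-in hash, which is an assumption made for reproducibility and not the function Spark uses. index_of and reverse_lookup are hypothetical helpers. Hashing a known vocabulary gives, at best, a set of candidate terms per index.

```python
import zlib
from collections import defaultdict


def index_of(term, num_features):
    """Map a term to a bucket. crc32 is a stand-in, not Spark's hash."""
    return zlib.crc32(term.encode("utf-8")) % num_features


def reverse_lookup(vocabulary, num_features):
    """Build bucket index -> set of candidate terms from a known vocabulary."""
    buckets = defaultdict(set)
    for term in vocabulary:
        buckets[index_of(term, num_features)].add(term)
    return dict(buckets)


vocabulary = ["apache", "spark", "hashing", "tf", "idf"]

# Deliberately tiny feature space: 5 terms into 4 buckets forces a collision,
# so at least one index maps back to more than one term.
buckets = reverse_lookup(vocabulary, num_features=4)
for index, terms in sorted(buckets.items()):
    print(index, sorted(terms))
```

With the real transformer, hashingTF.indexOf(term) plays the role of index_of, so the same reverse table can be built from your own token list. Collisions are far less likely with the default feature space, but they remain possible.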

Handle data in a TF-IDF SparseVector or save it to a DataFrame or an external file