Question

I applied the PySpark TF-IDF functions and got the following result.

| features |
|----------|
| (35,[7,9,11,12,19,26,33],[1.2039728043259361,1.2039728043259361,1.2039728043259361,1.6094379124341003,1.6094379124341003,1.6094379124341003,1.6094379124341003])  |
| (35,[0,2,4,5,6,11,22],[0.9162907318741551,0.9162907318741551,1.2039728043259361,1.2039728043259361,1.2039728043259361,1.2039728043259361,1.6094379124341003]) |

So the DataFrame has a single column (features) whose rows are SparseVectors.
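For readers unfamiliar with the printed form, each row reads as (size, [indices], [values]): a vector of length 35 in which only the listed positions are non-zero. The following plain-Python sketch expands the first row into a dense list; the helper name `sparse_to_dense` is made up for illustration and is not part of PySpark.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) triple into a dense list of floats."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


# First row of the table above.
row = sparse_to_dense(
    35,
    [7, 9, 11, 12, 19, 26, 33],
    [1.2039728043259361] * 3 + [1.6094379124341003] * 4,
)
```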

Now I want to build an IndexedRowMatrix from this DataFrame so that I can run the SVD function described here: https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html?highlight=svd#pyspark.mllib.linalg.distributed.IndexedRowMatrix.computeSVD

I tried the following, but it did not work:

mat = RowMatrix(tfidfData.rdd.map(lambda x: x.features))

TypeError: Cannot convert type <class 'pyspark.ml.linalg.SparseVector'> into Vector

I used RowMatrix because constructing it does not require supplying tuples, but I cannot even build a RowMatrix. An IndexedRowMatrix will be harder still.

So how can I build an IndexedRowMatrix from the TF-IDF DataFrame output in PySpark?

Answer 1

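I was able to solve it. As the error indicates, RowMatrix does not accept a pyspark.ml.linalg.SparseVector, so I converted the vector to its pyspark.mllib.linalg counterpart (note ml versus mllib). Below is the snippet that converts the TF-IDF output to a RowMatrix, on which the computeSVD method can then be applied.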

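from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix
# rawFeatures is the vector column in this DataFrame; use whichever column holds your TF-IDF vectors
mat = RowMatrix(df.rdd.map(lambda v: Vectors.dense(v.rawFeatures.toArray())))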

I converted the rows to dense vectors here, but with a few extra lines of code you can convert an ml.linalg.SparseVector to an mllib.linalg.SparseVector instead.

Answer 2

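Apologies for not posting this as a comment; I do not have the required reputation yet. To speed up processing, it helps to create an mllib.linalg.SparseVector rather than a dense vector. It is very simple with the following modification: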

from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix
mat = RowMatrix(df.rdd.map(lambda v: Vectors.fromML(v.rawFeatures)))
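Since the question asks for an IndexedRowMatrix specifically, the same conversion can probably be extended by pairing each row with an index. The following is a minimal sketch, not a tested production recipe: the function name `tfidf_to_indexed_row_matrix` and the default column name `features` (taken from the question's table) are assumptions, and `zipWithIndex` is used here simply as one way to obtain row indices.

```python
def tfidf_to_indexed_row_matrix(df, vector_col="features"):
    """Build an IndexedRowMatrix from a DataFrame column of ml.linalg vectors.

    `vector_col` is an assumed column name; pass the column that holds
    your TF-IDF vectors.
    """
    # Imported inside the function so that defining it does not require
    # a Spark installation.
    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

    rows = (
        df.select(vector_col)
        .rdd
        .zipWithIndex()  # yields (Row, index) pairs
        # pair[0][0] is the ml vector inside the Row; fromML converts it
        # to the mllib type and keeps sparse vectors sparse.
        .map(lambda pair: IndexedRow(pair[1], Vectors.fromML(pair[0][0])))
    )
    return IndexedRowMatrix(rows)
```

With the question's DataFrame this would be called as `svd = tfidf_to_indexed_row_matrix(tfidfData).computeSVD(5, computeU=True)`, after which `svd.s`, `svd.U` and `svd.V` hold the singular values and factor matrices. The value 5 is an arbitrary choice of k, and computeSVD on IndexedRowMatrix requires a PySpark version that exposes it (2.2 or later, as far as I know).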

How to apply SVD on a TF-IDF DataFrame in PySpark
