Converting a dense vector to a sparse vector in PySpark

Asked: 2017-05-25 17:57:58

Tags: pyspark

Is there a built-in way to create a sparse vector from a dense vector in PySpark? The way I'm doing it is as follows:

Vectors.sparse(len(denseVector), [(i, j) for i, j in enumerate(denseVector) if j != 0])

This satisfies the [size, (index, data)] format. It seems a bit hacky. Is there a more efficient way to do it?
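The one-liner above just collects the size and the nonzero (index, value) pairs that `Vectors.sparse` expects. A minimal, Spark-free sketch of the same idea (the helper name here is illustrative, not a PySpark API):

```python
def nonzero_pairs(dense):
    """Return (size, [(index, value), ...]) for the nonzero entries
    of a dense sequence -- the two arguments Vectors.sparse takes."""
    return len(dense), [(i, v) for i, v in enumerate(dense) if v != 0]

size, pairs = nonzero_pairs([1.0, 0.0, 3.0, 0.0])
print(size, pairs)  # 4 [(0, 1.0), (2, 3.0)]
```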

2 answers:

Answer 0 (score: 2)

import scipy.sparse
from pyspark.ml.linalg import Vectors, _convert_to_vector, VectorUDT
from pyspark.sql.functions import udf, col

If you just have one dense vector, this will do it:

def dense_to_sparse(vector):
    return _convert_to_vector(scipy.sparse.csc_matrix(vector.toArray()).T)

dense_to_sparse(densevector)

The trick here is that csc_matrix.shape[1] has to equal 1, so transpose the vector. Have a look at the source of _convert_to_vector: https://people.eecs.berkeley.edu/~jegonzal/pyspark/_modules/pyspark/mllib/linalg.html
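To see why the transpose matters, note that scipy treats a 1-D array as a single row (shape `(1, n)`), while a column vector needs shape `(n, 1)`. A quick Spark-free check:

```python
import numpy as np
import scipy.sparse

arr = np.array([1.0, 0.0, 3.0])
row = scipy.sparse.csc_matrix(arr)  # a 1-D array becomes a 1 x n row matrix
col = row.T                         # transposing gives the n x 1 column shape
print(row.shape, col.shape)         # (1, 3) (3, 1)
```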

The more likely scenario is that you have a DF with a column of dense vectors:

to_sparse = udf(dense_to_sparse, VectorUDT())
DF.withColumn("sparse", to_sparse(col("densevector")))

Answer 1 (score: 1)

I'm not sure whether you are using mllib or ml. Either way, you can convert like this:

import numpy as np
from pyspark.mllib.linalg import Vectors as mllib_vectors
from pyspark.ml.linalg import Vectors as ml_vectors

# Construct dense vectors in mllib and ml
v1 = mllib_vectors.dense([1.0, 1.0, 0, 0, 0])
v2 = ml_vectors.dense([1.0, 1.0, 0, 0, 0])

# Convert ml dense vector to sparse vector
arr2 = v2.toArray()
print('arr2', arr2)
d = {i:arr2[i] for i in np.nonzero(arr2)[0]}
print('d', d)

v4 = ml_vectors.sparse(len(arr2), d)
print('v4: %s' % v4)


# Convert mllib dense vector to sparse vector
arr1 = v1.toArray()
d1 = {int(i): float(arr1[i]) for i in np.nonzero(arr1)[0]}
v6 = mllib_vectors.sparse(len(arr1), d1)
print('v6: %s' % v6)
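The nonzero-dict step in both conversions is plain NumPy and can be checked without Spark. A quick round-trip sketch: build the dict, then expand it back and confirm the original array is recovered.

```python
import numpy as np

# Same extraction as above: map each nonzero index to its value.
arr = np.array([1.0, 1.0, 0.0, 0.0, 0.0])
d = {int(i): float(arr[i]) for i in np.nonzero(arr)[0]}
print(d)  # {0: 1.0, 1: 1.0}

# Rebuilding a dense array from (size, dict) recovers the original,
# which is exactly what the sparse vector stores.
rebuilt = np.zeros(len(arr))
for i, v in d.items():
    rebuilt[i] = v
assert (rebuilt == arr).all()
```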