Explode a SparseVector column into rows containing index and value

Date: 2018-10-16 23:12:02

Tags: apache-spark pyspark

I have a SparseVector produced by an IDF transformation, and it looks like this:

user='1234', idf=SparseVector(174, {0: 0.4709, 5: 0.8967, 7: 0.9625, 8: 0.9814,...})

I want to explode it into one row per stored entry:

|index|rating|user|
|0    |0.4709|1234|
|5    |0.8967|1234|
|7    |0.9625|1234|
|8    |0.9814|1234|
.
.
.

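My goal is to take these (index, value) tuples and run an ALS step on them.

1 answer:

Answer 0: (score: 3)

This task requires a UserDefinedFunction that converts the vector into a map, which can then be exploded: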

from pyspark.sql.functions import udf, explode
from pyspark.ml.linalg import SparseVector, DenseVector

df = spark.createDataFrame([
    ('1234', SparseVector(174, {0: 0.4709, 5: 0.8967, 7: 0.9625, 8: 0.9814}))
]).toDF("user", "idf")

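# Convert a vector into a map of index -> value so that it can be exploded.
@udf("map<long, double>")
def vector_as_map(v):
    if isinstance(v, SparseVector):
        # Only the explicitly stored entries of the sparse vector are kept.
        return dict(zip(v.indices.tolist(), v.values.tolist()))
    elif isinstance(v, DenseVector):
        # A dense vector stores every position, so enumerate all of them.
        return dict(zip(range(len(v)), v.values.tolist()))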

df.select("user", explode(vector_as_map("idf")).alias("index", "rating")).show()

This gives you the expected result (the order of the rows is not guaranteed):

+----+-----+------+
|user|index|rating|
+----+-----+------+
|1234|    0|0.4709|
|1234|    8|0.9814|
|1234|    5|0.8967|
|1234|    7|0.9625|
+----+-----+------+
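
To make it clearer what the UDF and explode do together, here is a minimal plain-Python sketch that runs without Spark. The helper name `explode_sparse` and the tuple layout are hypothetical and only for illustration. The non-zero entries are passed as an ordinary dict, mirroring the map returned by `vector_as_map`, and the rows are sorted by index only to make the output deterministic.

```python
from typing import Dict, Iterator, List, Tuple


def explode_sparse(user: str, nonzeros: Dict[int, float]) -> Iterator[Tuple[str, int, float]]:
    """Yield one (user, index, rating) row per stored entry.

    Hypothetical helper: it imitates explode() applied to a map column,
    but sorts by index so the result is reproducible.
    """
    for index, value in sorted(nonzeros.items()):
        yield (user, index, value)


# The same stored entries as the SparseVector in the question.
rows: List[Tuple[str, int, float]] = list(
    explode_sparse("1234", {0: 0.4709, 5: 0.8967, 7: 0.9625, 8: 0.9814})
)

for row in rows:
    print(row)
```

Each resulting tuple has the (user, index, rating) shape that the question wants to feed into ALS. In Spark the same reshaping is done by the UDF plus `explode` shown above.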