Question

I have a PySpark DataFrame in which one column (features) holds sparse vectors. For example:

+------------------+-----+
|     features     |label|
+------------------+-----+
| (4823,[87],[0.0])|  0.0|
| (4823,[31],[2.0])|  0.0|
|(4823,[159],[0.0])|  1.0|
|  (4823,[1],[7.0])|  0.0|
|(4823,[15],[27.0])|  0.0|
+------------------+-----+

I want to extend the features column and add another feature to it, for example:

+-------------------+-----+
|     features      |label|
+-------------------+-----+
| (4824,[87],[0.0]) |  0.0|
| (4824,[31],[2.0]) |  0.0|
|(4824,[159],[0.0]) |  1.0|
|  (4824,[1],[7.0]) |  0.0|
|(4824,[4824],[7.0])|  0.0|
+-------------------+-----+

Is there a way to do this without unpacking the SparseVector into a dense vector and then repacking it as a sparse vector with the new column?

Answer 1

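The easiest way to add a new column to an existing SparseVector is the VectorAssembler transformer from the ML library. It merges the input columns into a single vector (a DenseVector or a SparseVector, whichever uses less memory). During the merge, VectorAssembler does not convert the vectors to DenseVector (see the source code). It can be used as follows: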

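from pyspark.ml.feature import VectorAssembler

df = ...

# new_col is the numeric column to append to the vector.
# The output column must not already exist in df, so use a new name.
assembler = VectorAssembler(
    inputCols=["features", "new_col"],
    outputCol="features_extended")

output = assembler.transform(df)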

To simply increase the size of a SparseVector without adding any new values, just create a new vector with a larger size:

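from pyspark.ml.linalg import SparseVector, VectorUDT
from pyspark.sql.functions import col, udf

def add_empty_col_(v):
    # Same indices and values, with the declared size grown by one.
    return SparseVector(v.size + 1, v.indices, v.values)

add_empty_col = udf(add_empty_col_, VectorUDT())
df.withColumn("sparse", add_empty_col(col("features")))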
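
The same idea can probably be stretched to append an actual value rather than an empty slot, again without going through a dense vector. The following is only a sketch, not part of the original answer: append_entry and make_append_udf are hypothetical helper names, and it assumes every row of features is a SparseVector and that new_col is a non-null numeric column. The index arithmetic is kept in a plain Python function so it can be checked without a Spark session, and pyspark is imported only when the UDF is built.

```python
def append_entry(size, indices, values, new_value):
    """Return (size, indices, values) for the vector with one extra trailing slot."""
    new_indices = [int(i) for i in indices]
    new_values = [float(x) for x in values]
    if new_value != 0.0:
        # The new slot gets index `size`, the first index past the old vector.
        # A zero is not stored explicitly; only the size grows.
        new_indices.append(size)
        new_values.append(float(new_value))
    return size + 1, new_indices, new_values


def make_append_udf():
    """Wrap append_entry in a Spark UDF (hypothetical helper, needs pyspark)."""
    from pyspark.ml.linalg import SparseVector, VectorUDT
    from pyspark.sql.functions import udf

    def _append(v, x):
        # Assumes v is a SparseVector and x is a non-null number.
        return SparseVector(*append_entry(v.size, v.indices, v.values, x))

    return udf(_append, VectorUDT())
```

Under those assumptions it would be applied as df.withColumn("features_extended", make_append_udf()(col("features"), col("new_col"))). For the last row of the example, append_entry(4823, [15], [27.0], 7.0) gives (4824, [15, 4823], [27.0, 7.0]). A Python UDF is usually slower than VectorAssembler, so this is mainly useful when the assembler does not fit.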
