Question

I have a column of vector type, and each vector holds a single value. I want to extract that value and keep the column as DoubleType.

Example input df:

|testcol|
|[1.3]|
|[1.2]|
|[3.4]|

Desired output df:

|testcol|
|1.3|
|1.2|
|3.4|

My code so far:

remove_vector_func = udf(lambda x: list(x)[0], DoubleType())
ex= ex.withColumn("testcol", remove_vector_func("testcol"))

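This code runs, but whenever I try to show the column it throws this error:

expected zero arguments for construction of ClassDict (for numpy.dtype)

In printSchema() I can see that the column type is correct: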

testcol: double (nullable = true)

Answer 1

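You just need to make sure that the object your lambda returns matches the UDF's declared return type. Here, list(x)[0] yields a NumPy scalar (numpy.float64) rather than a plain Python float, which Spark cannot convert to DoubleType, so cast it with float() before returning it.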
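The type mismatch can be seen without Spark. The following is a minimal sketch that uses a NumPy array as a stand-in for the vector's values, on the assumption that iterating a one-element vector yields NumPy scalars; the variable names are illustrative and not from the original post.

```python
import numpy as np

# Stand-in for the values of a one-element vector (assumption: the
# elements come back as NumPy scalars when the vector is iterated).
values = np.array([1.3])

raw = list(values)[0]      # what the question's lambda returns
converted = float(raw)     # what the corrected lambda returns

print(type(raw).__name__)        # float64
print(type(converted).__name__)  # float
```

The value itself is unchanged by the cast; only the Python type differs, and that is what the UDF's DoubleType return type cares about.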

Code:

from pyspark.ml.feature import VectorAssembler
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# Assumes an active SparkSession named `spark` (as in the pyspark shell)
ex = spark.createDataFrame([[1.3],
                            [1.2],
                            [3.4]
                           ], ["test"])

# Build a one-element vector column to reproduce the question's input
assembler = VectorAssembler(inputCols=["test"], outputCol="testcol")
ex = assembler.transform(ex)
ex.show(5)

# UDF for converting column type from vector to double type
unlist = udf(lambda x: float(list(x)[0]), DoubleType())

ex = ex.withColumn("testcol_new", unlist("testcol"))
ex.show(5)

Output: (posted as an image in the original answer; not reproduced here)
