Question

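I have a Spark data frame where one column is an array of integers. The column is nullable because it comes from a left outer join. I want to convert all null values to an empty array so that I don't have to deal with nulls later.

I thought I could do it like this: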

val myCol = df("myCol")
df.withColumn( "myCol", when(myCol.isNull, Array[Int]()).otherwise(myCol) )

However, this results in the following exception:

java.lang.RuntimeException: Unsupported literal type class [I [I@5ed25612
at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:49)
at org.apache.spark.sql.functions$.lit(functions.scala:89)
at org.apache.spark.sql.functions$.when(functions.scala:778)

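The when function does not support array types. Is there another simple way to convert the null values?

In case it is relevant, here is the schema for this column: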

|-- myCol: array (nullable = true)
|    |-- element: integer (containsNull = false)

Answer 1

You can use a UDF:

import org.apache.spark.sql.functions.udf

val array_ = udf(() => Array.empty[Int])

combined with WHEN or COALESCE:

df.withColumn("myCol", when(myCol.isNull, array_()).otherwise(myCol))
df.withColumn("myCol", coalesce(myCol, array_())).show

In recent versions you can use the array function:

import org.apache.spark.sql.functions.{array, lit}

df.withColumn("foo", array().cast("array<integer>"))

Note that this works only if conversion from string to the desired type is allowed.

Answer 2

With a slight modification to zero323's approach, I was able to do this without using a udf in Spark 2.3.1.

val df = Seq("a" -> Array(1,2,3), "b" -> null, "c" -> Array(7,8,9)).toDF("id","numbers")
df.show
+---+---------+
| id|  numbers|
+---+---------+
|  a|[1, 2, 3]|
|  b|     null|
|  c|[7, 8, 9]|
+---+---------+

val df2 = df.withColumn("numbers", coalesce($"numbers", array()))
df2.show
+---+---------+
| id|  numbers|
+---+---------+
|  a|[1, 2, 3]|
|  b|       []|
|  c|[7, 8, 9]|
+---+---------+
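
A possible variation on this UDF-free approach: since Spark 2.2, org.apache.spark.sql.functions also provides typedLit, which accepts Scala collections, so the empty array can carry its element type directly instead of relying on a cast. The sketch below is only an illustration under assumptions: it uses a local SparkSession, reuses the sample data from this answer, and the object name is made up.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col, typedLit}

// Hypothetical object name, used only for this illustration.
object TypedLitEmptyArray {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough to try this out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("typedLit-empty-array")
      .getOrCreate()
    import spark.implicits._

    // Same sample data as in the answer above.
    val df = Seq("a" -> Array(1, 2, 3), "b" -> null, "c" -> Array(7, 8, 9))
      .toDF("id", "numbers")

    // typedLit builds an array<int> literal from an empty Scala Seq[Int].
    val emptyInts = typedLit(Seq.empty[Int])

    // coalesce keeps the existing array and falls back to the empty one on null.
    val df2 = df.withColumn("numbers", coalesce(col("numbers"), emptyInts))
    df2.printSchema()
    df2.show()

    spark.stop()
  }
}
```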

Answer 3

When the data type you want for the array elements cannot be cast from StringType, there is a UDF-free alternative:

import pyspark.sql.types as T
import pyspark.sql.functions as F

df.withColumn(
    "myCol",
    F.coalesce(
        F.col("myCol"),
        F.from_json(F.lit("[]"), T.ArrayType(T.IntegerType()))
    )
)

You can replace IntegerType() with any data type, including complex ones.
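
For readers following the Scala answers, the same idea might look like the sketch below. It assumes a Spark version whose from_json accepts an ArrayType schema with the element type you need, runs against a local SparkSession, and uses made-up sample data shaped like the column in the question; the object name is hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col, from_json, lit}
import org.apache.spark.sql.types.{ArrayType, IntegerType}

// Hypothetical object name, used only for this illustration.
object FromJsonEmptyArray {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough to try this out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("from-json-empty-array")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data: one row with values, one row with a null array.
    val df = Seq(("a", Array(1, 2, 3)), ("b", null: Array[Int]))
      .toDF("id", "myCol")

    // Parse the JSON literal "[]" into an empty array of the wanted element type.
    // Swap IntegerType for another DataType (for example a StructType) as needed.
    val emptyArray = from_json(lit("[]"), ArrayType(IntegerType))

    // Keep the existing array where present, otherwise use the empty one.
    val result = df.withColumn("myCol", coalesce(col("myCol"), emptyArray))
    result.printSchema()
    result.show()

    spark.stop()
  }
}
```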

Convert null values to empty array in Spark DataFrame

3 answers: