Spark SQL: replace null values with a default value

Asked: 2018-09-22 19:46:38

Tags: scala apache-spark apache-spark-sql

I have the following DataFrame schema:

root
 |-- firstname: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- cities: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- postcode: string (nullable = true)

My DataFrame looks like this:

+---------+--------+-----------------------------------+
|firstname|lastname|cities                             |
+---------+--------+-----------------------------------+
|John     |Doe     |[[New York,A000000], [Warsaw,null]]|
|John     |Smith   |[[Berlin,null]]                    |
|John     |null    |[[Paris,null]]                     |
+---------+--------+-----------------------------------+

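I want to replace all null values with the string "unknown". When I use the na.fill function, only the top-level lastname column is filled, and the nulls inside the cities array are left untouched: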

df.na.fill("unknown").show()

+---------+--------+-----------------------------------+
|firstname|lastname|cities                             |
+---------+--------+-----------------------------------+
|John     |Doe     |[[New York,A000000], [Warsaw,null]]|
|John     |Smith   |[[Berlin,null]]                    |
|John     |unknown |[[Paris,null]]                     |
+---------+--------+-----------------------------------+

How can I replace all null values in the DataFrame, including those inside the nested array?

1 answer:

Answer 0: (score: 3)

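na.fill only fills top-level columns; it does not fill null struct fields inside the elements of an array column. One approach is to patch the array with a UDF, as shown below: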

import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row
// Outside spark-shell, also add `import spark.implicits._` (for toDF and $"...").

case class City(name: String, postcode: String)

val df = Seq(
  ("John", "Doe", Seq(City("New York", "A000000"), City("Warsaw", null))),
  ("John", "Smith", Seq(City("Berlin", null))),
  ("John", null, Seq(City("Paris", null)))
).toDF("firstname", "lastname", "cities")

val defaultStr = "unknown"

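// Returns a UDF that replaces null struct fields in each array element with `default`.
// Assumes the array itself and its elements are non-null.
// Note: the UDF returns tuples, so the resulting struct fields are named _1 and _2
// rather than name and postcode.
def patchNull(default: String) = udf( (s: Seq[Row]) =>
  s.map( r => (r.getAs[String]("name"), r.getAs[String]("postcode")) match {
      case (null, null) => (default, default)  // both fields are null
      case (c, null) => (c, default)           // only postcode is null
      case (null, p) => (default, p)           // only name is null
      case e => e                              // nothing to patch
    }
  ) )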

df.
  withColumn( "cities", patchNull(defaultStr)($"cities") ).
  na.fill(defaultStr).
  show(false)
// +---------+--------+--------------------------------------+
// |firstname|lastname|cities                                |
// +---------+--------+--------------------------------------+
// |John     |Doe     |[[New York,A000000], [Warsaw,unknown]]|
// |John     |Smith   |[[Berlin,unknown]]                    |
// |John     |unknown |[[Paris,unknown]]                     |
// +---------+--------+--------------------------------------+
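
As an alternative to the UDF, the same patching can be sketched with Spark SQL's built-in higher-order function transform, which is available from Spark 2.4 onward (so it may not have been an option when this answer was written). The sketch below rebuilds each struct with named_struct, which keeps the field names name and postcode, and wraps each field in coalesce. The object FillNestedNulls and the helper fillCities are hypothetical names introduced for illustration, and the code assumes the default string contains no single quote, since it is spliced into a SQL expression.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.expr

// Mirrors the element type of the `cities` column in the question.
case class City(name: String, postcode: String)

object FillNestedNulls {
  // Hypothetical helper (not part of the original answer).
  // Rebuilds every struct in `cities`, replacing null fields with `default`,
  // then fills the remaining top-level string columns with na.fill.
  // Assumes Spark 2.4+ and that `default` contains no single quote.
  def fillCities(df: DataFrame, default: String): DataFrame =
    df.withColumn("cities", expr(
      s"""transform(cities, c -> named_struct(
         |  'name', coalesce(c.name, '$default'),
         |  'postcode', coalesce(c.postcode, '$default')))""".stripMargin))
      .na.fill(default)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("fill-nested-nulls")
      .getOrCreate()
    import spark.implicits._

    // Same sample data as in the answer above.
    val df = Seq(
      ("John", "Doe", Seq(City("New York", "A000000"), City("Warsaw", null))),
      ("John", "Smith", Seq(City("Berlin", null))),
      ("John", null, Seq(City("Paris", null)))
    ).toDF("firstname", "lastname", "cities")

    fillCities(df, "unknown").show(false)
    spark.stop()
  }
}
```

Because the expression is evaluated by Spark itself rather than by a Scala closure, it avoids converting each row to and from Row objects, and a null cities array simply stays null instead of raising an exception.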