I have the following DataFrame schema:
root
 |-- firstname: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- cities: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- postcode: string (nullable = true)
My DataFrame looks like this:
+---------+--------+-----------------------------------+
|firstname|lastname|cities                             |
+---------+--------+-----------------------------------+
|John     |Doe     |[[New York,A000000], [Warsaw,null]]|
|John     |Smith   |[[Berlin,null]]                    |
|John     |null    |[[Paris,null]]                     |
+---------+--------+-----------------------------------+
I want to replace all null values with the string "unknown". When I use the na.fill function, I get the following DataFrame:
df.na.fill("unknown").show()
+---------+--------+-----------------------------------+
|firstname|lastname|cities                             |
+---------+--------+-----------------------------------+
|John     |Doe     |[[New York,A000000], [Warsaw,null]]|
|John     |Smith   |[[Berlin,null]]                    |
|John     |unknown |[[Paris,null]]                     |
+---------+--------+-----------------------------------+
How can I replace all null values in the DataFrame, including those inside the nested array?
Answer 0 (score: 3)
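na.fill does not fill null values inside the struct fields of an array column; it only touches top-level columns. One approach is to use a UDF like the following: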
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row
// Needed for toDF and $ outside spark-shell; assumes a SparkSession named `spark`.
import spark.implicits._

case class City(name: String, postcode: String)

val df = Seq(
  ("John", "Doe", Seq(City("New York", "A000000"), City("Warsaw", null))),
  ("John", "Smith", Seq(City("Berlin", null))),
  ("John", null, Seq(City("Paris", null)))
).toDF("firstname", "lastname", "cities")

val defaultStr = "unknown"
def patchNull(default: String) = udf( (s: Seq[Row]) =>
  // A null array is passed through unchanged. Each element is rebuilt as a
  // City so the struct keeps its field names (a tuple would yield _1, _2).
  if (s == null) null
  else s.map( r => City(
    Option(r.getAs[String]("name")).getOrElse(default),
    Option(r.getAs[String]("postcode")).getOrElse(default)
  ) )
)
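
// Patch the nested struct fields with the UDF, then let na.fill handle
// the top-level string columns.
df.
  withColumn( "cities", patchNull(defaultStr)($"cities") ).
  na.fill(defaultStr).
  show(false)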
// +---------+--------+--------------------------------------+
// |firstname|lastname|cities                                |
// +---------+--------+--------------------------------------+
// |John     |Doe     |[[New York,A000000], [Warsaw,unknown]]|
// |John     |Smith   |[[Berlin,unknown]]                    |
// |John     |unknown |[[Paris,unknown]]                     |
// +---------+--------+--------------------------------------+
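
If you are on Spark 3.0 or later, a UDF may not be needed at all. The sketch below is an alternative to the approach above, not part of the original answer: it uses the built-in `transform` higher-order function together with `coalesce` to rebuild each struct in the array. The local `SparkSession` setup and the name `result` are assumptions made for illustration only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, lit, struct, transform}

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("fill-nested-nulls")
  .getOrCreate()
import spark.implicits._

case class City(name: String, postcode: String)

val df = Seq(
  ("John", "Doe", Seq(City("New York", "A000000"), City("Warsaw", null))),
  ("John", "Smith", Seq(City("Berlin", null))),
  ("John", null, Seq(City("Paris", null)))
).toDF("firstname", "lastname", "cities")

val defaultStr = "unknown"

// Rebuild every struct in the array, replacing null fields with the default,
// then let na.fill take care of the top-level string columns.
val result = df
  .withColumn("cities", transform($"cities", c =>
    struct(
      coalesce(c.getField("name"), lit(defaultStr)).as("name"),
      coalesce(c.getField("postcode"), lit(defaultStr)).as("postcode")
    )))
  .na.fill(defaultStr)

result.show(false)
```

Because this stays within built-in column expressions, Spark can optimize it like any other expression, and the struct field names are set explicitly with `.as(...)`. One caveat worth checking against your data: a null element inside the array would come back as a struct with both fields set to the default rather than staying null.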