Question

Getting this null error in Spark Dataset.filter

Input CSV:

name,age,stat
abc,22,m
xyz,,s

Working code:

case class Person(name: String, age: Long, stat: String)

val peopleDS = spark.read.option("inferSchema","true")
  .option("header", "true").option("delimiter", ",")
  .csv("./people.csv").as[Person]
peopleDS.show()
peopleDS.createOrReplaceTempView("people")
spark.sql("select * from people where age > 30").show()

Failing code (adding the following lines returns an error):

val filteredDS = peopleDS.filter(_.age > 30)
filteredDS.show()

It returns a null error:

java.lang.RuntimeException: Null value appeared in non-nullable field:
- field (class: "scala.Long", name: "age")
- root class: "com.gcp.model.Person"
If the schema is inferred from a Scala tuple/case class, or a Java bean, please try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer instead of int/scala.Int).

Answer 1

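The exception you get should explain everything, but let's go step by step:

When you load data using the csv data source, all fields are marked as nullable: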

val path: String = ???

val peopleDF = spark.read
  .option("inferSchema","true")
  .option("header", "true")
  .option("delimiter", ",")
  .csv(path)

peopleDF.printSchema

root
|-- name: string (nullable = true)
|-- age: integer (nullable = true)
|-- stat: string (nullable = true)

Missing fields are represented as SQL NULL:

peopleDF.where($"age".isNull).show

+----+----+----+
|name| age|stat|
+----+----+----+
| xyz|null|   s|
+----+----+----+

Next, you convert Dataset[Row] to Dataset[Person], which uses Long to encode the age field. A Scala Long cannot be null. Because the input schema is nullable, the output schema stays nullable despite that:
```
val peopleDS = peopleDF.as[Person]

peopleDS.printSchema
```
```
root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- stat: string (nullable = true)
```
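Note that as[T] doesn't affect the schema at all.
When you query a Dataset using SQL (on a registered table) or the DataFrame API, Spark won't deserialize the objects. Since the schema is still nullable, we can execute: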
```
peopleDS.where($"age" > 30).show
```
```
+----+---+----+
|name|age|stat|
+----+---+----+
+----+---+----+
```
without any issues. This is just plain SQL logic, and NULL is a valid value.
When we use the statically typed Dataset API:
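To see the SQL side in isolation, the sketch below evaluates the comparison against a NULL directly. It assumes a throwaway local SparkSession (the `local[*]` master and the app name are illustrative choices, not from the original post); in spark-shell the existing `spark` value can be used instead.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session created just for this sketch.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("null-comparison")
  .getOrCreate()

// Comparing NULL with a number yields NULL rather than true or false.
val comparison = spark.sql("SELECT CAST(NULL AS INT) > 30 AS result")
val isNull = comparison.head().isNullAt(0)

println(isNull)
```

A WHERE clause keeps only rows whose predicate is true, so the row with the missing age is simply dropped and no error is raised.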
```
peopleDS.filter(_.age > 30)
```
Spark has to deserialize the objects. Because a Long cannot be null (SQL NULL), it fails with the exception you've seen.

If it weren't for that check, you'd get an NPE instead.
The correct statically typed representation of your data should use Option types:
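As a rough illustration of where such an NPE would come from, here is a plain JVM sketch with no Spark involved. The value name `boxed` is hypothetical; the point is only that a boxed `java.lang.Long` can hold null, while turning it into a primitive cannot succeed.

```scala
import scala.util.Try

// A boxed Long can be null, much like a nullable column value.
val boxed: java.lang.Long = null

// Unboxing calls longValue() on the null reference, which throws.
val result = Try(boxed.longValue())

println(result.isFailure)
```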
```
case class Person(name: String, age: Option[Long], stat: String)
```
with an adjusted filter function:
```
peopleDS.filter(_.age.map(_ > 30).getOrElse(false))
```
```
+----+---+----+
|name|age|stat|
+----+---+----+
+----+---+----+
```
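The same predicate can be written a little more compactly with `Option.exists`. The sketch below uses a plain Scala `Seq` as a stand-in for the Dataset (an assumption made so that it runs without Spark), with sample rows mirroring the CSV from the question.

```scala
case class Person(name: String, age: Option[Long], stat: String)

// In-memory stand-in for the CSV rows; the missing age becomes None.
val people = Seq(
  Person("abc", Some(22L), "m"),
  Person("xyz", None, "s")
)

// The predicate from the answer and a shorter equivalent.
val viaMap    = people.filter(_.age.map(_ > 30).getOrElse(false))
val viaExists = people.filter(_.age.exists(_ > 30))

// A lower threshold, to show that rows with a known age still match.
val over20 = people.filter(_.age.exists(_ > 20))

println(over20.map(_.name))
```

Both forms treat a missing age as "does not match", which mirrors what the SQL query does with NULL.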
If you prefer, you can use pattern matching:
```
peopleDS.filter(_.age match {
  case Some(age) => age > 30
  case _         => false     // or case None => false
})
```
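Note that you don't have to (but it is recommended anyway) use optional types for name and stat. Because a Scala String is just a Java String, it can be null. Of course, if you go with this approach, you have to explicitly check whether the accessed values are null.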
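One way to do that check is to wrap the field in `Option(...)`, which turns null into `None`. This is a minimal sketch on a plain Scala `Seq` rather than a Dataset, and the row with a null name is made up for illustration.

```scala
case class Person(name: String, age: Option[Long], stat: String)

// Hypothetical rows; the second one has a null name.
val people = Seq(
  Person("abc", Some(22L), "m"),
  Person(null, Some(40L), "s")
)

// Option(p.name) is None for null, so startsWith is never called on null.
val startsWithA = people.filter(p => Option(p.name).exists(_.startsWith("a")))

println(startsWithA.map(_.name))
```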
