Spark 2.4 CSV load issue with the "nullValue" option

Date: 2019-06-25 10:52:43

Tags: scala apache-spark databricks spark-csv

We were previously on Spark 2.3 and are now on 2.4:

Spark version 2.4.0
Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)

We have code running in production that converts csv files to Parquet format. One of the options we set when loading the csv is option("nullValue", null). Something is wrong with the way it works in Spark 2.4.
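
For context, a minimal sketch of the kind of job we run (the paths and app name here are placeholders, not the actual production code):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-to-parquet").getOrCreate()

// Read the csv with the same options we use in production;
// option("nullValue", null) is the one whose behaviour changed in 2.4.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("treatEmptyValuesAsNulls", "true")
  .option("nullValue", null)
  .csv("file:///tmp/input/*.csv")    // placeholder input path

// Write the result out in Parquet format.
df.write.mode("overwrite").parquet("file:///tmp/output")    // placeholder output path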

Here is an example that demonstrates the problem.

  1. Let's create the following /tmp/test.csv file:
C0,C1,C2,C3,C4,C5
1,"1234",0.00,"","D",0.00
2,"",0.00,"","D",0.00

  2. Now, if we load it in spark-shell:
scala> val data1 = spark.read.option("header", "true").option("inferSchema", "true").option("treatEmptyValuesAsNulls","true").option("nullValue", null).csv("file:///tmp/test.csv")

we get an empty row:
scala> data1.show
+----+----+----+----+----+----+
| C0| C1| C2| C3| C4| C5|
+----+----+----+----+----+----+
| 1|1234| 0.0| | D| 0.0|
|null|null|null|null|null|null|
+----+----+----+----+----+----+

  3. If we additionally modify the csv slightly (replacing the empty string with "1" in the last row):
C0,C1,C2,C3,C4,C5
1,"1234",0.00,"","D",0.00
2,"",0.00,"1","D",0.00

The result is even worse:

scala> val data2 = spark.read.option("header", "true").option("inferSchema", "true").option("treatEmptyValuesAsNulls","true").option("nullValue", null).csv("file:///tmp/test.csv")

scala> data2.show
+----+----+----+----+----+----+
| C0| C1| C2| C3| C4| C5|
+----+----+----+----+----+----+
|null|null|null|null|null|null|
|null|null|null|null|null|null|
+----+----+----+----+----+----+

Is this a bug in the new Spark 2.4.0 release? Has anybody run into a similar problem?

1 Answer:

Answer 0 (score: 1)

The Spark option emptyValue solved the problem.
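
A minimal sketch of how the option could be applied on the read path (the exact value passed to emptyValue is an assumption; the original answer does not show code):

// Spark 2.4: set "emptyValue" explicitly so that quoted empty strings ("")
// are kept as empty strings instead of being folded into nulls.
val fixed = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("emptyValue", "")          // assumed setting
  .option("nullValue", null)
  .csv("file:///tmp/test.csv")

fixed.show()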
