Question

I am running some tests in Spark. To do that, I load a CSV file to compare the results against.

My reference file

;;NULL;2017-03-21
;;NULL;2017-03-21
;;NULL;2017-03-21

This is how I load the file

spark.read.schema(Table.schema)
      .format("com.databricks.spark.csv")
      .option("delimiter", ";")
      .option("nullValue", "NULL")
      .load(pathTable)
      .createTempView(param.TABLE)

This is my schema

  val fields = Seq(
    StructField("balance", StringType, nullable = true),
    StructField("status", StringType, nullable = true),
    StructField("status_date", DateType, nullable = true),
    StructField("time_key", StringType, nullable = true)
  )
  val schema = StructType(fields)

For some reason, balance and status are loaded as NULL when they should be empty strings.

+-------+------+-----------+----------+
|balance|status|status_date|  time_key|
+-------+------+-----------+----------+
|   null|  null|       null|2017-03-21|
|   null|  null|       null|2017-03-21|
|   null|  null|       null|2017-03-21|
+-------+------+-----------+----------+

Why does this happen, and how can I get them to show up as empty strings?

Answer 1

It looks like an issue was filed for this, and it was resolved in Spark 2.4:

SPARK-17916

Answer 2

I don't know why this happens, but adding

.na.fill("", Seq("balance", "status"))

helped replace the nulls with empty strings.
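
For context, here is a minimal sketch of that call on a small in-memory DataFrame. The object name NaFillSketch, the helper names, and the sample row are made up for illustration; only the column names and the na.fill call come from this thread. It assumes Spark SQL is on the classpath and runs in local mode.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object NaFillSketch {
  // Hypothetical stand-in for the loaded CSV: both string columns arrive as null.
  def sampleFrame(spark: SparkSession): DataFrame = {
    val schema = StructType(Seq(
      StructField("balance", StringType, nullable = true),
      StructField("status", StringType, nullable = true),
      StructField("time_key", StringType, nullable = true)
    ))
    val rows = java.util.Arrays.asList(Row(null, null, "2017-03-21"))
    spark.createDataFrame(rows, schema)
  }

  // Replace nulls with "" in the two named string columns only.
  def fillBlanks(df: DataFrame): DataFrame =
    df.na.fill("", Seq("balance", "status"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("na-fill-sketch")
      .getOrCreate()
    try fillBlanks(sampleFrame(spark)).show()
    finally spark.stop()
  }
}
```

Because the fill value is a string, only string columns are affected, so a DateType column such as status_date would keep its nulls.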

Answer 3

This appears to be normal behavior in Spark. See the article "Spark’s Treatment of Empty Strings and Blank Values in CSV Files".

To work around it, you can replace the nulls in the string columns with empty strings, like this:

df.withColumn("balance", coalesce(col("balance"), lit("")))
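
If more than one column needs this treatment, the same coalesce idea can be folded over every string column in the schema. The sketch below is one possible way to do that, not the only one; the object name CoalesceSketch and the helper blankOutNulls are hypothetical, and it assumes column names without dots or other characters that would need quoting.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{coalesce, col, lit}
import org.apache.spark.sql.types.StringType

object CoalesceSketch {
  // Replace null with "" in every StringType column.
  // Columns of other types (for example DateType) are left untouched.
  def blankOutNulls(df: DataFrame): DataFrame =
    df.schema.fields
      .filter(_.dataType == StringType)
      .foldLeft(df) { (acc, field) =>
        acc.withColumn(field.name, coalesce(col(field.name), lit("")))
      }
}
```

In the pipeline from the question, a helper like this would have to run on the DataFrame returned by .load(pathTable), before .createTempView(param.TABLE) registers the view.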

Empty strings show up as null when loading a tempView
