Question

I have the following query:

val query = "select 'a' as col_1, ' ' as col_2, col_3,col_4 from mytable"
val df = sqlContext.sql(query)

If I display the DataFrame, it looks like this:

col_1|col_2|col_3|col_4
a| |test|test
a| |testa|testa
a| |testb|testb

This is as expected. However, if I write this DataFrame to disk

df.write
.option("sep",",")
.csv(file)

the file contains the following:

a,\"\",test,test
a,\"\",testa,testb
a,\"\",testb,testb

The second column is wrong: it should be a single space, with no quotes or anything else.

How can I avoid this? I want the file output to be:

a, ,test,test
a, ,testa,testb
a, ,testb,testb

Runnable test code:

import org.apache.spark.sql.SaveMode  // needed for SaveMode.Overwrite

val tempview = "temptest"
val path = "/mnt/test/"
var df = Seq(
  (8, "bat"),
  (64, "mouse"),
  (-27, "horse")
).toDF("number", "word")

df.createOrReplaceTempView(tempview)

// Add a constant column and a column holding a single space
df = sqlContext.sql("select 'a' as first, ' ' as second, number, word from temptest")

df.write.mode(SaveMode.Overwrite).option("sep", ",").csv(path)

// Print the beginning of the last file listed in the output directory
val l = dbutils.fs.ls(path)
val file = l(l.size - 1)
val output = dbutils.fs.head(path + file.name)
println(output)

Output: a,\"\",-27,horse

Expected output: a, ,-27,horse

Answer 1

You should be able to work around this by not saving as CSV. Try saving in a more robust (that is, typed) data format, such as Parquet or ORC.
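As a rough illustration of that suggestion, the sketch below writes a DataFrame shaped like the one in the question to Parquet and reads it back. The object name `ParquetSketch`, the helper names, and the temporary output directory are assumptions made for this example, and it runs against a local `SparkSession` rather than the asker's cluster.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
import org.apache.spark.sql.functions.lit

object ParquetSketch {
  // Builds a DataFrame shaped like the one in the question:
  // a constant column, a single-space column, then the data columns.
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq((8, "bat"), (64, "mouse"), (-27, "horse"))
      .toDF("number", "word")
      .select(lit("a").as("first"), lit(" ").as("second"), $"number", $"word")
  }

  // Writes df as Parquet into a fresh temporary directory (hypothetical
  // location, chosen only for this sketch) and reads it back.
  def roundTrip(spark: SparkSession, df: DataFrame): DataFrame = {
    val path = Files.createTempDirectory("parquet-sketch").resolve("out").toString
    df.write.mode(SaveMode.Overwrite).parquet(path)
    spark.read.parquet(path)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("parquet-sketch")
      .getOrCreate()
    roundTrip(spark, sample(spark)).show()
    spark.stop()
  }
}
```

Parquet stores each string value together with its column type, so there is no text-level quoting or trimming step for the single space to pass through. Replacing `.parquet(path)` with `.orc(path)` on both the writer and the reader should give the ORC variant. This only helps if whatever consumes the file can read a columnar format instead of CSV.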

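I don't know exactly how the Spark CSV serializer handles strings, but it is quite likely that anything containing whitespace gets wrapped in double quotes.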

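In my experience, the CSV library in Spark is very good but not very configurable. You get what you get. Because of that lack of configurability, I have had to load data as an RDD and parse it myself. CSV is not the best data format for Spark.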
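If the output really has to be delimited text, the same RDD fallback can be applied on the write side. The following is a minimal sketch, not the answerer's actual code: the object `ManualCsvSketch` and its helpers are hypothetical names, and it deliberately bypasses the CSV writer by joining each row's fields by hand.

```scala
import org.apache.spark.sql.{DataFrame, Row}

object ManualCsvSketch {
  // Joins the fields of one row with the separator.
  // No quoting, escaping, or trimming is applied, so a single space stays a single space.
  def toLine(row: Row, sep: String = ","): String = row.mkString(sep)

  // Writes one text line per row. saveAsTextFile fails if the target
  // directory already exists, so the caller must pass a fresh path.
  def writeLines(df: DataFrame, path: String): Unit =
    df.rdd.map(row => toLine(row)).saveAsTextFile(path)
}
```

For a row like the one in the question, `ManualCsvSketch.toLine(Row("a", " ", -27, "horse"))` yields `a, ,-27,horse`. The trade-off is that nothing is escaped: a value containing the separator or a newline would corrupt the file, and nulls are written as the literal text `null`, so this only suits data known to be free of such values.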
