Question

Here is my source data:

+-----+----------+
|Name |Date      |
+-----+----------+
|Azure|2018-07-26|
|AWS  |2018-07-27|
|GCP  |2018-07-28|
|GCP  |2018-07-28|
+-----+----------+

I partitioned the data by the Date column when writing it out:

udl_file_df_read.write.format("csv").partitionBy("Date").mode("append").save(outputPath)

val events = spark.read.format("com.databricks.spark.csv").option("inferSchema","true").load(outputPath)

events.show()

The output column names are (_c0, Date). I am not sure why the original column name is missing. How can I keep the column names?

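This is not a duplicate question, for the following reason: here every column other than the partition column is renamed to _c0, and specifying base-path in option has no effect.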

Answer 1

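You get column names like _c0 because the CSV format, as used in the question, does not preserve column names.

You can try writing with

udl_file_df_read
  .write
  .option("header", "true")
  ...

and similarly read with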

spark
  .read
  .option("header", "true")
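
To make the round trip concrete, here is a minimal sketch of both cases: writing without a header (which reproduces the _c0 name) and writing with one. It assumes a local Spark 2.x or later session. The sample DataFrame, the temporary directory, and the names `noHeaderPath` and `withHeaderPath` are made up for illustration and are not from the question.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Local session for illustration only; in spark-shell or a notebook `spark` already exists.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("csv-header-roundtrip")
  .getOrCreate()
import spark.implicits._

// Sample rows mirroring the data in the question.
val df = Seq(
  ("Azure", "2018-07-26"),
  ("AWS", "2018-07-27"),
  ("GCP", "2018-07-28"),
  ("GCP", "2018-07-28")
).toDF("Name", "Date")

// Hypothetical scratch locations under a temp directory.
val base = Files.createTempDirectory("csv_header_demo")
val noHeaderPath = base.resolve("no_header").toString
val withHeaderPath = base.resolve("with_header").toString

// Case 1: no header is written, so the data files carry no column names.
df.write.format("csv").partitionBy("Date").mode("append").save(noHeaderPath)
val withoutNames = spark.read.format("csv").load(noHeaderPath)

// Case 2: the header is written and then read back.
df.write.format("csv").option("header", "true")
  .partitionBy("Date").mode("append").save(withHeaderPath)
val withNames = spark.read.format("csv").option("header", "true").load(withHeaderPath)

// The partition column keeps its name in both cases because it comes from
// the directory names (Date=2018-07-26, ...), not from the file contents.
println(withoutNames.columns.mkString(", "))
println(withNames.columns.mkString(", "))
```

The Date column survives either way because Spark recovers it from the partition directory names. Only the columns stored inside the CSV files depend on the header option.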

Answer 2

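I was able to preserve the schema by setting the header option to true when writing the files. I previously thought this option could only be used when reading data.

udl_file_df_read.write.option("header", "true").format("csv").partitionBy("Date").mode("append").save(outputPath)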

Why are columns renamed to c0, c1 in Spark partitioned data?
