Write only the header CSV record from a Spark Scala DataFrame

Date: 2018-06-07 14:02:34

Tags: scala apache-spark apache-spark-sql

My requirement is to write only the header record of a CSV using a Spark Scala DataFrame. Can anyone help me?

val OHead1 = "/xxxxx/xxxx/xxxx/xxx/OHead1/" // output directory (path elided in the original)
val sc = sparkFile.sparkContext // reuse the SparkContext from the SparkSession
val outDF = csvDF.select("col_01", "col_02", "col_03").schema // StructType of the selected columns
sc.parallelize(Seq(outDF.fieldNames.mkString("\t"))).coalesce(1).saveAsTextFile(s"$OHead1") // write the tab-delimited header as a single text file

The above code works and creates the header in the CSV with a tab delimiter. Since I am using a SparkSession, I obtain the SparkContext from it in the second line. csvDF is the DataFrame created before these statements; note that outDF here actually holds its schema (a StructType), not a DataFrame.
Two things are still outstanding; can one of you help me?

1. The working code above does not overwrite the output files, so I have to delete them manually before every run. I could not find an overwrite option; can you help me? (A possible workaround is sketched after this list.)
2. Since I am calling select and then schema, will that be considered an action and start another lineage for this statement? If so, this would degrade performance.
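
One way to get overwrite behavior for the header file, sketched under the assumption that spark is the active SparkSession and csvDF/OHead1 are the objects from the code above: route the single header line through the DataFrame writer, which supports SaveMode.Overwrite, instead of RDD.saveAsTextFile, which does not.

import org.apache.spark.sql.SaveMode

// Minimal sketch, assuming `spark`, `csvDF`, and `OHead1` from the code above.
import spark.implicits._
val header = csvDF.select("col_01", "col_02", "col_03").schema.fieldNames.mkString("\t")
Seq(header).toDS()            // one-row Dataset[String] holding the header line
  .coalesce(1)                // single output part file
  .write
  .mode(SaveMode.Overwrite)   // replaces the target directory on each run
  .text(OHead1)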

3 Answers:

Answer 0 (score: 1)

If you only need to output the header, you can use the following code:

df.schema.fieldNames.reduce(_ + "," + _) // joins the column names with commas

This produces a single CSV line containing the column names.
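
A short usage sketch (the input path is illustrative, not from the original post), using mkString, which produces the same result but does not throw on an empty schema:

val df = spark.read.option("header", "true").csv("/path/to/input.csv") // hypothetical input
val headerLine = df.schema.fieldNames.mkString(",")
println(headerLine) // e.g. col_01,col_02,col_03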

Answer 1 (score: 0)

I have a solution for this situation: define the columns in a configuration file and write those columns to the output file. Here is the snippet.

val Header = prop.getProperty("OUT_HEADER_COLUMNS").replaceAll("\"","").replaceAll(",","\t") // strip quotes, turn commas into tabs
scala.tools.nsc.io.File(s"$HeadOPath").writeAll(s"$Header") // writes to the driver's local filesystem (scala-compiler API), not HDFS
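
For completeness, a hedged sketch of the setup this snippet assumes: prop as a loaded java.util.Properties and HeadOPath read from it. The property key and file path here are hypothetical:

import java.io.FileInputStream
import java.util.Properties

val prop = new Properties()
prop.load(new FileInputStream("/path/to/job.properties")) // hypothetical path
val HeadOPath = prop.getProperty("OUT_HEADER_PATH")       // hypothetical key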

Answer 2 (score: 0)

I tested the code below and it did not measurably affect performance. That matches expectations: calling .schema only inspects the analyzed logical plan, so it is not an action and does not trigger a job or start a new lineage.

val OHead1 = "/xxxxx/xxxx/xxxx/xxx/OHead1/" // output directory (path elided in the original)
val sc = sparkFile.sparkContext // reuse the SparkContext from the SparkSession
val outDF = csvDF.select("col_01", "col_02", "col_03").schema // StructType of the selected columns
sc.parallelize(Seq(outDF.fieldNames.mkString("\t"))).coalesce(1).saveAsTextFile(s"$OHead1") // write the tab-delimited header as a single text file
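
The overwrite issue raised in the question still applies here, because RDD.saveAsTextFile has no overwrite mode. A common workaround (an assumption on my part, not part of this answer) is to delete the target directory through the Hadoop FileSystem API before writing:

import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch: remove the previous output so saveAsTextFile can recreate it.
val fs = FileSystem.get(sc.hadoopConfiguration)
val target = new Path(OHead1)
if (fs.exists(target)) fs.delete(target, true) // recursive delete; use with care
sc.parallelize(Seq(outDF.fieldNames.mkString("\t"))).coalesce(1).saveAsTextFile(OHead1)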