How to include a Spark DataFrame's partition columns, along with the DataFrame header names, in the stored part files

Asked: 2019-05-18 15:39:15

Tags: apache-spark dataframe

I need the part files to contain the DataFrame's partition columns as well as the DataFrame's header names.

Input file (Sample.csv):

ID|VERSION|RECEIPTREFID|TOTAL|STATUS|TXNSTARTTIME|LAST_MODIFIED_DATE
207|6.1.2.5|01072018|174.25|COMPLETED|2018-07-01 20:26:46.0|2018-07-01 21:10:39.0
144|6.1.2.5|072018|180.25|NOT COMPLETED|2019-02-01 21:18:08.0|2019-02-01 21:42:38.0

The data is partitioned on year, month, and date.

Code:

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions._
import spark.implicits._ // enables the 'txnstarttime symbol syntax

val dfCsv = spark.read.option("delimiter", "|").option("header", "true").csv("/HdfsPath/Sample.csv")
// Derive the partition columns from TXNSTARTTIME
val dfAddcolumns = dfCsv.withColumn("year", year(to_date('txnstarttime))).withColumn("month", month(to_date('txnstarttime))).withColumn("date", to_date('txnstarttime))
val pk = "id,txnstarttime"
// Build the row key: concatenate the key columns, then strip separator characters
val dfTemp = dfAddcolumns.withColumn("pk", concat_ws("",pk.split(",").map(c => col(c)): _*))
val dfRowkey = dfTemp.withColumn("pk", regexp_replace(col("pk"), "[-:.,/ ]", ""))
dfRowkey.withColumn("year1",col("year")).withColumn("month1",col("month")).withColumn("date1",col("date")).write.mode(SaveMode.Overwrite).partitionBy("year", "month", "date").option("header","True").csv("/HdfsPath/output/")

Output:

Path: /HDFS Path/year=2018/month=7/date=2018-07-01

ID,VERSION,RECEIPTREFID,TOTAL,STATUS,TXNSTARTTIME,LAST_MODIFIED_DATE,pk,year1,month1,date1
207,6.1.2.5,01072018,174.25,COMPLETED,2018-07-01 20:26:46.0,2018-07-01 21:10:39.0,207201807012026460,2018,7,2018-07-01

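Here I need these columns to be named year, month, and date (not year1, month1, and date1).

If I change the code as shown below, the partition columns are not written to the part files at all: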

dfRowkey.withColumn("year",col("year")).withColumn("month",col("month")).withColumn("date",col("date")).write.mode(SaveMode.Overwrite).partitionBy("year", "month", "date").option("header","True").csv("/HdfsPath/output/")

The output is then:

ID,VERSION,RECEIPTREFID,TOTAL,STATUS,TXNSTARTTIME,LAST_MODIFIED_DATE,pk
207,6.1.2.5,01072018,174.25,COMPLETED,2018-07-01 20:26:46.0,2018-07-01 21:10:39.0,207201807012026460
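
A note on why this probably happens: `partitionBy` stores the partition column values in the directory names (`year=2018/month=7/...`) and leaves them out of the data files, and `withColumn("year", col("year"))` simply replaces the column with itself, so the DataFrame is unchanged. The values are not lost. Spark's partition discovery restores them as columns when the base directory is read back. The sketch below illustrates that, assuming the output was written as in the question; `ReadPartitionedOutput` and `readBack` are hypothetical names introduced only for this example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ReadPartitionedOutput {
  /** Reads a partitioned CSV directory from its base path.
    * Partition discovery appends the columns encoded in the
    * directory names (year, month, date) after the data columns. */
  def readBack(spark: SparkSession, basePath: String): DataFrame =
    spark.read
      .option("header", "true")
      .csv(basePath)
}
```

Calling `ReadPartitionedOutput.readBack(spark, "/HdfsPath/output/").printSchema()` should list year, month, and date alongside the data columns, even though the part files themselves do not contain them.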

Expected output:

Path (directory structure): /HDFS Path/year=2018/month=7/date=2018-07-01

The part file should contain:

ID,VERSION,RECEIPTREFID,TOTAL,STATUS,TXNSTARTTIME,LAST_MODIFIED_DATE,pk,year,month,date
207,6.1.2.5,01072018,174.25,COMPLETED,2018-07-01 20:26:46.0,2018-07-01 21:10:39.0,207201807012026460,2018,7,2018-07-01
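
One possible way to get both the `year=.../month=.../date=...` directory layout and the same-named columns inside the part files is to skip `partitionBy` and write each partition to an explicitly built path. This is only a sketch under stated assumptions: the set of distinct (year, month, date) values is small enough to collect to the driver, none of the three columns is null, and `KeepPartitionColumns` and `writeKeepingPartitionColumns` are hypothetical names, not part of Spark.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}
import org.apache.spark.sql.functions.col

object KeepPartitionColumns {
  /** Writes one CSV directory per (year, month, date) combination.
    * Because partitionBy is not used, the three columns stay in the
    * part files; the directory names are built by hand instead. */
  def writeKeepingPartitionColumns(df: DataFrame, basePath: String): Unit = {
    // Assumption: few distinct combinations, so collecting them is cheap.
    val partitions = df.select("year", "month", "date").distinct().collect()

    partitions.foreach { row =>
      val (y, m, d) = (row.get(0), row.get(1), row.get(2))

      df.filter(col("year") === y && col("month") === m && col("date") === d)
        .write
        .mode(SaveMode.Overwrite) // overwrites only this leaf directory
        .option("header", "true")
        .csv(s"$basePath/year=$y/month=$m/date=$d")
    }
  }
}
```

With the question's DataFrame this would be called as `KeepPartitionColumns.writeKeepingPartitionColumns(dfRowkey, "/HdfsPath/output")`. Two differences from the original write are worth keeping in mind: `SaveMode.Overwrite` here replaces individual leaf directories rather than the whole output directory, and the input is scanned once per partition, so caching `df` beforehand may help.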

0 answers:

No answers yet