I need to include the DataFrame's partition columns in the part files, along with the DataFrame's header names.
Input file (Sample.csv):
ID|VERSION|RECEIPTREFID|TOTAL|STATUS|TXNSTARTTIME|LAST_MODIFIED_DATE
207|6.1.2.5|01072018|174.25|COMPLETED|2018-07-01 20:26:46.0|2018-07-01 21:10:39.0
144|6.1.2.5|072018|180.25|NOT COMPLETED|2019-02-01 21:18:08.0|2019-02-01 21:42:38.0
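The partition columns are year, month, and date.
Code:
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions._

val dfCsv = spark.read.option("delimiter", "|").option("header", "true").csv("/HdfsPath/Sample.csv")
// Derive the partition columns from the transaction start time
val dfAddcolumns = dfCsv
  .withColumn("year", year(to_date(col("txnstarttime"))))
  .withColumn("month", month(to_date(col("txnstarttime"))))
  .withColumn("date", to_date(col("txnstarttime")))
// Build a row key from id + txnstarttime, then strip the separator characters
val pk = "id,txnstarttime"
val dfTemp = dfAddcolumns.withColumn("pk", concat_ws("", pk.split(",").map(c => col(c)): _*))
val dfRowkey = dfTemp.withColumn("pk", regexp_replace(col("pk"), "[-:.,/ ]", ""))
// Copy the partition columns under new names so the copies stay in the part files
dfRowkey
  .withColumn("year1", col("year"))
  .withColumn("month1", col("month"))
  .withColumn("date1", col("date"))
  .write.mode(SaveMode.Overwrite)
  .partitionBy("year", "month", "date")
  .option("header", "true")
  .csv("/HdfsPath/output/")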
Output:
Path: /HDFS Path/year=2018/month=7/date=2018-07-01
ID,VERSION,RECEIPTREFID,TOTAL,STATUS,TXNSTARTTIME,LAST_MODIFIED_DATE,pk,year1,month1,date1
207,6.1.2.5,01072018,174.25,COMPLETED,2018-07-01 20:26:46.0,2018-07-01 21:10:39.0,207201807012026460,2018,7,2018-07-01
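Here I need those columns to be named year, month, and date (not year1, month1, and date1).
If I change the code as shown below, the partition columns are not written to the part files at all:
dfRowkey.withColumn("year", col("year")).withColumn("month", col("month")).withColumn("date", col("date")).write.mode(SaveMode.Overwrite).partitionBy("year", "month", "date").option("header", "true").csv("/HdfsPath/output/")
The output is then: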
ID,VERSION,RECEIPTREFID,TOTAL,STATUS,TXNSTARTTIME,LAST_MODIFIED_DATE,pk
207,6.1.2.5,01072018,174.25,COMPLETED,2018-07-01 20:26:46.0,2018-07-01 21:10:39.0,207201807012026460
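A likely reason for this result: `withColumn` with a name that already exists replaces that column rather than adding a copy, so the DataFrame still has a single year, month, and date column, and `partitionBy` moves those into the directory names instead of the part files. The following is a minimal sketch of the first half of that, assuming a local SparkSession and a made-up three-column DataFrame (the data and the names `df` and `same` are illustrative, not from the job above):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a throwaway local session, only for this illustration
val spark = SparkSession.builder().master("local[1]").appName("withColumnNoOp").getOrCreate()
import spark.implicits._

// Made-up sample data with columns that already carry the partition names
val df = Seq((207, 2018, 7)).toDF("id", "year", "month")

// Re-adding a column under its own name replaces it; no duplicate is created
val same = df.withColumn("year", col("year")).withColumn("month", col("month"))

// The column list is unchanged
assert(same.columns.sameElements(df.columns))
```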
Expected output:
Path (directory structure): /HDFS Path/year=2018/month=7/date=2018-07-01
The part file should contain:
ID,VERSION,RECEIPTREFID,TOTAL,STATUS,TXNSTARTTIME,LAST_MODIFIED_DATE,pk,year,month,date
207,6.1.2.5,01072018,174.25,COMPLETED,2018-07-01 20:26:46.0,2018-07-01 21:10:39.0,207201807012026460,2018,7,2018-07-01
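One possible way to get both the year=/month=/date= directory layout and the same three columns inside the part files is to skip `partitionBy` and write each partition to its own explicitly built path. The sketch below is only that, a sketch under assumptions: the helper names `partitionDir` and `writeKeepingPartitionColumns` are hypothetical, it runs one write job per distinct (year, month, date) combination, and rows whose partition values are null are not written because the equality filter does not match them.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}
import org.apache.spark.sql.functions.col

// Hypothetical helper: builds the Hive-style directory for one partition
def partitionDir(basePath: String, year: Any, month: Any, date: Any): String =
  s"${basePath.stripSuffix("/")}/year=$year/month=$month/date=$date"

// Hypothetical helper: writes one CSV directory per (year, month, date)
// combination without partitionBy, so the three columns stay in the files
def writeKeepingPartitionColumns(df: DataFrame, basePath: String): Unit = {
  // Collect the distinct partition keys to the driver (assumed to be few)
  val keys = df.select("year", "month", "date").distinct().collect()
  keys.foreach { key =>
    val (y, m, d) = (key.get(0), key.get(1), key.get(2))
    df.filter(col("year") === y && col("month") === m && col("date") === d)
      .write
      .mode(SaveMode.Overwrite) // overwrites only this partition's directory
      .option("header", "true")
      .csv(partitionDir(basePath, y, m, d))
  }
}
```

With the DataFrame from the question, the call would be `writeKeepingPartitionColumns(dfRowkey, "/HdfsPath/output")`. Unlike the original `partitionBy` write, `SaveMode.Overwrite` here replaces only the leaf directories that are written, not the whole output directory, and the approach scales poorly when there are many partitions.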