I am very new to Scala, and I have a CSV file:
MSH ModZId ModProd Date
1140000 zzz abc 2/19/2018
1140000 zzz abc 2/19/2018
651 zzz abc 2/19/2018
651 zzz abc 2/19/2018
1140000 zzz abc 2/19/2018
860000 zzz mno 2/26/2018
860000 zzz mno 2/26/2018
122 zzz mno 2/26/2018
122 zzz mno 2/26/2018
860000 zzz mno 2/26/2018
1140000 zzz pxy 2/19/2018
1140000 zzz pxy 2/19/2018
I need to partition the CSV file by date and convert each partition to Parquet, as shown below:
Folder name 2018/02/19
Parquet file 1 output:
MSH ModZId ModProd Date
1140000 zzz abc 2/19/2018
1140000 zzz xyz 2/19/2018
651 zzz def 2/19/2018
651 zzz ghi 2/19/2018
1140000 zzz klm 2/19/2018
Parquet file 2 output:
MSH ModZId ModProd Date
1140000 zzz pxy 2/19/2018
1140000 zzz pxy 2/19/2018
Folder name 20180226
MSH ModZId ModProd Date
860000 zzz mno 2/26/2018
860000 zzz pqr 2/26/2018
122 zzz stu 2/26/2018
122 zzz wxy 2/26/2018
860000 zzz ijk 2/26/2018
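The folder names above look like the Date column rewritten from M/d/yyyy into a compact form. A minimal sketch of that conversion in plain Scala, assuming the intended folder name is yyyyMMdd (as in 20180226); the object name DateFolders and the function folderName are hypothetical, made up for illustration:

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

object DateFolders {
  // Format of the Date column as it appears in the sample, e.g. 2/19/2018
  private val inputFormat = DateTimeFormatter.ofPattern("M/d/yyyy")
  // Assumed folder-name format, e.g. 20180219
  private val folderFormat = DateTimeFormatter.ofPattern("yyyyMMdd")

  // Turns a Date value from the CSV into a folder name
  def folderName(date: String): String =
    LocalDate.parse(date, inputFormat).format(folderFormat)

  def main(args: Array[String]): Unit = {
    Seq("2/19/2018", "2/26/2018").foreach { d =>
      println(s"$d -> ${folderName(d)}")
    }
  }
}
```

With the two dates from the sample this gives 20180219 and 20180226.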
I am trying the approach below, but I am not sure how to iterate over the DataFrame:
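import org.apache.spark.sql.SaveMode
import spark.implicits._ // needed for the $"col" syntax; assumes a SparkSession named spark

// Distinct (ModProd, Date) pairs, shown only for inspection
val writeDF = df
  .select($"ModProd", $"Date").distinct().orderBy($"ModProd", $"Date")
writeDF.show()

// Write the full DataFrame as Parquet, one sub-folder per distinct Date value
df
  .write
  .mode(SaveMode.Overwrite)
  .format("parquet")
  .partitionBy("Date")
  .save(Path)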
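For comparison, here is a rough sketch of one way the whole job could look without iterating over the DataFrame at all, since partitionBy already splits the rows by value. It is only a sketch under several assumptions: the paths input.csv and output are placeholders, the file is taken to be space-delimited with a header row (as the sample appears), Spark is run locally, and the extra column name folder is made up for illustration.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions.{col, date_format, to_date}

object PartitionCsvByDate {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("PartitionCsvByDate")
      .master("local[*]") // assumption: local run
      .getOrCreate()

    // Hypothetical paths, replace with real ones
    val inputPath = "input.csv"
    val outputPath = "output"

    val df = spark.read
      .option("header", "true")
      .option("sep", " ") // assumption: space-delimited, as in the sample
      .csv(inputPath)

    // Derive a yyyyMMdd string from Date (M/d/yyyy) to partition on
    val withFolder = df.withColumn(
      "folder",
      date_format(to_date(col("Date"), "M/d/yyyy"), "yyyyMMdd"))

    // Writes one directory per distinct value, named folder=<value>
    withFolder.write
      .mode(SaveMode.Overwrite)
      .partitionBy("folder")
      .parquet(outputPath)

    spark.stop()
  }
}
```

Two things to keep in mind with this approach: Spark names the directories folder=20180219 rather than plain 20180219, and the partition column itself is not stored inside the Parquet files, which is why the sketch partitions on a derived column and leaves Date in the data. How many Parquet files end up in each directory depends on how the DataFrame is partitioned in memory at write time.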
Can anyone help me? I am very new to this and do not know how to partition a CSV file by date in Scala.