I have the following JSON file (details) in Hadoop. I can read this file from HDFS using `sqlContext.read.json`. I then want to split the file into multiple files based on the date, adding the date to each file name (the file can contain any number of dates).
Input file name: details
{"Name": "Pam", "Address": "", "Gender":"F", "Date": "2019-09-27 06:47:57"}
{"Name": "David", "Address": "", "Gender":"M", "Date": "2019-09-27 10:47:56"}
{"Name": "Mike", "Address": "", "Gender":"M", "Date": "2019-09-26 08:48:57"}
Expected output files:
File name 1: details_20190927
{"Name": "Pam", "Address": "", "Gender":"F", "Date": "2019-09-27 06:47:57"}
{"Name": "David", "Address": "", "Gender":"M", "Date": "2019-09-27 10:47:56"}
File name 2: details_20190926
{"Name": "Mike", "Address": "", "Gender":"M", "Date": "2019-09-26 08:48:57"}
Answer (score: 0):
The paths will not be exactly the ones you specified, but you can write the records to separate files like this:
import org.apache.spark.sql.functions._
import spark.implicits._

// Read the newline-delimited JSON file
val parsed = spark.read.json("details.json")

// Derive a yyyyMMdd partition key from the timestamp
val withPartitionValue = parsed.withColumn("PartitionValue", date_format(col("Date"), "yyyyMMdd"))

// Group rows with the same day into the same partition, so each date
// folder is written as a single file
val repartitioned = withPartitionValue.repartition(col("PartitionValue"))

// Each distinct PartitionValue becomes its own subdirectory
repartitioned.write.partitionBy("PartitionValue").json("/my/output/folder")
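Note that `partitionBy` produces directories like `PartitionValue=20190927/part-*.json` rather than single files named `details_20190927`. If the input is small enough to fit on one machine, the same split can also be sketched without Spark in plain Scala. This is an illustrative sketch, not part of the answer above: the embedded sample records and the `fileKey` helper are assumptions for the demo, and the regex-based date extraction assumes one JSON object per line with a `"Date"` field in `yyyy-MM-dd HH:mm:ss` format.

```scala
object SplitByDate {
  // Sample records matching the question's input (embedded so the demo is self-contained;
  // in a real run these would be read from the input file)
  val records: Seq[String] = Seq(
    """{"Name": "Pam", "Address": "", "Gender":"F", "Date": "2019-09-27 06:47:57"}""",
    """{"Name": "David", "Address": "", "Gender":"M", "Date": "2019-09-27 10:47:56"}""",
    """{"Name": "Mike", "Address": "", "Gender":"M", "Date": "2019-09-26 08:48:57"}"""
  )

  // Pull the yyyy-MM-dd part of the Date field and drop the dashes
  val datePattern = """"Date":\s*"(\d{4})-(\d{2})-(\d{2})""".r.unanchored

  // Map a record line to its target file name (hypothetical helper)
  def fileKey(line: String): String = line match {
    case datePattern(y, m, d) => s"details_$y$m$d"
    case _                    => "details_unknown"
  }

  // Group the lines by target file name; a real run would then write
  // each group out with java.nio.file.Files.write
  val grouped: Map[String, Seq[String]] = records.groupBy(fileKey)

  def main(args: Array[String]): Unit =
    grouped.foreach { case (name, lines) =>
      println(s"$name -> ${lines.size} record(s)")
    }
}
```

This keeps each date's records together under the exact file names the question asks for, at the cost of loading everything into memory, so it only suits inputs that do not need Spark's distributed processing.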