Split a dataframe into multiple dataframes in Scala Spark

Asked: 2019-09-27 17:43:33

Tags: json scala apache-spark

I have the JSON file below (details) in Hadoop. I can read this file from HDFS using sqlContext read json. I then want to split it into multiple files based on the date, with the date appended to each file name (the file can contain any number of dates).

Input file name: details

{"Name": "Pam", "Address": "", "Gender":"F", "Date": "2019-09-27 06:47:57"}
{"Name": "David", "Address": "", "Gender":"M", "Date": "2019-09-27 10:47:56"}
{"Name": "Mike", "Address": "", "Gender":"M", "Date": "2019-09-26 08:48:57"}

Expected output files:

File name 1: details_20190927

{"Name": "Pam", "Address": "", "Gender":"F", "Date": "2019-09-27 06:47:57"}
{"Name": "David", "Address": "", "Gender":"M", "Date": "2019-09-27 10:47:56"}

File name 2: details_20190926

{"Name": "Mike", "Address": "", "Gender":"M", "Date": "2019-09-26 08:48:57"}

1 Answer:

Answer 0 (score: 0)

The paths will not be exactly what you specified, but you can write the records to different files as shown below:

import org.apache.spark.sql.functions._
import spark.implicits._

// Read the JSON lines file from HDFS
val parsed = spark.read.json("details.json")

// Repartition by Date so each date's records end up in the same partition
val repartitioned = parsed.repartition(col("Date"))

// Derive a yyyyMMdd partition column from the timestamp
val withPartitionValue = repartitioned.withColumn("PartitionValue", date_format(col("Date"), "yyyyMMdd"))

// Write one folder per date, e.g. /my/output/folder/PartitionValue=20190927/
withPartitionValue.write.partitionBy("PartitionValue").json("/my/output/folder")
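
Note that partitionBy writes one sub-directory per date, e.g. /my/output/folder/PartitionValue=20190927/, containing part files rather than a single file named details_20190927. If the exact details_yyyyMMdd naming matters, here is a minimal alternative sketch (assuming the number of distinct dates is small and that one part file per date via coalesce(1) is acceptable; the /my/output path is a placeholder): collect the distinct dates and write each subset on its own.

import org.apache.spark.sql.functions.{col, date_format}

val parsed = spark.read.json("details.json")
val withDay = parsed.withColumn("Day", date_format(col("Date"), "yyyyMMdd"))

// Collect the distinct dates on the driver (assumes only a handful of dates)
val days = withDay.select("Day").distinct().collect().map(_.getString(0))

// Write each date's records into its own folder, e.g. /my/output/details_20190927
days.foreach { day =>
  withDay.filter(col("Day") === day)
    .drop("Day")
    .coalesce(1) // single part file per date; skip this for very large dates
    .write.json(s"/my/output/details_$day")
}

Each of these paths is still a directory containing a single part file; renaming it to a bare details_yyyyMMdd file would need an extra step with the Hadoop FileSystem API.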