I am using Java with Spark.
I have the following records (as JSON strings) in an RDD from Kafka:
{"code":"123", "date":"14/07/2018",....}
{"code":"124", "date":"15/07/2018",....}
{"code":"123", "date":"15/07/2018",....}
{"code":"125", "date":"14/07/2018",....}
I then read that into a Dataset as follows:
Dataset<Row> df = sparkSession.read().json(jsonSet);
Dataset<Row> dfSelect = df.select(cols);//Where cols is Column[]
I want to write the JSON records to different Hive tables and different partitions by mapping them into separate datasets, meaning:
{"code":"123", "date":"14/07/2018",....} Write to HDFS dir -> /../table123/partition=14_07_2018
{"code":"124", "date":"15/07/2018",....} Write to HDFS dir -> /../table124/partition=15_07_2018
{"code":"123", "date":"15/07/2018",....} Write to HDFS dir -> /../table123/partition=15_07_2018
{"code":"125", "date":"14/07/2018",....} Write to HDFS dir -> /../table125/partition=14_07_2018
How can I map the JSON by code and date and then write each piece like this:
dfSelectByTableAndDate123.write().format("parquet").mode("append").save(pathByTableAndDate);
dfSelectByTableAndDate124.write().format("parquet").mode("append").save(pathByTableAndDate);
dfSelectByTableAndDate125.write().format("parquet").mode("append").save(pathByTableAndDate);
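For the partition value I intend to turn the slashes in the date into underscores (14/07/2018 -> 14_07_2018). A rough sketch of how I would derive that column (the column name "partition" is just my choice):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.regexp_replace;

// Derive a partition-friendly value from the date column: 14/07/2018 -> 14_07_2018
Dataset<Row> withPartition = dfSelect.withColumn(
        "partition", regexp_replace(col("date"), "/", "_"));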
Thanks
Answer 0 (score: 1)
You can convert the JSON to Java objects and then group them by date, which gives you the rows that share the same date. You can then write each resulting collection wherever you need. The snippet below is pseudo-code in Scala:
case class MyType(code: String, date: String)

val newDs = df.as[MyType]             // needs import spark.implicits._ in scope
val byDate = newDs.groupByKey(_.date) // Dataset has no reduceByKey; groupByKey keys the rows by date
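Since the question uses Java, here is a rough Java counterpart of the same idea. Because Datasets have no reduceByKey, one workable pattern is to collect the distinct dates and write each date's slice separately; MyType, the loop, and the output path are illustrative assumptions, not tested code:

import java.io.Serializable;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import static org.apache.spark.sql.functions.col;

// Java bean mirroring the case class above
public class MyType implements Serializable {
    private String code;
    private String date;
    public String getCode() { return code; }
    public void setCode(String code) { this.code = code; }
    public String getDate() { return date; }
    public void setDate(String date) { this.date = date; }
}

Dataset<MyType> newDs = df.as(Encoders.bean(MyType.class));

// Collect the distinct dates on the driver (assumes the set of dates is small),
// then write each date's slice into its own partition directory
List<String> dates = newDs.select("date").as(Encoders.STRING())
        .distinct().collectAsList();
for (String d : dates) {
    newDs.filter(col("date").equalTo(d))
         .write().format("parquet").mode("append")
         .save("/../table/partition=" + d.replace("/", "_")); // path shape follows the question
}

The same loop extends naturally to keying on (code, date) pairs, so the table directory can be chosen per code as well.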