I am using Java with Spark.
I have the following records (as JSON strings) in an RDD from Kafka:
{"code":"123", "date":"14/07/2018",....}
{"code":"124", "date":"15/07/2018",....}
{"code":"123", "date":"15/07/2018",....}
{"code":"125", "date":"14/07/2018",....}
I then read that into a Dataset as follows:
Dataset<Row> df = sparkSession.read().json(jsonSet);
Dataset<Row> dfSelect = df.select(cols);//Where cols is Column[]
I want to write the JSON records to different Hive tables and different partitions by mapping them into separate datasets, meaning:
{"code":"123", "date":"14/07/2018",....} Write to HDFS dir -> /../table123/partition=14_07_2018
{"code":"124", "date":"15/07/2018",....} Write to HDFS dir -> /../table124/partition=15_07_2018
{"code":"123", "date":"15/07/2018",....} Write to HDFS dir -> /../table123/partition=15_07_2018
{"code":"125", "date":"14/07/2018",....} Write to HDFS dir -> /../table125/partition=14_07_2018
How can I map the JSON by code and date and then write each piece like this:
dfSelectByTableAndDate123.write().format("parquet").mode("append").save(pathByTableAndDate);
dfSelectByTableAndDate124.write().format("parquet").mode("append").save(pathByTableAndDate);
dfSelectByTableAndDate125.write().format("parquet").mode("append").save(pathByTableAndDate);
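For the partition value I intend to turn the slashes in the date into underscores (14/07/2018 -> 14_07_2018). A rough sketch of how I would derive that column (the column name "partition" is just my choice):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.regexp_replace;

// Derive a partition-friendly value from the date column: 14/07/2018 -> 14_07_2018
Dataset<Row> withPartition = dfSelect.withColumn(
        "partition", regexp_replace(col("date"), "/", "_"));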
Thanks
Answer 0 (score: 1)
You can convert the JSON to Java objects and then group them by date, which gives you the rows that share the same date. You can then write each resulting collection wherever you need. The snippet below is pseudo-code in Scala:
case class MyType(code: String, date: String)

val newDs = df.as[MyType]             // needs import spark.implicits._ in scope
val byDate = newDs.groupByKey(_.date) // Dataset has no reduceByKey; groupByKey keys the rows by date
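Since the question uses Java, here is a rough Java counterpart of the same idea. Because Datasets have no reduceByKey, one workable pattern is to collect the distinct dates and write each date's slice separately; MyType, the loop, and the output path are illustrative assumptions, not tested code:

import java.io.Serializable;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import static org.apache.spark.sql.functions.col;

// Java bean mirroring the case class above
public class MyType implements Serializable {
    private String code;
    private String date;
    public String getCode() { return code; }
    public void setCode(String code) { this.code = code; }
    public String getDate() { return date; }
    public void setDate(String date) { this.date = date; }
}

Dataset<MyType> newDs = df.as(Encoders.bean(MyType.class));

// Collect the distinct dates on the driver (assumes the set of dates is small),
// then write each date's slice into its own partition directory
List<String> dates = newDs.select("date").as(Encoders.STRING())
        .distinct().collectAsList();
for (String d : dates) {
    newDs.filter(col("date").equalTo(d))
         .write().format("parquet").mode("append")
         .save("/../table/partition=" + d.replace("/", "_")); // path shape follows the question
}

The same loop extends naturally to keying on (code, date) pairs, so the table directory can be chosen per code as well.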