I am reading transactions from a Kafka topic in JSON format. I then apply some transformations to group them by txn_status. Below is the schema.
```
root
 |-- window: struct (nullable = true)
 |    |-- start: timestamp (nullable = true)
 |    |-- end: timestamp (nullable = true)
 |-- txn_status: string (nullable = true)
 |-- count: long (nullable = false)
```
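For context, a schema like this is what a windowed count aggregation in Structured Streaming typically produces. A minimal sketch of such a query, assuming the question's setup (`parsedDF`, the `event_time` column, and the window/watermark durations are all assumptions, not taken from the question):

```scala
import org.apache.spark.sql.functions._
import spark.implicits._ // assumes a SparkSession named `spark` is in scope

// Hypothetical windowed aggregation that would yield the schema above;
// `parsedDF` and the column/window names are illustrative only.
val aggDF = parsedDF
  .withWatermark("event_time", "5 minutes")
  .groupBy(window($"event_time", "2 minutes"), $"txn_status")
  .count()
```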
After grouping by the given window, my batch output looks like this: [![batch output][1]][1]
But I want the output in a JSON format like this:
```json
{
  "start_end_time": "28/12/2018 11:32:00.000",
  "count_Total": 6,
  "count_RCVD": 5,
  "count_FAILED": 1
}
```

> How do I combine two rows in a Spark dataset?

[1]: https://i.stack.imgur.com/sCJuX.jpg
Answer 0 (score: 0)
Based on the image you showed, I created a dataframe/temp table and worked out a solution to your problem.
Scala code:
```scala
// Case class modelling one aggregated row from the screenshot.
case class txn_rec(txn_status: String, count: Int, start_end_time: String)

// Recreate the two rows shown in the image as a DataFrame.
val txDf = sc.parallelize(Seq(
  txn_rec("FAIL", 9, "2019-03-08 16:40:00, 2019-03-08 16:57:00"),
  txn_rec("RCVD", 161, "2019-03-08 16:40:00, 2019-03-08 16:57:00"))).toDF

txDf.createOrReplaceTempView("temp")

// Scalar subqueries pull the total, RCVD and FAIL counts into a single row per window.
val resDF = spark.sql(
  """select start_end_time,
    |       (select sum(count) from temp) as total_count,
    |       (select count from temp where txn_status = 'RCVD') as rcvd_count,
    |       (select count from temp where txn_status = 'FAIL') as failed_count
    |from temp
    |group by start_end_time""".stripMargin)

resDF.show
resDF.toJSON.collectAsList.toString
```
You can see the output as shown in the screenshot.
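As a possible alternative (a sketch, not part of the original answer), the same one-row-per-window result can be produced without SQL subqueries by pivoting on `txn_status`; the output column names here are assumptions:

```scala
import org.apache.spark.sql.functions._

// Pivot txn_status values into columns, then derive the total from them.
val pivoted = txDf
  .groupBy("start_end_time")
  .pivot("txn_status", Seq("RCVD", "FAIL"))
  .sum("count")
  .withColumn("total_count", col("RCVD") + col("FAIL"))

// Each row serializes to a JSON string much like the desired output.
pivoted.toJSON.show(false)
```

Passing the value list `Seq("RCVD", "FAIL")` to `pivot` avoids an extra pass over the data to discover the distinct statuses.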