I am reading transactions from a Kafka topic in JSON format. I then apply some transformations to group them by txn_status. Below is the schema.
```
root
 |-- window: struct (nullable = true)
 |    |-- start: timestamp (nullable = true)
 |    |-- end: timestamp (nullable = true)
 |-- txn_status: string (nullable = true)
 |-- count: long (nullable = false)
```
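For context, a schema like this is what a windowed count aggregation in Structured Streaming typically produces. A minimal sketch of such a query, assuming the question's setup (`parsedDF`, the `event_time` column, and the window/watermark durations are all assumptions, not taken from the question):

```scala
import org.apache.spark.sql.functions._
import spark.implicits._ // assumes a SparkSession named `spark` is in scope

// Hypothetical windowed aggregation that would yield the schema above;
// `parsedDF` and the column/window names are illustrative only.
val aggDF = parsedDF
  .withWatermark("event_time", "5 minutes")
  .groupBy(window($"event_time", "2 minutes"), $"txn_status")
  .count()
```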
After grouping by the given window, my batch output looks like this: [![batch output][1]][1]
But I want the output in a JSON format like this:
```json
{
  "start_end_time": "28/12/2018 11:32:00.000",
  "count_Total": 6,
  "count_RCVD": 5,
  "count_FAILED": 1
}
```

> How do I combine two rows in a Spark dataset?

[1]: https://i.stack.imgur.com/sCJuX.jpg
Answer 0 (score: 0)
Based on the image you showed, I created a dataframe/temp table and worked out a solution to your problem.
Scala code:
```scala
// Case class modelling one aggregated row from the screenshot.
case class txn_rec(txn_status: String, count: Int, start_end_time: String)

// Recreate the two rows shown in the image as a DataFrame.
val txDf = sc.parallelize(Seq(
  txn_rec("FAIL", 9, "2019-03-08 16:40:00, 2019-03-08 16:57:00"),
  txn_rec("RCVD", 161, "2019-03-08 16:40:00, 2019-03-08 16:57:00"))).toDF

txDf.createOrReplaceTempView("temp")

// Scalar subqueries pull the total, RCVD and FAIL counts into a single row per window.
val resDF = spark.sql(
  """select start_end_time,
    |       (select sum(count) from temp) as total_count,
    |       (select count from temp where txn_status = 'RCVD') as rcvd_count,
    |       (select count from temp where txn_status = 'FAIL') as failed_count
    |from temp
    |group by start_end_time""".stripMargin)

resDF.show
resDF.toJSON.collectAsList.toString
```
You can see the output as shown in the screenshot.
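As a possible alternative (a sketch, not part of the original answer), the same one-row-per-window result can be produced without SQL subqueries by pivoting on `txn_status`; the output column names here are assumptions:

```scala
import org.apache.spark.sql.functions._

// Pivot txn_status values into columns, then derive the total from them.
val pivoted = txDf
  .groupBy("start_end_time")
  .pivot("txn_status", Seq("RCVD", "FAIL"))
  .sum("count")
  .withColumn("total_count", col("RCVD") + col("FAIL"))

// Each row serializes to a JSON string much like the desired output.
pivoted.toJSON.show(false)
```

Passing the value list `Seq("RCVD", "FAIL")` to `pivot` avoids an extra pass over the data to discover the distinct statuses.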