How to write all rows of a streaming DataFrame to Kafka as a JSON array?

Asked: 2019-03-08 20:07:01

Tags: apache-spark apache-kafka spark-structured-streaming

I am looking for a way to write Spark Streaming data to Kafka. I am currently writing the data to Kafka with the following:

df.selectExpr("to_json(struct(*)) AS value").writeStream.format("kafka")

But my problem is that the data written to Kafka looks like this:

{"country":"US","plan":postpaid,"value":300}
{"country":"CAN","plan":0.0,"value":30}

My expected output is:

   [
    {"country":"US","plan":postpaid,"value":300}
    {"country":"CAN","plan":0.0,"value":30}
   ]

I want to wrap the rows inside an array. How can I achieve this in Spark Streaming? Can anyone advise?

2 answers:

Answer 0 (score: 1)

I assume the schema of the streaming DataFrame (df) is as follows:

root
 |-- country: string (nullable = true)
 |-- plan: string (nullable = true)
 |-- value: string (nullable = true)

I also assume that you want to write (produce) all rows of the streaming DataFrame (df) as a single record in which the rows form a JSON array.

If so, you should groupBy the rows and use collect_list to combine all of the grouped rows into a single row, then write that out to Kafka.

// df is a batch DataFrame so I could show for demo purposes
scala> df.show
+-------+--------+-----+
|country|    plan|value|
+-------+--------+-----+
|     US|postpaid|  300|
|    CAN|     0.0|   30|
+-------+--------+-----+

val jsons = df.selectExpr("to_json(struct(*)) AS value")
scala> jsons.show(truncate = false)
+------------------------------------------------+
|value                                           |
+------------------------------------------------+
|{"country":"US","plan":"postpaid","value":"300"}|
|{"country":"CAN","plan":"0.0","value":"30"}     |
+------------------------------------------------+

val grouped = jsons.groupBy().agg(collect_list("value") as "value")
scala> grouped.show(truncate = false)
+-----------------------------------------------------------------------------------------------+
|value                                                                                          |
+-----------------------------------------------------------------------------------------------+
|[{"country":"US","plan":"postpaid","value":"300"}, {"country":"CAN","plan":"0.0","value":"30"}]|
+-----------------------------------------------------------------------------------------------+

I would do all of the above inside DataStreamWriter.foreachBatch, which gives you a batch DataFrame to work with.
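A minimal sketch of that foreachBatch approach, assuming a broker at localhost:9092 and a topic named "out-topic" (both placeholders). One caveat: Kafka's value column must be a string or binary, so instead of leaving collect_list's array<string> as-is, the collected JSON strings are joined into one JSON-array string with array_join:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{array_join, collect_list, concat, lit}

df.selectExpr("to_json(struct(*)) AS value")
  .writeStream
  .foreachBatch { (batch: DataFrame, _: Long) =>
    batch
      .groupBy() // one global group: all rows of this micro-batch
      .agg(concat(lit("["), array_join(collect_list("value"), ","), lit("]")) as "value")
      .write
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("topic", "out-topic")
      .save()
  }
  .start()
```

Note that grouping everything into a single row pulls the whole micro-batch through one task, so this only makes sense for modest batch sizes.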

Answer 1 (score: 0)

I am really not sure whether this is achievable, but I will post my suggestion here anyway. What you can do is transform your DataFrame afterwards:

 //Input  
 inputDF.show(false)
 +---+-------+
 |int|string |
 +---+-------+
 |1  |string1|
 |2  |string2|
 +---+-------+

 //convert that to json
 inputDF.toJSON.show(false)
 +----------------------------+
 |value                       |
 +----------------------------+
 |{"int":1,"string":"string1"}|
 |{"int":2,"string":"string2"}|
 +----------------------------+

 //then use collect and mkString
 println(inputDF.toJSON.collect().mkString("[", "," , "]"))
 [{"int":1,"string":"string1"},{"int":2,"string":"string2"}]
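On a streaming DataFrame, collect() cannot be called directly, so this approach would also have to run inside foreachBatch. A hedged sketch along the same lines (the broker address and topic name are placeholders):

```scala
import org.apache.spark.sql.DataFrame

inputDF.writeStream
  .foreachBatch { (batch: DataFrame, _: Long) =>
    // Build the JSON-array string for this micro-batch on the driver
    val payload = batch.toJSON.collect().mkString("[", ",", "]")
    import batch.sparkSession.implicits._
    // Wrap the single string in a one-row DataFrame and produce it to Kafka
    Seq(payload).toDF("value")
      .write
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("topic", "out-topic")
      .save()
  }
  .start()
```

Like the groupBy/collect_list approach above, this collects the entire micro-batch onto the driver, so it is only suitable for small batches.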