Question

How can I do a stateless aggregation in Spark with Structured Streaming 2.3.0, without using flatMapGroupsWithState or the DStream API? I am looking for a more declarative approach.

Example:

doc_id

I want the output to count only the records available in each batch, without carrying over the aggregate from previous batches.

Answer 1

To do a stateless aggregation in Spark with Structured Streaming 2.3.0 without using flatMapGroupsWithState or the DStream API, you can use the following code:

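import org.apache.spark.sql.functions.{col, lit}
import spark.implicits._

// Ignores the key and counts the rows in the group for the current batch.
val countValues = (_: String, it: Iterator[(String, String)]) => it.length

val query =
  dataStream
    // Constant key, so every row of a batch lands in one group.
    .select(lit("a").as("newKey"), col("value"))
    .as[(String, String)]
    .groupByKey { case (newKey, _) => newKey }
    .mapGroups[Int](countValues)
    .writeStream
    .format("console")
    .start()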

Here is what this does:

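We added a column, newKey, to dataStream so that we can group on it with groupByKey. I used the literal string "a", but you can use anything. You also need to select some column from those available in dataStream; I picked the value column, but any column will do.
We created a mapping function, countValues, that counts the rows in each group produced by groupByKey by calling it.length.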
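To make the two steps above concrete without a running cluster, here is a minimal sketch that imitates them with plain Scala collections. It is not Spark code: countBatch and the sample batches are hypothetical names and data made up for illustration, and each Seq merely stands in for one micro-batch.

```scala
object PerBatchCountSketch {
  // Same shape as countValues in the answer: ignore the key, count the rows of the group.
  val countValues: (String, Iterator[(String, String)]) => Int =
    (_, it) => it.length

  // Hypothetical stand-in for one micro-batch: tag every value with the
  // constant key "a", group on that key, then apply countValues per group.
  def countBatch(values: Seq[String]): Seq[Int] =
    values
      .map(v => ("a", v))
      .groupBy { case (newKey, _) => newKey }
      .map { case (key, rows) => countValues(key, rows.iterator) }
      .toSeq

  def main(args: Array[String]): Unit = {
    // Two made-up batches; each one is counted independently of the other.
    val batches = Seq(Seq("x", "y", "z"), Seq("p", "q"))
    batches.foreach(batch => println(countBatch(batch)))
  }
}
```

Because nothing is kept between calls to countBatch, the first batch yields a count of 3 and the second a count of 2, rather than a running total of 5. An empty batch yields no group at all, so no count is produced for it.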

This way, we count the records available in each batch without aggregating anything from previous batches.
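The snippet in this answer leaves dataStream undefined. For anyone who wants to try it end to end, the following is a rough, self-contained sketch. It assumes Spark's built-in rate source as a stand-in input (the original question does not say which source is used), a local master, and an object name, PerBatchCount, chosen only for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit}

object PerBatchCount {
  // Ignores the key and counts the rows handed to this group in the current batch.
  val countValues: (String, Iterator[(String, String)]) => Int =
    (_, it) => it.length

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("per-batch-count")
      .master("local[2]") // assumption: local run for experimentation
      .getOrCreate()
    import spark.implicits._

    // Assumption: the rate source stands in for the real input.
    // It emits rows with a `timestamp` and a numeric `value` column.
    val dataStream = spark.readStream
      .format("rate")
      .option("rowsPerSecond", 5L)
      .load()

    val query = dataStream
      // Constant key so that all rows of a batch fall into a single group.
      .select(lit("a").as("newKey"), col("value").cast("string").as("value"))
      .as[(String, String)]
      .groupByKey { case (newKey, _) => newKey }
      .mapGroups[Int](countValues)
      .writeStream
      .format("console")
      .outputMode("append")
      .start()

    query.awaitTermination()
  }
}
```

Since mapGroups keeps no state across triggers, each console batch should show only the number of rows that arrived in that trigger. The exact numbers depend on timing, so none are shown here.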

I hope this helps!
