Question

我使用KafkaIO API从Kafka主题传输消息 https://beam.apache.org/documentation/sdks/javadoc/2.0.0/org/apache/beam/sdk/io/kafka/KafkaIO.html

管道流程如下：

KafkaStream - ＆gt;使用转换器解码消息 - ＆gt;保存到BigQuery

我解码消息并使用BigQueryIO保存到BigQuery。我想知道我是否需要使用窗口。

Window.into[Array[Byte]](FixedWindows.of(Duration.standardSeconds(10)))
        .triggering(
          Repeatedly
            .forever(
              AfterProcessingTime
                .pastFirstElementInPane()
                .plusDelayOf(Duration.standardSeconds(10))
            )
        )
        .withAllowedLateness(Duration.standardSeconds(0))
        .discardingFiredPanes()
    )

根据文档说明，如果我们正在进行任何计算，如GroupByKey等，则需要窗口。由于我只是解码Array Byte消息并将它们存储到BigQuery中，因此可能不需要。

请告诉我，是否需要使用窗口？

Answer 1

已经有一个答案已发布到similar question，其中数据来自PubSub。主要思想是，由于不断添加新元素，因此无法收集无界PCollections的所有元素，因此必须实施以下两种策略之一：

Windowing：您应首先设置非全局窗口函数。
Triggers：您可以设置无界PCollection的触发器，使其无限制数据集定期更新，即使订阅中的数据仍在流动

通过使用以下命令设置选项的相应arg参数，可能还需要在管道中启用Streaming：

pipeline_options.view_as(StandardOptions).streaming = True

来自KafkaIO的Apache Beam Stream - 窗口需求

1 个答案: