Question

I am getting unexpected results in the cloud.

My pipeline looks like this:

SlidingWindow(60min).every(1min)
    .triggering(Repeatedly.forever(
        AfterWatermark.pastEndOfWindow()
            .withEarlyFirings(AfterProcessingTime
                .pastFirstElementInPane()
                .plusDelayOf(Duration.standardSeconds(30)))))
    .withAllowedLateness(15sec)
    .accumulatingFiredPanes()
.apply("Get UniqueCounts", ApproximateUnique.perKey(.05))
.apply("Window hack filter", ParDo(
    if (window.maxTimestamp.isBeforeNow())
        c.output(element)
))
.toJSON()
.toPubSub()

Without that filter, I get 60 windows per output, apparently because the Pub/Sub sink is not window-aware.
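The figure of 60 follows from the window geometry: with a 60-minute window that starts every minute, each element belongs to 60 overlapping windows. The following is a minimal stand-alone sketch of that assignment, assuming windows aligned to the epoch with no offset. The class and method names are hypothetical and this is not Beam's own code.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

final class SlidingWindowSketch {

    /**
     * Returns the start instants of every window [start, start + size)
     * that contains ts, newest window first.
     * Assumption: window starts are multiples of the period since the epoch.
     */
    static List<Instant> windowStartsFor(Instant ts, Duration size, Duration period) {
        long t = ts.toEpochMilli();
        long p = period.toMillis();
        long s = size.toMillis();
        // Latest period-aligned window start at or before the timestamp.
        long lastStart = t - Math.floorMod(t, p);
        List<Instant> starts = new ArrayList<>();
        // Walk backwards while the window still covers the timestamp.
        for (long start = lastStart; start > t - s; start -= p) {
            starts.add(Instant.ofEpochMilli(start));
        }
        return starts;
    }

    public static void main(String[] args) {
        Instant ts = Instant.parse("2018-02-20T10:15:30Z");
        List<Instant> starts =
            windowStartsFor(ts, Duration.ofMinutes(60), Duration.ofMinutes(1));
        System.out.println(starts.size() + " windows contain " + ts);
    }
}
```

For the example timestamp this reports 60 windows, so a sink that ignores windowing would see up to 60 results for the same key each time the panes fire.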

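So in the example below, if each time period is one minute, I would expect to see the unique count grow for up to 60 minutes, until the sliding window closes.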

Using the DirectRunner, I get the expected results:

t1: 5
t2: 10
t3: 15
...
tx: growing unique count

On Dataflow, I get strange results:

t1: 5
t2: 10
t3: 0
t4: 0
t5: 2
t6: 0
...
tx: wrong unique count

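However, if my unbounded source contains older data, I get normal-looking results until the pipeline catches up, at which point I start getting wrong results.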

I thought it had something to do with my window filter, but removing it did not change the results.

If I do a Distinct() followed by Count.perKey(), it works, but that slows my pipeline down considerably.

What am I overlooking?

Answer 1

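[Update from comments] ApproximateUnique inadvertently resets its accumulated value when the result is extracted. This is incorrect when the value is read more than once, as happens when a window fires multiple times. Fix (will be in version 2.4): https://github.com/apache/beam/pull/4688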
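To illustrate the kind of bug described, here is a simplified stand-alone sketch. It is not Beam's actual ApproximateUnique code: the class names are hypothetical, and an exact HashSet stands in for the approximate sketch structure. It only shows why an extract step with a side effect breaks accumulating panes that fire repeatedly.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class ResettingAccumulatorSketch {

    /** Buggy pattern: reading the result also clears the accumulated state. */
    static final class DrainingCounter {
        private final Set<String> seen = new HashSet<>();

        void add(String value) { seen.add(value); }

        long extract() {
            long count = seen.size();
            seen.clear(); // side effect: the next firing starts from nothing
            return count;
        }
    }

    /** Fixed pattern: reading the result leaves the accumulated state intact. */
    static final class StableCounter {
        private final Set<String> seen = new HashSet<>();

        void add(String value) { seen.add(value); }

        long extract() { return seen.size(); }
    }

    public static void main(String[] args) {
        DrainingCounter draining = new DrainingCounter();
        StableCounter stable = new StableCounter();
        // Each inner list is the new input arriving before one trigger firing.
        List<List<String>> firings = Arrays.asList(
            Arrays.asList("a", "b", "c", "d", "e"),
            Arrays.asList("f", "g", "h", "i", "j"),
            Collections.<String>emptyList());
        for (List<String> batch : firings) {
            for (String value : batch) {
                draining.add(value);
                stable.add(value);
            }
            System.out.println("draining=" + draining.extract()
                + " stable=" + stable.extract());
        }
    }
}
```

Across the three simulated firings the draining counter reports 5, 5, 0 while the stable one reports 5, 10, 10. The exact numbers seen on Dataflow depend on when the runner extracts the output, which may explain why the DirectRunner did not show the problem.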

Dataflow - ApproximateUnique on unbounded source
