Fire a Dataflow trigger only once the pane is complete

Asked: 2017-07-18 13:51:48

Tags: google-cloud-dataflow

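Is there a way to set up a trigger that fires only once, when the pane is complete? By "complete" I mean specifically the point at which the watermark passes the end of the window plus any allowed lateness. I don't want any intermediate firings before that. I am currently trying to "fake" this behavior by setting .withAllowedLateness(Duration.standardHours(1), ClosingBehavior.FIRE_ALWAYS) and then filtering the results by checking if (c.pane().isLast()) { ... }. Or, more precisely: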

// Read from Pub/Sub, window into sessions, combine per key, then keep only the final pane.
Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());
p.apply(PubsubIO.Read.named("ReadFromPubsub").timestampLabel("myts").subscription(INPUT_TOPIC))
.apply("Window", Window.<String>into(Sessions.withGapDuration(Duration.standardMinutes(5)))
    .accumulatingFiredPanes()
    // Allow one hour of lateness and always emit a final pane when the window closes.
    .withAllowedLateness(Duration.standardHours(1), ClosingBehavior.FIRE_ALWAYS))
.apply("Combine", Combine.<String, Metric>perKey(Foo.Merge))
// Drop every pane except the last one for each window.
.apply(ParDo.named("FilterComplete").of(Foo.FilterComplete));

where FilterComplete is something like:

static final DoFn<String, String> FilterComplete = new DoFn<String, String>() {
  @Override
  public void processElement(ProcessContext c) {
    if(c.pane().isLast()){
      c.output(c.element());
    }
  }
};

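Although this approach seems to work, filtering out all the unused firings seems like a waste of resources. More importantly, if I leave the streaming job running for several days, it starts throwing java.lang.IllegalStateException: Garbage collection hold . . . exceptions, so I am looking for a way to refactor this.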
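To pin down the notion of "complete" used in this question, here is a minimal sketch of the condition in plain Java. It uses java.time rather than the Dataflow SDK, and the class name PaneCompleteness, the method isComplete, and the strict "after" comparison at the boundary are assumptions made for illustration only.

```java
import java.time.Duration;
import java.time.Instant;

// Illustration only: plain java.time types, not Dataflow SDK classes.
final class PaneCompleteness {

    // Returns true once the watermark has passed the end of the window
    // plus the allowed lateness. Treating the boundary as exclusive
    // (strictly after) is an assumption of this sketch.
    static boolean isComplete(Instant windowEnd, Duration allowedLateness, Instant watermark) {
        Instant expiry = windowEnd.plus(allowedLateness);
        return watermark.isAfter(expiry);
    }

    public static void main(String[] args) {
        Instant windowEnd = Instant.parse("2017-07-16T13:00:00Z"); // hypothetical window end
        Duration lateness = Duration.ofHours(1);                   // same value as in the pipeline

        // Watermark still inside the lateness period: not complete yet.
        System.out.println(isComplete(windowEnd, lateness, Instant.parse("2017-07-16T13:30:00Z")));
        // Watermark beyond window end + allowed lateness: complete.
        System.out.println(isComplete(windowEnd, lateness, Instant.parse("2017-07-16T14:00:01Z")));
    }
}
```

With FIRE_ALWAYS, the pane for which c.pane().isLast() returns true is the one expected to be emitted at roughly this point, which is why the filter above approximates a single firing per window.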

The full exception is below:

java.lang.IllegalStateException: Garbage collection hold 2017-07-16T14:55:43.999Z cannot be before input watermark 2017-07-16T15:34:15.000Z
at com.google.cloud.dataflow.worker.repackaged.com.google.common.base.Preconditions.checkState(Preconditions.java:199)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWatermarkHold.addGarbageCollectionHold(DataflowWatermarkHold.java:402)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWatermarkHold.addEndOfWindowOrGarbageCollectionHolds(DataflowWatermarkHold.java:279)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWatermarkHold.access$000(DataflowWatermarkHold.java:55)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWatermarkHold$1.read(DataflowWatermarkHold.java:534)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWatermarkHold$1.read(DataflowWatermarkHold.java:486)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowReduceFnRunner.onTrigger(DataflowReduceFnRunner.java:971)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowReduceFnRunner.emit(DataflowReduceFnRunner.java:902)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowReduceFnRunner.onTimers(DataflowReduceFnRunner.java:765)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowGABWViaWindowSetFn.processElement(DataflowGABWViaWindowSetFn.java:89)
at com.google.cloud.dataflow.sdk.util.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:49)
at com.google.cloud.dataflow.sdk.util.DoFnRunnerBase.processElement(DoFnRunnerBase.java:139)
at com.google.cloud.dataflow.sdk.util.LateDataDroppingDoFnRunner.processElement(LateDataDroppingDoFnRunner.java:67)
at com.google.cloud.dataflow.sdk.runners.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:188)
at com.google.cloud.dataflow.sdk.runners.worker.ForwardingParDoFn.processElement(ForwardingParDoFn.java:42)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerLoggingParDoFn.processElement(DataflowWorkerLoggingParDoFn.java:47)
at com.google.cloud.dataflow.sdk.util.common.worker.ParDoOperation.process(ParDoOperation.java:55)
at com.google.cloud.dataflow.sdk.util.common.worker.OutputReceiver.process(OutputReceiver.java:52)
at com.google.cloud.dataflow.sdk.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:221)
at com.google.cloud.dataflow.sdk.util.common.worker.ReadOperation.start(ReadOperation.java:182)
at com.google.cloud.dataflow.sdk.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:69)
at com.google.cloud.dataflow.sdk.runners.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:719)
at com.google.cloud.dataflow.sdk.runners.worker.StreamingDataflowWorker.access$600(StreamingDataflowWorker.java:95)
at com.google.cloud.dataflow.sdk.runners.worker.StreamingDataflowWorker$8.run(StreamingDataflowWorker.java:801)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
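The message reads as a failed precondition: the garbage collection hold must not be earlier than the input watermark. The sketch below replays that comparison in plain Java with the two timestamps from the trace. The class and method names are hypothetical, and the last method assumes that the hold is the window's maximum timestamp plus the allowed lateness (one hour in the pipeline above), which the post itself does not confirm.

```java
import java.time.Duration;
import java.time.Instant;

// Illustration only: replays the comparison suggested by the exception message.
final class GcHoldCheck {

    // Timestamps copied from the exception message.
    static final Instant GC_HOLD = Instant.parse("2017-07-16T14:55:43.999Z");
    static final Instant INPUT_WATERMARK = Instant.parse("2017-07-16T15:34:15.000Z");

    // The precondition as the message words it: the hold may not be before the watermark.
    static boolean holdIsValid(Instant gcHold, Instant inputWatermark) {
        return !gcHold.isBefore(inputWatermark);
    }

    // How far the watermark had already advanced past the hold.
    static Duration watermarkLead(Instant gcHold, Instant inputWatermark) {
        return Duration.between(gcHold, inputWatermark);
    }

    // ASSUMPTION: hold = window max timestamp + allowed lateness.
    static Instant impliedWindowMaxTimestamp(Instant gcHold, Duration allowedLateness) {
        return gcHold.minus(allowedLateness);
    }

    public static void main(String[] args) {
        System.out.println("hold valid: " + holdIsValid(GC_HOLD, INPUT_WATERMARK));
        System.out.println("watermark ahead of hold by: " + watermarkLead(GC_HOLD, INPUT_WATERMARK));
        System.out.println("implied window max timestamp: "
                + impliedWindowMaxTimestamp(GC_HOLD, Duration.ofHours(1)));
    }
}
```

Under that assumption, the pane was being emitted for a window whose expiry time had already been overtaken by the input watermark by a little over 38 minutes, which is the state the check rejects.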

0 answers:

No answers