Question

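I have a simple job (Apache Beam SDK for Java 2.2.0) that reads messages from a PubSub subscription, reads configuration from a side input, applies a transformation to the messages, and publishes the results to another PubSub topic.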

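The problem is that the number of outgoing messages does not equal the number of incoming messages. I inserted 15 million messages very quickly from another job (without manually specifying timestamps). The problem seems to be tied to the presence of the side input, because without it no messages are lost. In the Dataflow monitoring UI, we can see about 20,000 missing messages.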

Job ID on DataflowRunner: 2018-01-17_05_33_45-3290466857677892673

If I restart the same job, the number of lost messages is different.

I created simple code snippets to illustrate my problem.

Publisher

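import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.GenerateSequence;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

String PROJECT_ID = "...";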

PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
Pipeline p = Pipeline.create(options);

p
    .apply(GenerateSequence.from(0).to(15000000))
    .apply(MapElements.into(TypeDescriptors.strings()).via(Object::toString))
    .apply(PubsubIO.writeStrings().to("projects/" + PROJECT_ID + "/topics/test_in"));

p.run();

Listener

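import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.GenerateSequence;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.PCollectionView;

String PROJECT_ID = "...";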

PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
Pipeline p = Pipeline.create(options);

PCollectionView<Long> sideInput = p
    .apply(GenerateSequence.from(0).to(10))
    .apply(Count.globally())
    .apply(View.asSingleton());

p
    // 15,000,000 in input
    .apply(PubsubIO.readMessages().fromSubscription("projects/" + PROJECT_ID + "/subscriptions/test_in"))
    // Pass-through DoFn: the side input is attached below but never read
    .apply(ParDo.of(new DoFn<PubsubMessage, PubsubMessage>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
            c.output(c.element());
        }
    }).withSideInputs(sideInput))
    // 14,978,010 in output
    .apply(PubsubIO.writeMessages().to("projects/" + PROJECT_ID + "/topics/test_out"));

p.run();

Answer 1

The problem is most likely caused by late data dropping. You can work around it by setting a windowing strategy with infinite allowed lateness.
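To make the idea of late data dropping concrete, here is a minimal plain-Java sketch of the usual expiry rule: an element is dropped once the watermark has passed the end of its window plus the allowed lateness. This is a simplified model and not Beam code; the class name `LateDataModel`, the method `isDropped`, and the millisecond values are all hypothetical and chosen only for illustration.

```java
/** Simplified, hypothetical model of late-data dropping. This is NOT Beam code. */
final class LateDataModel {

    /**
     * Returns true when the window has expired, i.e. the watermark has moved
     * past (window end + allowed lateness). All arguments are epoch milliseconds
     * or millisecond durations; allowedLatenessMillis must be non-negative.
     */
    static boolean isDropped(long windowEndMillis, long allowedLatenessMillis, long watermarkMillis) {
        long expiry = windowEndMillis + allowedLatenessMillis;
        if (expiry < windowEndMillis) {
            // The addition overflowed: treat the window as never expiring.
            expiry = Long.MAX_VALUE;
        }
        return watermarkMillis > expiry;
    }

    public static void main(String[] args) {
        long windowEnd = 60_000L;  // hypothetical window end
        long watermark = 90_000L;  // watermark is already 30 s past the window end

        // Zero allowed lateness: the window has expired, so the element is dropped.
        System.out.println(isDropped(windowEnd, 0L, watermark));

        // Effectively infinite allowed lateness: the window never expires.
        System.out.println(isDropped(windowEnd, Long.MAX_VALUE, watermark));
    }
}
```

In the Beam Java SDK, allowed lateness is configured with `withAllowedLateness(Duration)` on a `Window` transform. Where exactly to apply it in the listener pipeline above is not stated in the answer, so treat any placement as something to verify against your own job.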

Apache Beam / Dataflow - PubSub lost messages
