Apache Beam TestStream with late elements

Time: 2017-09-14 22:00:33

Tags: apache-beam

I am experimenting with TestStream to see how late elements are handled, and I am getting some very interesting and confusing behavior.

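Specifically, I add an element "2" with a timestamp inside one window (windowTwo), then advance the watermark past the end of that window but not past endOfWindow + lateness, and finally add another element "3" whose timestamp is in the same window.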

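The interesting and confusing part is this: I expect the sum of all elements in windowTwo to be 5, but the test fails with

Expected: iterable over [<5>] in any order
     but: Not matched: <2>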

However, if I change the expected sum from 5 to 2, it still fails, this time with

Expected: iterable over [<2>] in any order
     but: Not matched: <5>

What is going on?

import org.apache.beam.sdk.coders.BigEndianIntegerCoder;
import org.apache.beam.sdk.testing.NeedsRunner;
import org.apache.beam.sdk.testing.PAssert;
import org.apache.beam.sdk.testing.TestPipeline;
import org.apache.beam.sdk.testing.TestStream;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.windowing.*;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TimestampedValue;
import org.joda.time.Duration;
import org.joda.time.Instant;
import org.junit.Rule;
import org.junit.Test;
import org.junit.experimental.categories.Category;

public class BeamAppTest {
    @Rule
    public final transient TestPipeline pipeline = TestPipeline.create();

    @Test
    @Category(NeedsRunner.class)
    public void testApp() {
        final Duration windowLengthMin = Duration.standardMinutes(10);
        final Duration latenessMin = Duration.standardMinutes(5);
        final Duration oneMin = Duration.standardMinutes(1);

        final Instant windowOneStart = new Instant(0L).plus(Duration.standardMinutes(20));
        final Instant windowOneEnd = windowOneStart.plus(windowLengthMin);
        final IntervalWindow windowOne = new IntervalWindow(windowOneStart, windowOneEnd);

        final Instant windowTwoStart = windowOneEnd;
        final Instant windowTwoEnd = windowTwoStart.plus(windowLengthMin);
        final IntervalWindow windowTwo = new IntervalWindow(windowTwoStart, windowTwoEnd);

        TestStream<Integer> testStream = TestStream.create(BigEndianIntegerCoder.of())
            .addElements(TimestampedValue.of(1, windowOneStart.plus(oneMin))) // early window one
            .advanceWatermarkTo(windowOneEnd)                                 // watermark passes window one
            .addElements(TimestampedValue.of(2, windowTwoStart.plus(oneMin))) // early window two
            .advanceWatermarkTo(windowTwoEnd.plus(latenessMin).minus(oneMin)) // watermark passes end of window two, still within allowed lateness
            .addElements(TimestampedValue.of(3, windowTwoStart.plus(oneMin))) // late window two
            .advanceProcessingTime(oneMin.plus(oneMin))
            .advanceWatermarkToInfinity();

        PCollection<Integer> means = pipeline.apply(testStream).apply(new CalSum(windowLengthMin, latenessMin));

        PAssert.that(means)
            .inWindow(windowOne)
            .containsInAnyOrder(1);

        PAssert.that(means)
            .inWindow(windowTwo)
            .containsInAnyOrder(2);  // change the 2 to 5 here to see magic!!!

        pipeline.run().waitUntilFinish();
    }

    static class CalSum extends PTransform<PCollection<Integer>, PCollection<Integer>> {
        private final Duration WINDOW_LENGTH_MIN;
        private final Duration LATENESS_MIN;

        CalSum(Duration windowLengthMin, Duration latenessMin) {
            WINDOW_LENGTH_MIN = windowLengthMin;
            LATENESS_MIN = latenessMin;
        }

        @Override
        public PCollection<Integer> expand(PCollection<Integer> input) {
            return input
                .apply(Window
                    .<Integer>into(FixedWindows.of(WINDOW_LENGTH_MIN))
                    .withAllowedLateness(LATENESS_MIN)
                    .accumulatingFiredPanes()  // accumulating mode: each pane includes earlier elements
                    .triggering(AfterWatermark.pastEndOfWindow()  // trigger at end of window
                        .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                            .plusDelayOf(Duration.standardMinutes(2)))  // trigger every 2 min within the window
                        .withLateFirings(AfterProcessingTime.pastFirstElementInPane()
                            .plusDelayOf(Duration.standardMinutes(1))))) // trigger every 1 min after the window
                .apply(Sum.integersGlobally().withoutDefaults());
        }
    }
}

1 Answer:

Answer 0 (score: 0)

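As written, given the arrival times of the elements and the watermark, windowTwo produces two elements: 2 and 5. This is a result of the triggering you have set up. The input 2 arrives with a timestamp of windowTwoStart plus one minute, at which point the watermark is before the end of windowTwo, so it is on time. The watermark then advances past the end of windowTwo, which causes the AfterWatermark trigger to fire.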

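After this, the input 3 arrives. The watermark has already passed the end of the window it belongs to (so the element is late), but it has not yet passed the end of the window plus the allowed lateness (so the element is not droppable). As a result, when the watermark advances again, the element is emitted together with the earlier 2 (because of the accumulating mode you selected), and the two are combined into the 5 you observe.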
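To make the on-time / late / droppable distinction concrete, here is a minimal sketch in plain Java. It is not Beam's implementation: the class LatenessSketch, the Status enum and the classify method are hypothetical names, times are expressed as minutes since the epoch, and the exact boundary handling (Beam works with the window's maximum timestamp) is simplified.

```java
import java.time.Duration;

public class LatenessSketch {
    // Hypothetical labels for the three cases described above.
    public enum Status { ON_TIME, LATE, DROPPABLE }

    // Simplified model (an assumption, not Beam code): compare the watermark at
    // arrival time with the end of the element's window and with the end of the
    // window plus the allowed lateness.
    public static Status classify(Duration windowEnd, Duration allowedLateness, Duration watermark) {
        if (watermark.compareTo(windowEnd) < 0) {
            return Status.ON_TIME;
        }
        if (watermark.compareTo(windowEnd.plus(allowedLateness)) < 0) {
            return Status.LATE;
        }
        return Status.DROPPABLE;
    }

    public static void main(String[] args) {
        Duration windowTwoEnd = Duration.ofMinutes(40); // windowTwo covers minutes 30 to 40
        Duration lateness = Duration.ofMinutes(5);

        // Element 2 arrives while the watermark is at windowOneEnd (minute 30).
        System.out.println(classify(windowTwoEnd, lateness, Duration.ofMinutes(30)));
        // Element 3 arrives after the watermark moved to windowTwoEnd + lateness - 1 min (minute 44).
        System.out.println(classify(windowTwoEnd, lateness, Duration.ofMinutes(44)));
    }
}
```

With the numbers from the question, the first call yields ON_TIME and the second yields LATE; had the watermark reached minute 45 before element 3 arrived, this model would classify it as DROPPABLE.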

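The on-time pane (which you can match with PAssert.that(means).inOnTimePane(windowTwo)) contains only the value 2; over the lifetime of the window, both 2 and 5 are produced, so the inWindow assertion is checked against [2, 5].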
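The pane contents can be illustrated with a small model in plain Java. This is only a sketch of how accumulating and discarding modes differ for a sum, not Beam code; PaneSketch and paneSums are hypothetical names, and each inner list stands for the elements that arrived between two trigger firings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PaneSketch {
    // Returns the sum emitted at each firing. In accumulating mode the running
    // total carries over between firings; in discarding mode it is reset, so
    // each pane only reflects the elements that arrived since the last firing.
    public static List<Integer> paneSums(List<List<Integer>> arrivalsPerFiring, boolean accumulating) {
        List<Integer> sums = new ArrayList<>();
        int running = 0;
        for (List<Integer> arrivals : arrivalsPerFiring) {
            if (!accumulating) {
                running = 0;
            }
            for (int value : arrivals) {
                running += value;
            }
            sums.add(running);
        }
        return sums;
    }

    public static void main(String[] args) {
        // windowTwo: element 2 before the on-time firing, element 3 before the late firing.
        List<List<Integer>> windowTwo = Arrays.asList(Arrays.asList(2), Arrays.asList(3));
        System.out.println(paneSums(windowTwo, true));  // prints [2, 5]
        System.out.println(paneSums(windowTwo, false)); // prints [2, 3]
    }
}
```

Under this model the accumulating case matches what the test observes, so an assertion such as PAssert.that(means).inWindow(windowTwo).containsInAnyOrder(2, 5) should correspond to the pipeline as written, while expecting only 2 or only 5 leaves one pane unmatched.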