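Is there a concept of virtual time in Apache Flink tests, as in Reactor and RxJava?

Time: 2019-02-24 18:55:16

Tags: java rx-java apache-flink flink-streaming project-reactor

In RxJava and Reactor there is a concept of virtual time for testing operators that depend on time. I can't figure out how to do this in Flink. For example, I put together the following example, in which I want to play with late-arriving events to understand how they are handled. However, I can't see what a test for this would look like. Is there a way to combine Flink and Reactor to make the tests better?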

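import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.OutputTag;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PlayWithFlink {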

    public static void main(String[] args) throws Exception {

        final OutputTag<MyEvent> lateOutputTag = new OutputTag<MyEvent>("late-data"){};

        // TODO understand how BoundedOutOfOrderness is related to allowedLateness
        BoundedOutOfOrdernessTimestampExtractor<MyEvent> eventTimeFunction = new BoundedOutOfOrdernessTimestampExtractor<MyEvent>(Time.seconds(10)) {
            @Override
            public long extractTimestamp(MyEvent element) {
                return element.getEventTime();
            }
        };

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        DataStream<MyEvent> events = env.fromCollection(MyEvent.examples())
                .assignTimestampsAndWatermarks(eventTimeFunction);

        AggregateFunction<MyEvent, MyAggregate, MyAggregate> aggregateFn = new AggregateFunction<MyEvent, MyAggregate, MyAggregate>() {
            @Override
            public MyAggregate createAccumulator() {
                return new MyAggregate();
            }

            @Override
            public MyAggregate add(MyEvent myEvent, MyAggregate myAggregate) {
                if (myEvent.getTracingId().equals("trace1")) {
                    myAggregate.getTrace1().add(myEvent);
                    return myAggregate;
                }
                myAggregate.getTrace2().add(myEvent);
                return myAggregate;
            }

            @Override
            public MyAggregate getResult(MyAggregate myAggregate) {
                return myAggregate;
            }

            @Override
            public MyAggregate merge(MyAggregate myAggregate, MyAggregate acc1) {
                acc1.getTrace1().addAll(myAggregate.getTrace1());
                acc1.getTrace2().addAll(myAggregate.getTrace2());
                return acc1;
            }
        };

        KeySelector<MyEvent, String> keyFn = new KeySelector<MyEvent, String>() {
            @Override
            public String getKey(MyEvent myEvent) throws Exception {
                return myEvent.getTracingId();
            }
        };

        SingleOutputStreamOperator<MyAggregate> result = events
                .keyBy(keyFn)
                .window(EventTimeSessionWindows.withGap(Time.seconds(10)))
                .allowedLateness(Time.seconds(20))
                .sideOutputLateData(lateOutputTag)
                .aggregate(aggregateFn);


        // Late elements keep their original type, so the side output is a stream of MyEvent.
        DataStream<MyEvent> lateStream = result.getSideOutput(lateOutputTag);

        result.print("SessionData");

        lateStream.print("LateData");

        env.execute();
    }
}

class MyEvent {
    private final String tracingId;
    private final Integer count;
    private final long eventTime;

    public MyEvent(String tracingId, Integer count, long eventTime) {
        this.tracingId = tracingId;
        this.count = count;
        this.eventTime = eventTime;
    }

    public String getTracingId() {
        return tracingId;
    }

    public Integer getCount() {
        return count;
    }

    public long getEventTime() {
        return eventTime;
    }

    public static List<MyEvent> examples() {
        long now = System.currentTimeMillis();
        MyEvent e1 = new MyEvent("trace1", 1, now);
        MyEvent e2 = new MyEvent("trace2", 1, now);
        MyEvent e3 = new MyEvent("trace2", 1, now - 1000);
        MyEvent e4 = new MyEvent("trace1", 1, now - 200);
        MyEvent e5 = new MyEvent("trace1", 1, now - 50000);
        return Arrays.asList(e1,e2,e3,e4, e5);
    }

    @Override
    public String toString() {
        return "MyEvent{" +
                "tracingId='" + tracingId + '\'' +
                ", count=" + count +
                ", eventTime=" + eventTime +
                '}';
    }
}

class MyAggregate {
    private final List<MyEvent> trace1 = new ArrayList<>();
    private final List<MyEvent> trace2 = new ArrayList<>();


    public List<MyEvent> getTrace1() {
        return trace1;
    }

    public List<MyEvent> getTrace2() {
        return trace2;
    }

    @Override
    public String toString() {
        return "MyAggregate{" +
                "trace1=" + trace1 +
                ", trace2=" + trace2 +
                '}';
    }
}

Running this produces the following output:

SessionData:1> MyAggregate{trace1=[], trace2=[MyEvent{tracingId='trace2', count=1, eventTime=1551034666081}, MyEvent{tracingId='trace2', count=1, eventTime=1551034665081}]}
SessionData:3> MyAggregate{trace1=[MyEvent{tracingId='trace1', count=1, eventTime=1551034166081}], trace2=[]}
SessionData:3> MyAggregate{trace1=[MyEvent{tracingId='trace1', count=1, eventTime=1551034666081}, MyEvent{tracingId='trace1', count=1, eventTime=1551034665881}], trace2=[]}

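But I expected lateStream to fire for event e5, whose event time is 50 seconds earlier than that of the first event.

1 Answer:

Answer 0 (score: 1)

If you modify your watermark assigner to look like this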

AssignerWithPunctuatedWatermarks<MyEvent> eventTimeFunction = new AssignerWithPunctuatedWatermarks<MyEvent>() {
    long maxTs = 0;

    @Override
    public long extractTimestamp(MyEvent myEvent, long l) {
        long ts = myEvent.getEventTime();
        if (ts > maxTs) {
            maxTs = ts;
        }
        return ts;
    }

    @Override
    public Watermark checkAndGetNextWatermark(MyEvent event, long extractedTimestamp) {
        return new Watermark(maxTs - 10000);
    }
};

then you will get the results you expect. I'm not recommending this; I'm only using it to illustrate what is going on.

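What is happening here is that BoundedOutOfOrdernessTimestampExtractor is a periodic watermark generator, and it only inserts a watermark into the stream every 200 milliseconds (by default). Because your job finishes long before that, the only watermark your job ever sees is the one Flink injects at the end of every finite stream (with the value MAX_WATERMARK). Lateness is relative to watermarks, and the event you expected to be late manages to arrive before that watermark.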

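By switching to punctuated watermarks you can force watermarking to happen more often or, more precisely, at specific points in the stream. This is generally unnecessary (and watermarking too frequently causes overhead), but it is helpful when you want tight control over the sequencing of watermarks.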
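To make the difference concrete, here is a minimal plain-Java sketch of the timing, not Flink code. It assumes a simplified model in which each event is its own session window `[ts, ts + gap)` and window merging is ignored. The class and method names are hypothetical, and a fixed base timestamp stands in for `System.currentTimeMillis()` so the run is deterministic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model, not Flink code: every event is treated as its own
// session window [ts, ts + GAP_MS), and session merging is ignored.
class WatermarkTimingSketch {
    static final long GAP_MS = 10_000;              // session gap from the question
    static final long ALLOWED_LATENESS_MS = 20_000; // allowedLateness from the question
    static final long BOUND_MS = 10_000;            // out-of-orderness bound

    // An element is late when the last timestamp of its window plus the
    // allowed lateness is already at or behind the current watermark.
    static boolean isLate(long eventTs, long watermark) {
        long windowMaxTs = eventTs + GAP_MS - 1;
        return windowMaxTs + ALLOWED_LATENESS_MS <= watermark;
    }

    // Periodic generator in a job that ends before the first periodic tick:
    // the watermark is still Long.MIN_VALUE while the elements are processed.
    static List<Long> lateWithNoWatermarkBeforeEnd(List<Long> timestamps) {
        List<Long> late = new ArrayList<>();
        long watermark = Long.MIN_VALUE;
        for (long ts : timestamps) {
            if (isLate(ts, watermark)) {
                late.add(ts);
            }
        }
        return late;
    }

    // Punctuated generator: after every element the watermark moves to
    // (highest timestamp seen) - BOUND_MS, and it never moves backwards.
    static List<Long> lateWithWatermarkPerEvent(List<Long> timestamps) {
        List<Long> late = new ArrayList<>();
        long watermark = Long.MIN_VALUE;
        long maxTs = Long.MIN_VALUE;
        for (long ts : timestamps) {
            if (isLate(ts, watermark)) {
                late.add(ts);
            }
            maxTs = Math.max(maxTs, ts);
            watermark = Math.max(watermark, maxTs - BOUND_MS);
        }
        return late;
    }

    public static void main(String[] args) {
        long now = 1_000_000L; // fixed stand-in for System.currentTimeMillis()
        // Same offsets as e1..e5 in MyEvent.examples().
        List<Long> timestamps =
                Arrays.asList(now, now, now - 1_000, now - 200, now - 50_000);

        System.out.println("late, no watermark before end: "
                + lateWithNoWatermarkBeforeEnd(timestamps));
        System.out.println("late, watermark after each event: "
                + lateWithWatermarkPerEvent(timestamps));
    }
}
```

In this model only the last timestamp (the e5 offset) is classified as late, and only in the per-event variant, which matches the behaviour described above.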

As for how to write tests, you might take a look at the test harnesses used in Flink's own tests, or at flink-spector.

Update:

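The time interval associated with a BoundedOutOfOrdernessTimestampExtractor is a specification of how out-of-order the stream is expected to be. Events that arrive within this bound are not considered late, and event-time timers will not fire until this delay has elapsed, which gives out-of-order events time to arrive. allowedLateness applies only to the window API. It describes how long past the normal window firing time the framework keeps the window state, so that events can still be added to the window and cause late firings. After this additional interval, the window state is cleared and subsequent events are sent to the side output (if configured).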
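The three phases of a window described above can be sketched as a small classification function. This is a simplified plain-Java model rather than Flink's implementation. The class, enum, and method names are hypothetical, and the window end is assumed to be exclusive, so the window's last timestamp is `windowEnd - 1`.

```java
// Simplified model of a single event-time window's lifecycle, not Flink code.
class WindowLifecycleSketch {
    enum Outcome { ON_TIME, LATE_FIRING, SIDE_OUTPUT }

    // Decide what happens to an element that belongs to the window ending at
    // windowEnd (exclusive) when it arrives while the watermark has the given value.
    static Outcome classify(long windowEnd, long allowedLatenessMs, long watermark) {
        long windowMaxTs = windowEnd - 1;
        if (watermark < windowMaxTs) {
            // The window has not fired yet; the element joins the normal firing.
            return Outcome.ON_TIME;
        }
        if (watermark < windowMaxTs + allowedLatenessMs) {
            // The window already fired, but its state is still kept,
            // so the element is added and causes a late firing.
            return Outcome.LATE_FIRING;
        }
        // The window state has been cleared; the element goes to the side output.
        return Outcome.SIDE_OUTPUT;
    }

    public static void main(String[] args) {
        long windowEnd = 60_000;        // window covers [0, 60_000)
        long allowedLateness = 20_000;  // keep state for 20 more seconds of event time

        System.out.println(classify(windowEnd, allowedLateness, 55_000));
        System.out.println(classify(windowEnd, allowedLateness, 65_000));
        System.out.println(classify(windowEnd, allowedLateness, 85_000));
    }
}
```

The out-of-orderness bound only influences which watermark value is in effect when an element arrives; allowedLateness decides how far past the window's end that watermark may be before the element is diverted.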

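So when you use BoundedOutOfOrdernessTimestampExtractor<MyEvent>(Time.seconds(10)), you are not saying "wait 10 seconds after every event in case earlier events might still arrive". You are saying that your events should be at most 10 seconds out of order. If you are processing a live, real-time event stream, this means you will wait at most 10 seconds in case earlier events arrive. (If you are processing historic data, you may be able to process 10 seconds of data in 1 second, or you may not. Knowing that you will wait for n seconds of event time to pass says nothing about how long that will actually take.)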
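A small sketch of that idea, under the assumption that the watermark simply trails the highest timestamp seen so far by the configured bound. The class and method names are hypothetical and this is plain Java, not the Flink class itself. Note that nothing in it reads the wall clock, so event time advances only as fast as events are fed in.

```java
// Simplified stand-in for a bounded-out-of-orderness watermark generator.
class BoundedOutOfOrdernessSketch {
    private final long boundMs;
    private long maxTs = Long.MIN_VALUE;

    BoundedOutOfOrdernessSketch(long boundMs) {
        this.boundMs = boundMs;
    }

    // Record an event timestamp and return the watermark that results from it.
    long onEvent(long ts) {
        maxTs = Math.max(maxTs, ts);
        return maxTs - boundMs;
    }

    public static void main(String[] args) {
        BoundedOutOfOrdernessSketch generator = new BoundedOutOfOrdernessSketch(10_000);

        // An in-order event moves the watermark to 10 seconds behind it.
        System.out.println(generator.onEvent(100_000));
        // An event that is 5 seconds out of order does not move the watermark,
        // and no extra per-event waiting period is started.
        System.out.println(generator.onEvent(95_000));
        // A jump of 30 seconds in event time moves the watermark by 30 seconds
        // immediately, regardless of how much wall-clock time has passed.
        System.out.println(generator.onEvent(130_000));
    }
}
```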

For more on this topic, see Event Time and Watermarks.