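RxJava and Reactor have the concept of virtual time for testing operators that depend on time. I can't figure out how to do this in Flink. For example, I put together the following example, in which I want to play with late-arriving events to understand how they are handled. However, I can't work out what a test for this would look like. Is there a way to combine Flink and Reactor to make the tests better?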
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.OutputTag;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PlayWithFlink {

    public static void main(String[] args) throws Exception {
        // Events rejected by the window as too late are routed to this side output.
        final OutputTag<MyEvent> lateOutputTag = new OutputTag<MyEvent>("late-data"){};

        // TODO understand how BoundedOutOfOrderness is related to allowedLateness
        BoundedOutOfOrdernessTimestampExtractor<MyEvent> eventTimeFunction = new BoundedOutOfOrdernessTimestampExtractor<MyEvent>(Time.seconds(10)) {
            @Override
            public long extractTimestamp(MyEvent element) {
                return element.getEventTime();
            }
        };

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        DataStream<MyEvent> events = env.fromCollection(MyEvent.examples())
                .assignTimestampsAndWatermarks(eventTimeFunction);

        // Collects the events of a session into one list per tracing id.
        AggregateFunction<MyEvent, MyAggregate, MyAggregate> aggregateFn = new AggregateFunction<MyEvent, MyAggregate, MyAggregate>() {
            @Override
            public MyAggregate createAccumulator() {
                return new MyAggregate();
            }

            @Override
            public MyAggregate add(MyEvent myEvent, MyAggregate myAggregate) {
                if (myEvent.getTracingId().equals("trace1")) {
                    myAggregate.getTrace1().add(myEvent);
                    return myAggregate;
                }
                myAggregate.getTrace2().add(myEvent);
                return myAggregate;
            }

            @Override
            public MyAggregate getResult(MyAggregate myAggregate) {
                return myAggregate;
            }

            @Override
            public MyAggregate merge(MyAggregate myAggregate, MyAggregate acc1) {
                acc1.getTrace1().addAll(myAggregate.getTrace1());
                acc1.getTrace2().addAll(myAggregate.getTrace2());
                return acc1;
            }
        };

        KeySelector<MyEvent, String> keyFn = new KeySelector<MyEvent, String>() {
            @Override
            public String getKey(MyEvent myEvent) throws Exception {
                return myEvent.getTracingId();
            }
        };

        SingleOutputStreamOperator<MyAggregate> result = events
                .keyBy(keyFn)
                .window(EventTimeSessionWindows.withGap(Time.seconds(10)))
                .allowedLateness(Time.seconds(20))
                .sideOutputLateData(lateOutputTag)
                .aggregate(aggregateFn);

        DataStream<MyEvent> lateStream = result.getSideOutput(lateOutputTag);

        result.print("SessionData");
        lateStream.print("LateData");

        env.execute();
    }
}
class MyEvent {
    private final String tracingId;
    private final Integer count;
    private final long eventTime;

    public MyEvent(String tracingId, Integer count, long eventTime) {
        this.tracingId = tracingId;
        this.count = count;
        this.eventTime = eventTime;
    }

    public String getTracingId() {
        return tracingId;
    }

    public Integer getCount() {
        return count;
    }

    public long getEventTime() {
        return eventTime;
    }

    public static List<MyEvent> examples() {
        long now = System.currentTimeMillis();
        MyEvent e1 = new MyEvent("trace1", 1, now);
        MyEvent e2 = new MyEvent("trace2", 1, now);
        MyEvent e3 = new MyEvent("trace2", 1, now - 1000);
        MyEvent e4 = new MyEvent("trace1", 1, now - 200);
        MyEvent e5 = new MyEvent("trace1", 1, now - 50000);
        return Arrays.asList(e1, e2, e3, e4, e5);
    }

    @Override
    public String toString() {
        return "MyEvent{" +
                "tracingId='" + tracingId + '\'' +
                ", count=" + count +
                ", eventTime=" + eventTime +
                '}';
    }
}
class MyAggregate {
    private final List<MyEvent> trace1 = new ArrayList<>();
    private final List<MyEvent> trace2 = new ArrayList<>();

    public List<MyEvent> getTrace1() {
        return trace1;
    }

    public List<MyEvent> getTrace2() {
        return trace2;
    }

    @Override
    public String toString() {
        return "MyAggregate{" +
                "trace1=" + trace1 +
                ", trace2=" + trace2 +
                '}';
    }
}
The output of running this is:
SessionData:1> MyAggregate{trace1=[], trace2=[MyEvent{tracingId='trace2', count=1, eventTime=1551034666081}, MyEvent{tracingId='trace2', count=1, eventTime=1551034665081}]}
SessionData:3> MyAggregate{trace1=[MyEvent{tracingId='trace1', count=1, eventTime=1551034166081}], trace2=[]}
SessionData:3> MyAggregate{trace1=[MyEvent{tracingId='trace1', count=1, eventTime=1551034666081}, MyEvent{tracingId='trace1', count=1, eventTime=1551034665881}], trace2=[]}
But I expected to see the lateStream fire for event e5, whose timestamp is 50 seconds earlier than that of the first event.
Answer 0 (score: 1)
If you modify your watermark assigner to look like this:
AssignerWithPunctuatedWatermarks<MyEvent> eventTimeFunction = new AssignerWithPunctuatedWatermarks<MyEvent>() {
    // Largest event timestamp seen so far.
    long maxTs = 0;

    @Override
    public long extractTimestamp(MyEvent myEvent, long l) {
        long ts = myEvent.getEventTime();
        if (ts > maxTs) {
            maxTs = ts;
        }
        return ts;
    }

    @Override
    public Watermark checkAndGetNextWatermark(MyEvent event, long extractedTimestamp) {
        // Emit a watermark after every event, trailing the maximum timestamp by 10 seconds.
        return new Watermark(maxTs - 10000);
    }
};
then you will get the result you expect. I am not recommending this; I am only using it to illustrate what is going on.
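What is happening here is that BoundedOutOfOrdernessTimestampExtractor is a periodic watermark generator, which inserts a watermark into the stream only every 200 ms (by default). Because your job finishes long before then, the only watermark it ever sees is the one Flink injects at the end of every finite stream (with the value MAX_WATERMARK). Lateness is relative to watermarks, and the event you intended to be late manages to arrive before that watermark.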
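By switching to punctuated watermarks, you can force watermarks to occur more often, or more precisely at specific points in the stream. This is generally unnecessary (and watermarking too frequently causes overhead), but it is useful when you want tight control over the sequencing of watermarks.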
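To make the difference concrete, here is a minimal sketch in plain Java. It is a toy model, not Flink code: the class WatermarkLatenessSketch, the method lateEvents, and the fixed timestamps are made up for illustration. It treats every event as its own session window and ignores keys and window merging, and it assumes an element counts as late once the watermark has reached the last timestamp of its window plus the allowed lateness.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of watermark-driven lateness. Hypothetical helper, not part of Flink. */
final class WatermarkLatenessSketch {
    static final long OUT_OF_ORDERNESS_MS = 10_000; // bound given to the timestamp extractor
    static final long SESSION_GAP_MS = 10_000;      // EventTimeSessionWindows.withGap(...)
    static final long ALLOWED_LATENESS_MS = 20_000; // allowedLateness(...)

    /**
     * Returns the indexes of the events that would be treated as late.
     * punctuated == true:  a watermark (maxTs - bound) follows every event.
     * punctuated == false: no watermark arrives before the input ends, which models
     *                      a periodic generator in a job that finishes within 200 ms.
     */
    static List<Integer> lateEvents(long[] timestamps, boolean punctuated) {
        List<Integer> late = new ArrayList<>();
        long watermark = Long.MIN_VALUE; // no watermark seen yet
        long maxTs = Long.MIN_VALUE;
        for (int i = 0; i < timestamps.length; i++) {
            long ts = timestamps[i];
            // Assumption: a single-event session covers [ts, ts + gap) and its state
            // is dropped once the watermark reaches (windowEnd - 1) + allowedLateness.
            long cleanupTime = ts + SESSION_GAP_MS - 1 + ALLOWED_LATENESS_MS;
            if (cleanupTime <= watermark) {
                late.add(i);
            }
            maxTs = Math.max(maxTs, ts);
            if (punctuated) {
                // The watermark is emitted after the event and never moves backwards.
                watermark = Math.max(watermark, maxTs - OUT_OF_ORDERNESS_MS);
            }
        }
        return late;
    }

    public static void main(String[] args) {
        // Same shape as MyEvent.examples(), with a fixed "now" of 100_000 ms.
        long[] ts = {100_000L, 100_000L, 99_000L, 99_800L, 50_000L};
        System.out.println("punctuated: " + lateEvents(ts, true));
        System.out.println("periodic:   " + lateEvents(ts, false));
    }
}
```

With these sample timestamps, the punctuated model reports only the last event (the counterpart of e5) as late, while the periodic model reports none, which matches the behavior described above.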
As for how to write tests, you might take a look at the test harnesses used in Flink's own tests, or at flink-spector.
Update:
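The time interval associated with BoundedOutOfOrdernessTimestampExtractor is a specification of how out of order the stream is expected to be. Events that arrive within this bound are not considered late, and event-time timers will not fire until this delay has elapsed, which gives out-of-order events time to arrive. allowedLateness applies only to the window API. It describes how long past the normal window firing time the framework keeps the window state around, so that events can still be added to the window and cause late firings. After this additional interval, the window state is cleared and subsequent events are sent to the side output (if one is configured).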
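The three outcomes described above can be sketched as a small classifier in plain Java. This is a simplified model under stated assumptions, not Flink's implementation: WindowLatenessClassifier, Outcome, and classify are hypothetical names, and the sketch assumes a window fires once the watermark reaches its last timestamp (window end minus 1 ms) and is cleared once the watermark reaches that timestamp plus the allowed lateness.

```java
/** Toy classifier for an event arriving at a window. Hypothetical helper, not part of Flink. */
final class WindowLatenessClassifier {
    enum Outcome {
        ON_TIME,      // the window has not fired yet; the event joins the normal firing
        LATE_FIRING,  // the window already fired but its state is kept; the event triggers a late firing
        SIDE_OUTPUT   // the window state was cleared; the event goes to the side output, if configured
    }

    static Outcome classify(long windowEndMs, long allowedLatenessMs, long currentWatermarkMs) {
        long maxTimestamp = windowEndMs - 1; // last timestamp that belongs to the window
        if (currentWatermarkMs < maxTimestamp) {
            return Outcome.ON_TIME;
        }
        if (currentWatermarkMs < maxTimestamp + allowedLatenessMs) {
            return Outcome.LATE_FIRING;
        }
        return Outcome.SIDE_OUTPUT;
    }

    public static void main(String[] args) {
        // Window [0, 10_000) with 20 s of allowed lateness, at three watermark positions.
        System.out.println(classify(10_000, 20_000, 5_000));
        System.out.println(classify(10_000, 20_000, 15_000));
        System.out.println(classify(10_000, 20_000, 35_000));
    }
}
```

Under this model the two settings add up: with a 10-second out-of-orderness bound the watermark trails the largest timestamp seen by 10 seconds, so an event for the window [0, 10 000) is only diverted to the side output after an event with a timestamp of at least 39 999 ms has been seen.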
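So when you use BoundedOutOfOrdernessTimestampExtractor<MyEvent>(Time.seconds(10)), you are not saying "wait 10 seconds after every event in case earlier events might still arrive". You are saying that your events should be at most 10 seconds out of order. So if you are processing a live, real-time event stream, you will wait at most 10 seconds in case earlier events arrive. (And if you are processing historic data, you may be able to process 10 seconds of data in 1 second, or not. Knowing that you will wait for n seconds of event time to pass says nothing about how long that will actually take.)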
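As a rough illustration of event time advancing with the data rather than with the wall clock, the following plain-Java sketch finds the event whose arrival would let an event-time timer fire. It is not Flink code: EventTimeTimerSketch and firingIndex are hypothetical names, and the sketch assumes the watermark equals the largest timestamp seen so far minus the out-of-orderness bound, and that a timer fires once the watermark reaches the timer's timestamp.

```java
/** Toy model of when an event-time timer fires. Hypothetical helper, not part of Flink. */
final class EventTimeTimerSketch {
    /**
     * Returns the index of the first event whose arrival moves the watermark
     * (max timestamp seen minus boundMs) to timerMs or beyond, or -1 if none does.
     */
    static int firingIndex(long[] timestamps, long timerMs, long boundMs) {
        long maxTs = Long.MIN_VALUE;
        for (int i = 0; i < timestamps.length; i++) {
            maxTs = Math.max(maxTs, timestamps[i]);
            if (maxTs - boundMs >= timerMs) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long[] ts = {0L, 5_000L, 9_000L, 12_000L, 20_000L, 21_000L};
        // A timer at 10 s with a 10 s bound needs an event stamped 20 s or later.
        System.out.println(firingIndex(ts, 10_000L, 10_000L));
    }
}
```

Nothing in this model refers to the wall clock: the loop runs in microseconds while covering 21 seconds of event time, which is the situation described for historic data.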
For more on this topic, see Event Time and Watermarks.