Flink, event time: windows are processed too late

Date: 2019-11-29 16:07:56

Tags: apache-flink

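The following code receives messages on a socket, counts them over a 1-minute window sliding by 10 seconds, and then zips each input with the cached count.

Processing uses event time. The messages I receive carry the timestamp I want to use for processing.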
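As an illustration of the message format, here is a minimal sketch of turning one line into an event timestamp in epoch milliseconds. The class and method names (MessageTimestamp, eventTimeMillis) are made up for this example; it assumes lines of the form key;ISO-8601 timestamp with offset, as produced by date -Iseconds.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;

class MessageTimestamp {
    /** Parses "key;timestamp-with-offset" and returns the event time in epoch milliseconds. */
    static long eventTimeMillis(String line) {
        String[] split = line.split(";");
        // OffsetDateTime keeps the offset found in the message,
        // so the result does not depend on the JVM's default time zone.
        return OffsetDateTime.parse(split[1], DateTimeFormatter.ISO_OFFSET_DATE_TIME)
                .toInstant()
                .toEpochMilli();
    }

    public static void main(String[] args) {
        // The same instant written with two different offsets: both lines print the same value.
        System.out.println(eventTimeMillis("123;2019-11-29T16:03:44+01:00"));
        System.out.println(eventTimeMillis("123;2019-11-29T15:03:44Z"));
    }
}
```

Parsing into a LocalDateTime and then attaching the system zone gives the same result only when the message offset happens to match the machine's zone.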

This is close to a training exercise: Reveal

    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
    env.setParallelism(1);

    // Input
    SocketTextStreamFunction source = new SocketTextStreamFunction("localhost", 9092, "\n", 0);
    SingleOutputStreamOperator<Tuple2<String, Long>> input = env.addSource(source)
        .map(x -> {
            // Eg: 123;2019-11-29T16:03:44+01:00
            String[] split = x.split(";");
            // Parse as OffsetDateTime so the offset carried by the message is honored
            long timestamp = OffsetDateTime.parse(split[1], DateTimeFormatter.ISO_OFFSET_DATE_TIME)
                .toInstant().toEpochMilli();
            return new Tuple2<>(split[0], timestamp);
          });
    // Assign timestamp
    input = input.assignTimestampsAndWatermarks(
        new BoundedOutOfOrdernessTimestampExtractor<Tuple2<String, Long>>(Time.milliseconds(100)) {
          @Override
          public long extractTimestamp(Tuple2<String, Long> element) {
            return element.f1;
          }
        });
    input.print("Received");

    // Count the number of inputs in the last minute, sliding by 10s
    SingleOutputStreamOperator<Tuple2<String, Integer>> count = input
        .map(x -> new Tuple2<>(x.f0, 1))
        .keyBy(0)
        .timeWindow(Time.minutes(1), Time.seconds(10))
        .sum(1);
    count.print("Count");

    // Connect the input and the count
    SingleOutputStreamOperator<Tuple3<String, Long, Integer>> inputWithCount = input
        .keyBy(0)
        .connect(count.keyBy(0))
        .process(
            new CoProcessFunction<Tuple2<String, Long>, Tuple2<String, Integer>, Tuple3<String, Long, Integer>>() {
              private ValueState<Integer> countCache;

              @Override
              public void open(Configuration parameters) throws Exception {
                ValueStateDescriptor<Integer> desc = new ValueStateDescriptor<>("count", Integer.class);
                countCache = getRuntimeContext().getState(desc);
              }

              @Override
              public void processElement1(Tuple2<String, Long> value, Context ctx, Collector<Tuple3<String, Long, Integer>> out) throws Exception {
                Integer cached = countCache.value();
                if (cached == null) {
                  cached = 0;
                }
                out.collect(new Tuple3<>(value.f0, value.f1, cached));
              }

              @Override
              public void processElement2(Tuple2<String, Integer> value, Context ctx, Collector<Tuple3<String, Long, Integer>> out) throws Exception {
                countCache.update(value.f1);
              }
            });
    inputWithCount.print("Output");

    env.execute("Test");
    // Imports are omitted, and the map functions are written as lambdas for readability

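Now I send 2 lines, wait 20 seconds, and send another line. I expect a count of 0 for the first 2 inputs and a count of 2 for the third. My first expectation holds, but the second does not.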

# Start server:
ncat -lk --broker 9092
# Check what's received:
nc localhost 9092


# I run the Flink app, and use the following command
echo "123;$(date -Iseconds)" | nc 0.0.0.0 9092 ; \
echo "123;$(date -Iseconds)" | nc 0.0.0.0 9092 ; \
sleep 20s ; \
echo "123;$(date -Iseconds)" | nc 0.0.0.0 9092

I expected the count to have been processed before the 3rd element is output. Did I misunderstand event time, or is something wrong in my code?

3 answers:

Answer 0 (score: 2)

(This follows up on David Anderson's explanation and offers an additional solution; please read his post first.)

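If your example is close to your real data (large gaps between events), another option is to introduce some kind of idle timeout. For some use cases this is also the recommended way to deal with empty Kafka partitions.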

public static class BoundedOutOfOrdernessWithTimeoutTimestampExtractor
        implements AssignerWithPeriodicWatermarks<FakeKafkaRecord> {
    private static final long serialVersionUID = 1L;

    private final long maxOutOfOrderness;
    private final long idle;
    private long recordTimestamp;

    BoundedOutOfOrdernessWithTimeoutTimestampExtractor(Time maxOutOfOrderness, Time idle) {
        this.maxOutOfOrderness = maxOutOfOrderness.toMilliseconds();
        this.idle = idle.toMilliseconds();
    }

    @Nullable
    @Override
    public Watermark getCurrentWatermark() {
        return new Watermark(Math.max(recordTimestamp - maxOutOfOrderness, System.currentTimeMillis() - idle));
    }

    @Override
    public long extractTimestamp(FakeKafkaRecord record, long previousElementTimestamp) {
        return recordTimestamp = record.getTimestamp();
    }
}

The timestamp assigner is polled according to your watermark interval.

env.getConfig().setAutoWatermarkInterval(100);

If BoundedOutOfOrdernessWithTimeoutTimestampExtractor receives no events during the idle period, it advances the watermark accordingly. You may want to set idle to the same value as maxOutOfOrderness (100 ms).
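The watermark rule used by the extractor above can be tried out without Flink. The following is a minimal sketch, not the extractor itself: the class name IdleAwareWatermark and the injectable clock are assumptions made so the behavior can be reproduced deterministically.

```java
import java.util.function.LongSupplier;

class IdleAwareWatermark {
    private final long maxOutOfOrderness;
    private final long idle;
    private final LongSupplier clock;   // stands in for System.currentTimeMillis()
    private long recordTimestamp;       // timestamp of the latest record, 0 until one arrives

    IdleAwareWatermark(long maxOutOfOrderness, long idle, LongSupplier clock) {
        this.maxOutOfOrderness = maxOutOfOrderness;
        this.idle = idle;
        this.clock = clock;
    }

    void onEvent(long timestamp) {
        recordTimestamp = timestamp;
    }

    /** Same formula as getCurrentWatermark() above: event-based watermark or wall clock minus idle. */
    long currentWatermark() {
        return Math.max(recordTimestamp - maxOutOfOrderness, clock.getAsLong() - idle);
    }

    public static void main(String[] args) {
        long[] now = {1_000_000L};
        IdleAwareWatermark wm = new IdleAwareWatermark(100, 100, () -> now[0]);

        wm.onEvent(1_000_000L);
        System.out.println(wm.currentWatermark()); // 999900, driven by the event

        now[0] += 20_000;                          // 20 s pass without any event
        System.out.println(wm.currentWatermark()); // 1019900, driven by the wall clock
    }
}
```

Because the second term comes from the wall clock, this only behaves sensibly when event timestamps are close to the current time. When replaying old data, the wall-clock term would be far ahead of the event timestamps, and the records would likely be treated as late.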

Answer 1 (score: 1)

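The problem is that you haven't done anything to guarantee that the count has been processed before the 3rd element is emitted. In fact, it almost certainly won't be.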

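The reason is that the current watermark cannot advance far enough to trigger the window until the third event arrives. It doesn't matter that you waited 20 seconds of wall-clock time. What matters is that no events passed through the timestamp extractor, so there was no basis for advancing the watermark.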
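To make the timing concrete, here is a small Flink-free sketch of the arithmetic. It assumes the settings from the question (1-minute windows sliding by 10 seconds, 100 ms out-of-orderness), a watermark equal to the largest timestamp seen minus the bound, and a window firing once the watermark reaches the window end minus 1 ms. The class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

class WatermarkDemo {
    static final long SIZE = 60_000;   // window size in ms
    static final long SLIDE = 10_000;  // window slide in ms
    static final long BOUND = 100;     // max out-of-orderness in ms

    /** End timestamps (exclusive) of all sliding windows containing ts; assumes ts >= 0. */
    static List<Long> windowEnds(long ts) {
        List<Long> ends = new ArrayList<>();
        long lastStart = ts - (ts % SLIDE);
        for (long start = lastStart; start > ts - SIZE; start -= SLIDE) {
            ends.add(start + SIZE);
        }
        return ends;
    }

    /** Number of windows containing ts that the given watermark has already triggered. */
    static long firedWindows(long ts, long watermark) {
        return windowEnds(ts).stream().filter(end -> watermark >= end - 1).count();
    }

    public static void main(String[] args) {
        long first = 1_575_640_032_000L;  // event time of the first two events
        long third = first + 20_000;      // event time of the third event

        long wmBeforeThird = first - BOUND; // best possible watermark while waiting
        long wmAfterThird = third - BOUND;  // watermark derived from the third event

        System.out.println(windowEnds(first).size());            // 6 windows contain the event
        System.out.println(firedWindows(first, wmBeforeThird));  // 0 windows have fired
        System.out.println(firedWindows(first, wmAfterThird));   // 2 windows have fired
    }
}
```

However long the job waits in wall-clock time, the watermark stays at the first value, so no window fires. The only watermark that can trigger a window is the one derived from the third event, and it is emitted after that event.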

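Furthermore, BoundedOutOfOrdernessTimestampExtractor is a periodic watermark generator that, by default, creates a new watermark only once every 200 ms. This means your 3rd event will most likely be processed by BoundedOutOfOrdernessTimestampExtractor before the watermark that triggers the window has been generated.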

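If you switch to a punctuated watermark generator, you can get more deterministic watermarking, but the watermark will still follow the 3rd event, so it still won't produce the result you expect.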

Answer 2 (score: 1)

Thank you, David and Arvid!

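What I had not understood until now is that watermarks are generated when events enter the system (whereas processing time is automatic and follows the server clock). So once the source goes idle, nothing more happens. This is exactly what the docs say, but I missed it.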

In the following specific case, I get the output I expected:

echo "123;$(date -Iseconds)" | nc 0.0.0.0 9092 ; \
echo "123;$(date -Iseconds)" | nc 0.0.0.0 9092 ; \
sleep 20s ; \
echo "456;$(date -Iseconds)" | nc 0.0.0.0 9092 ; \
sleep 1s ; \
echo "123;$(date -Iseconds)" | nc 0.0.0.0 9092
Received> (123,1575640032000)
Received> (123,1575640032000)
Output> (123,1575640032000,0)
Output> (123,1575640032000,0)
... # 20s later
Received> (456,1575640052000)
Output> (456,1575640052000,0)
Count> (123,2)
Count> (123,2)
... # 1s later
Received> (123,1575640053000)
Output> (123,1575640053000,2)

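I realized that my output can vary depending on whether other events arrive. In my use case I expect continuous input, but I want the behavior to be stable.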

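With your watermark function, Arvid, I get the behavior I want, thank you. I am still not sure whether I will be able to replay a batch of input. I think so, but I will keep an eye on Watermark and EventTime.
Since this is not a bundled feature, it makes me wonder whether I am using Flink the wrong way.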

For reference, here is the code I ended up with.

    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
    env.setParallelism(1);

    // Input
    SocketTextStreamFunction source = new SocketTextStreamFunction("localhost", 9092, "\n", 0);
    SingleOutputStreamOperator<Tuple2<String, Long>> input = env.addSource(source)
      .map(new MapFunction<String, Tuple2<String, Long>>() {
        @Override
        public Tuple2<String, Long> map(String value) throws Exception {
          // Eg: 123;2019-11-29T16:03:44+01:00
          String[] split = value.split(";");
          // Parse as OffsetDateTime so the offset carried by the message is honored
          long timestamp = OffsetDateTime.parse(split[1], DateTimeFormatter.ISO_OFFSET_DATE_TIME)
              .toInstant().toEpochMilli();
          return new Tuple2<>(split[0], timestamp);
        }
      });
    // Assign timestamp
    input = input.assignTimestampsAndWatermarks(new BoundedOutOfOrdernessWithTimeoutTimestampExtractor(Time.milliseconds(10), Time.milliseconds(10)));
    input.print("Received");

    // Count the number of inputs in the last minute, sliding by 10s
    SingleOutputStreamOperator<Tuple2<String, Integer>> count = input
      .map(new MapFunction<Tuple2<String, Long>, Tuple2<String, Integer>>() {
        @Override
        public Tuple2<String, Integer> map(Tuple2<String, Long> x) throws Exception {
          return new Tuple2<>(x.f0, 1);
        }
      })
      .keyBy(0)
      .timeWindow(Time.minutes(1), Time.seconds(10))
      .sum(1);
    count.print("Count");

    // Connect the input and the count
    SingleOutputStreamOperator<Tuple3<String, Long, Integer>> inputWithCount = input
      .keyBy(0)
      .connect(count.keyBy(0))
      .process(
        new CoProcessFunction<Tuple2<String, Long>, Tuple2<String, Integer>, Tuple3<String, Long, Integer>>() {
          private ValueState<Integer> countCache;
          private long previousCountTimestamp;

          @Override
          public void open(Configuration parameters) throws Exception {
            ValueStateDescriptor<Integer> desc = new ValueStateDescriptor<>("count", Integer.class);
            countCache = getRuntimeContext().getState(desc);
          }

          @Override
          public void processElement1(Tuple2<String, Long> input, Context ctx,
            Collector<Tuple3<String, Long, Integer>> out) throws Exception {
            Integer cached = countCache.value();
            if (cached == null) {
              cached = 0;
            }
            out.collect(new Tuple3<>(input.f0, input.f1, cached));
          }

          @Override
          public void processElement2(Tuple2<String, Integer> count, Context ctx,
            Collector<Tuple3<String, Long, Integer>> out) throws Exception {
            countCache.update(count.f1);

            ctx.timerService().deleteEventTimeTimer(previousCountTimestamp);
            previousCountTimestamp = ctx.timestamp() + Time.minutes(1).toMilliseconds();
            ctx.timerService().registerEventTimeTimer(previousCountTimestamp);
          }

          @Override
          public void onTimer(long timestamp, OnTimerContext ctx, Collector<Tuple3<String, Long, Integer>> out)
            throws Exception {
            System.out.println("Cache expires");
            countCache.clear();
          }
        });
    inputWithCount.print("Output");

    env.execute("Test");

As an aside, I had to make the cache expire.
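The expiry logic can be sketched outside Flink as follows. This is an illustrative model only: ExpiringCountCache and its methods are hypothetical names, the 1-minute lifetime mirrors the timer in the code above, and it assumes an event-time timer fires once the watermark reaches the timer's timestamp. Unlike the previousCountTimestamp instance field above, which is shared by every key handled by the same operator instance, the sketch keeps one deadline per key.

```java
import java.util.HashMap;
import java.util.Map;

class ExpiringCountCache {
    private static final long TTL = 60_000;  // cached counts live for 1 minute of event time

    private final Map<String, Integer> counts = new HashMap<>();
    private final Map<String, Long> deadlines = new HashMap<>();

    /** Stores a new count for the key and moves that key's expiry deadline forward. */
    void update(String key, int count, long eventTime) {
        counts.put(key, count);
        deadlines.put(key, eventTime + TTL);
    }

    /** Drops every cached count whose deadline has been reached by the watermark. */
    void onWatermark(long watermark) {
        deadlines.entrySet().removeIf(entry -> {
            if (entry.getValue() <= watermark) {
                counts.remove(entry.getKey());
                return true;
            }
            return false;
        });
    }

    /** Returns the cached count, or 0 when nothing is cached for the key. */
    int lookup(String key) {
        return counts.getOrDefault(key, 0);
    }

    public static void main(String[] args) {
        ExpiringCountCache cache = new ExpiringCountCache();
        cache.update("123", 2, 1_000);

        cache.onWatermark(60_999);
        System.out.println(cache.lookup("123")); // 2, deadline 61000 not reached yet

        cache.onWatermark(61_000);
        System.out.println(cache.lookup("123")); // 0, entry expired
    }
}
```

In a keyed Flink function, the per-key deadline would normally be held in keyed state (for example a ValueState<Long>) rather than in a map or an instance field.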

Output:

echo "123;$(date -Iseconds)" | nc 0.0.0.0 9092 ; \
echo "123;$(date -Iseconds)" | nc 0.0.0.0 9092 ; \
sleep 20s ; \
echo "456;$(date -Iseconds)" | nc 0.0.0.0 9092 ; \
sleep 1s ; \
echo "123;$(date -Iseconds)" | nc 0.0.0.0 9092
Received> (123,1575641582000)
Received> (123,1575641582000)
Output> (123,1575641582000,0)
Output> (123,1575641582000,0)
... # few s later
Count> (123,2)
... # 10s later
Count> (123,2)
... # few s later
Received> (456,1575641602000)
Output> (456,1575641602000,0)
Received> (123,1575641603000)
Output> (123,1575641603000,2)
... # few s later
Count> (123,3)
Count> (456,1)
... # 10s later
Count> (456,1)
Count> (123,3)
... # 10s later
Count> (456,1)
Count> (123,3)
... # 10s later
Count> (123,3)
Count> (456,1)
...