Question

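The following code receives messages on a socket, counts them over a 1-minute window (sliding by 10 seconds), then zips each input with the cached count.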

Processing is done in event time. The messages I receive contain the timestamp I want to use for processing.

This is close to the training exercise:

    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
    env.setParallelism(1);

    // Input
    SocketTextStreamFunction source = new SocketTextStreamFunction("localhost", 9092, "\n", 0);
    SingleOutputStreamOperator<Tuple2<String, Long>> input = env.addSource(source)
        .map(x -> {
            // Eg: 123;2019-11-29T16:03:44+01:00
            String[] split = x.split(";");
            LocalDateTime ldt = LocalDateTime.parse(split[1], DateTimeFormatter.ISO_OFFSET_DATE_TIME);
            long timestamp = ldt.atZone(ZoneOffset.systemDefault()).toInstant().toEpochMilli();
            return new Tuple2<>(split[0], timestamp);
          });
    // Assign timestamp
    input = input.assignTimestampsAndWatermarks(
        new BoundedOutOfOrdernessTimestampExtractor<Tuple2<String, Long>>(Time.milliseconds(100)) {
          @Override
          public long extractTimestamp(Tuple2<String, Long> element) {
            return element.f1;
          }
        });
    input.print("Received");

    // Count the nb of input in the last minutes, sliding by 10s
    SingleOutputStreamOperator<Tuple2<String, Integer>> count = input
        .map(x -> new Tuple2<>(x.f0, 1))
        .keyBy(0)
        .timeWindow(Time.minutes(1), Time.seconds(10))
        .sum(1);
    count.print("Count");

    // Connect the input and the count
    SingleOutputStreamOperator inputWithCount = input
        .keyBy(0)
        .connect(count.keyBy(0))
        .process(
            new CoProcessFunction<Tuple2<String, Long>, Tuple2<String, Integer>, Tuple3<String, Long, Integer>>() {
              private ValueState<Integer> countCache;

              @Override
              public void open(Configuration parameters) throws Exception {
                ValueStateDescriptor<Integer> desc = new ValueStateDescriptor<>("count", Integer.class);
                countCache = getRuntimeContext().getState(desc);
              }

              @Override
              public void processElement1(Tuple2<String, Long> value, Context ctx, Collector<Tuple3<String, Long, Integer>> out) throws Exception {
                Integer cached = countCache.value();
                if (cached == null) {
                  cached = 0;
                }
                out.collect(new Tuple3<>(value.f0, value.f1, cached));
              }

              @Override
              public void processElement2(Tuple2<String, Integer> value, Context ctx, Collector<Tuple3<String, Long, Integer>> out) throws Exception {
                countCache.update(value.f1);
              }
            });
    inputWithCount.print("Output");

    env.execute("Test");
    // I did not include the import, and I pretty-print the Map function for clarity

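Now, when I send 2 lines, wait 20 seconds, and then send another line, I expect the first 2 inputs to have a count of 0 and the third input to have a count of 2. My first expectation holds, but not the second.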

# Start server:
ncat -lk --broker 9092
# Check what's received:
nc localhost 9092


# I run the Flink app, and use the following command
echo "123;$(date -Iseconds)" | nc 0.0.0.0 9092 ; \
echo "123;$(date -Iseconds)" | nc 0.0.0.0 9092 ; \
sleep 20s ; \
echo "123;$(date -Iseconds)" | nc 0.0.0.0 9092

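I expected the count to have been processed before the 3rd element is output. Have I misunderstood event time? Or is my code doing something wrong?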

Answer 1

(This follows up on David Anderson's explanation and offers an additional solution; please read his post first.)

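If your example is close to your real data (with large lags), you also have the option of introducing some kind of idle timeout. For some use cases, this is also the recommended way to deal with empty Kafka partitions.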

public static class BoundedOutOfOrdernessWithTimeoutTimestampExtractor
        implements AssignerWithPeriodicWatermarks<FakeKafkaRecord> {
    private static final long serialVersionUID = 1L;

    private final long maxOutOfOrderness;
    private final long idle;
    private long recordTimestamp;

    BoundedOutOfOrdernessWithTimeoutTimestampExtractor(Time maxOutOfOrderness, Time idle) {
        this.maxOutOfOrderness = maxOutOfOrderness.toMilliseconds();
        this.idle = idle.toMilliseconds();
    }

    @Nullable
    @Override
    public Watermark getCurrentWatermark() {
        return new Watermark(Math.max(recordTimestamp - maxOutOfOrderness, System.currentTimeMillis() - idle));
    }

    @Override
    public long extractTimestamp(FakeKafkaRecord record, long previousElementTimestamp) {
        return recordTimestamp = record.getTimestamp();
    }
}

The timestamp assigner is queried according to your watermark interval.

env.getConfig().setAutoWatermarkInterval(100);

If BoundedOutOfOrdernessWithTimeoutTimestampExtractor receives no events for the duration of idle, it advances the watermark accordingly. You may want to set idle to the same value as maxOutOfOrderness (100 ms).
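
To see how the two terms in getCurrentWatermark() interact, here is a minimal plain-Java sketch of the same max(...) rule, with no Flink dependency. The class name IdleTimeoutWatermarkSketch, the onRecord method, and the injected clock are assumptions made for illustration, so that the idle case can be shown without actually waiting; they are not part of the extractor above.

```java
import java.util.function.LongSupplier;

public class IdleTimeoutWatermarkSketch {
    private final long maxOutOfOrdernessMs;
    private final long idleMs;
    // Hypothetical stand-in for System::currentTimeMillis, injected so the demo is deterministic.
    private final LongSupplier clock;
    // Timestamp of the last record seen (0 until the first record, as in the extractor above).
    private long recordTimestamp;

    public IdleTimeoutWatermarkSketch(long maxOutOfOrdernessMs, long idleMs, LongSupplier clock) {
        this.maxOutOfOrdernessMs = maxOutOfOrdernessMs;
        this.idleMs = idleMs;
        this.clock = clock;
    }

    /** Remembers the event timestamp of the latest record. */
    public void onRecord(long timestamp) {
        recordTimestamp = timestamp;
    }

    /** Takes whichever is further ahead: the event-driven bound or the wall-clock bound. */
    public long currentWatermark() {
        return Math.max(recordTimestamp - maxOutOfOrdernessMs, clock.getAsLong() - idleMs);
    }

    public static void main(String[] args) {
        long[] now = {1_000_000L}; // fake wall clock, in milliseconds
        IdleTimeoutWatermarkSketch w = new IdleTimeoutWatermarkSketch(100, 100, () -> now[0]);

        w.onRecord(1_000_000L);
        System.out.println("right after the record: " + w.currentWatermark());

        now[0] += 20_000L; // 20 s pass and no record arrives
        System.out.println("after 20 s of silence:  " + w.currentWatermark());
    }
}
```

Because the second term follows the wall clock, this approach presumably only suits input whose timestamps are close to real time; records replayed from the past would likely arrive behind the watermark and be treated as late.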

Answer 2

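The problem is that you are doing nothing to guarantee that the count has been processed before the 3rd element is emitted. In fact, it almost certainly won't be.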

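The reason is that the current watermark cannot advance far enough to trigger the window until the third event arrives. It doesn't matter that you waited 20 seconds of real time. What matters is that no events passed through the timestamp extractor, so there was no basis for advancing the watermark.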
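
The arithmetic behind this can be sketched in plain Java, using the timestamps from the question's follow-up output. This is a simplified model under stated assumptions, not Flink code: the class and method names are made up for illustration, windows are assumed to be aligned to multiples of the slide with no offset, and a window is assumed to fire once the watermark reaches its last included millisecond.

```java
public class WatermarkTriggerSketch {
    static final long SLIDE_MS = 10_000;              // windows slide by 10 s
    static final long MAX_OUT_OF_ORDERNESS_MS = 100;  // bound used in the question

    /** Watermark of a bounded-out-of-orderness generator: highest timestamp seen minus the bound. */
    public static long watermark(long maxTimestampSeen) {
        return maxTimestampSeen - MAX_OUT_OF_ORDERNESS_MS;
    }

    /** End (exclusive) of the earliest sliding window containing the timestamp. */
    public static long firstWindowEnd(long timestamp) {
        long latestWindowStart = timestamp - Math.floorMod(timestamp, SLIDE_MS);
        // The earliest window containing the event ends one slide after the latest window starts.
        return latestWindowStart + SLIDE_MS;
    }

    /** Assumed trigger rule: fire once the watermark reaches the window's last millisecond. */
    public static boolean fires(long windowEnd, long watermark) {
        return watermark >= windowEnd - 1;
    }

    public static void main(String[] args) {
        long firstTwoEvents = 1575640032000L; // events 1 and 2
        long thirdEvent = 1575640052000L;     // sent 20 s later

        long windowEnd = firstWindowEnd(firstTwoEvents);
        System.out.println("first window ends at:     " + windowEnd);
        System.out.println("fires after events 1, 2:  " + fires(windowEnd, watermark(firstTwoEvents)));
        System.out.println("fires after event 3:      " + fires(windowEnd, watermark(thirdEvent)));
    }
}
```

Waiting 20 seconds changes none of the inputs to `watermark(...)`, which is why only the arrival of the third event lets the first window fire.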

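Furthermore, BoundedOutOfOrdernessTimestampExtractor is a periodic watermark generator, and by default it only creates a new watermark every 200 ms. This means your 3rd event has most likely already passed through the timestamp extractor and been processed before the watermark that triggers the window is generated.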

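If you switch to a punctuated watermark generator, you can get more deterministic watermarking, but the watermark will still follow the 3rd event, so you still won't get the result you expect.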
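
The ordering described above can be illustrated with a small single-threaded model in plain Java. It is only a sketch under simplifying assumptions: a single key, a watermark produced immediately after every event (the punctuated case), window results that reach the cache before the next event, and a cache that keeps only the count of the most recently closed non-empty window. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class WatermarkFollowsEventSketch {
    static final long WINDOW_SIZE_MS = 60_000;
    static final long SLIDE_MS = 10_000;
    static final long MAX_OUT_OF_ORDERNESS_MS = 100;

    /** Replays event timestamps for one key and returns the cached count joined to each event. */
    public static List<Integer> joinedCounts(long[] timestamps) {
        List<Integer> joined = new ArrayList<>();
        List<Long> seen = new ArrayList<>();
        int cachedCount = 0;
        long watermark = Long.MIN_VALUE;

        for (long ts : timestamps) {
            // 1. The event is joined first, with whatever count is cached right now.
            joined.add(cachedCount);
            seen.add(ts);

            // 2. Only afterwards does the watermark derived from this event advance.
            watermark = Math.max(watermark, ts - MAX_OUT_OF_ORDERNESS_MS);

            // 3. Find the latest window the watermark has closed and count its events.
            long windowEnd = Math.floorDiv(watermark + 1, SLIDE_MS) * SLIDE_MS;
            int count = 0;
            for (long t : seen) {
                if (t >= windowEnd - WINDOW_SIZE_MS && t < windowEnd) {
                    count++;
                }
            }
            // Empty windows emit nothing, so the cache keeps its old value in that case.
            if (count > 0) {
                cachedCount = count;
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        // Two events at 32 s, then one at 52 s: the scenario from the question.
        System.out.println(joinedCounts(new long[] {32_000, 32_000, 52_000}));
        // A further event at 53 s finally sees the count triggered by the 52 s event.
        System.out.println(joinedCounts(new long[] {32_000, 32_000, 52_000, 53_000}));
    }
}
```

In this model the third event is always joined with the stale count, because the window result it triggers can only arrive after the event itself has been handled.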

Answer 3

Thanks David and Arvid!

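What I hadn't understood until now is that watermarks are generated as events enter the system (whereas "processing time" is automatic and follows the server clock). And in any case, once the source goes idle, nothing more will happen. This is exactly what the docs say, but I missed it.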

In the following specific case, I get the output I expected:

echo "123;$(date -Iseconds)" | nc 0.0.0.0 9092 ; \
echo "123;$(date -Iseconds)" | nc 0.0.0.0 9092 ; \
sleep 20s ; \
echo "456;$(date -Iseconds)" | nc 0.0.0.0 9092 ; \
sleep 1s ; \
echo "123;$(date -Iseconds)" | nc 0.0.0.0 9092

Received> (123,1575640032000)
Received> (123,1575640032000)
Output> (123,1575640032000,0)
Output> (123,1575640032000,0)
... # 20s later
Received> (456,1575640052000)
Output> (456,1575640052000,0)
Count> (123,2)
Count> (123,2)
... # 1s later
Received> (123,1575640053000)
Output> (123,1575640053000,2)

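I see that my output may vary depending on whether I receive other events. In my use case I expect continuous input, but I still want the behavior to be stable.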

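With your watermark function, Arvid, I get the behavior I want, thank you. I'm still not sure whether I could replay a batch of input. I would think so, but I will still need to keep an eye on watermarks and event time.
Since this is not a bundled feature, it makes me wonder whether I am using Flink the wrong way.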

For reference, here is the code I ended up with.

    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
    env.setParallelism(1);

    // Input
    SocketTextStreamFunction source = new SocketTextStreamFunction("localhost", 9092, "\n", 0);
    SingleOutputStreamOperator<Tuple2<String, Long>> input = env.addSource(source)
      .map(new MapFunction<String, Tuple2<String, Long>>() {
        @Override
        public Tuple2<String, Long> map(String value) throws Exception {
          // Eg: 123;2019-11-29T16:03:44+01:00
          String[] split = value.split(";");
          LocalDateTime ldt = LocalDateTime.parse(split[1], DateTimeFormatter.ISO_OFFSET_DATE_TIME);
          long timestamp = ldt.atZone(ZoneOffset.systemDefault()).toInstant().toEpochMilli();
          return new Tuple2<>(split[0], timestamp);
        }
      });
    // Assign timestamp
    input = input.assignTimestampsAndWatermarks(new BoundedOutOfOrdernessWithTimeoutTimestampExtractor(Time.milliseconds(10), Time.milliseconds(10)));
    input.print("Received");

    // Count the nb of input in the last minutes, sliding by 10s
    SingleOutputStreamOperator<Tuple2<String, Integer>> count = input
      .map(new MapFunction<Tuple2<String, Long>, Tuple2<String, Integer>>() {
        @Override
        public Tuple2<String, Integer> map(Tuple2<String, Long> x) throws Exception {
          return new Tuple2<>(x.f0, 1);
        }
      })
      .keyBy(0)
      .timeWindow(Time.minutes(1), Time.seconds(10))
      .sum(1);
    count.print("Count");

    // Connect the input and the count
    SingleOutputStreamOperator<Tuple3<String, Long, Integer>> inputWithCount = input
      .keyBy(0)
      .connect(count.keyBy(0))
      .process(
        new CoProcessFunction<Tuple2<String, Long>, Tuple2<String, Integer>, Tuple3<String, Long, Integer>>() {
          private ValueState<Integer> countCache;
          private long previousCountTimestamp;

          @Override
          public void open(Configuration parameters) throws Exception {
            ValueStateDescriptor<Integer> desc = new ValueStateDescriptor<>("count", Integer.class);
            countCache = getRuntimeContext().getState(desc);
          }

          @Override
          public void processElement1(Tuple2<String, Long> input, Context ctx,
            Collector<Tuple3<String, Long, Integer>> out) throws Exception {
            Integer cached = countCache.value();
            if (cached == null) {
              cached = 0;
            }
            out.collect(new Tuple3<>(input.f0, input.f1, cached));
          }

          @Override
          public void processElement2(Tuple2<String, Integer> count, Context ctx,
            Collector<Tuple3<String, Long, Integer>> out) throws Exception {
            countCache.update(count.f1);

            ctx.timerService().deleteEventTimeTimer(previousCountTimestamp);
            previousCountTimestamp = ctx.timestamp() + Time.minutes(1).toMilliseconds();
            ctx.timerService().registerEventTimeTimer(previousCountTimestamp);
          }

          @Override
          public void onTimer(long timestamp, OnTimerContext ctx, Collector<Tuple3<String, Long, Integer>> out)
            throws Exception {
            System.out.println("Cache expires");
            countCache.clear();
          }
        });
    inputWithCount.print("Output");

    env.execute("Test");

By the way, I had to make the cache expire.

Output:

echo "123;$(date -Iseconds)" | nc 0.0.0.0 9092 ; \
echo "123;$(date -Iseconds)" | nc 0.0.0.0 9092 ; \
sleep 20s ; \
echo "456;$(date -Iseconds)" | nc 0.0.0.0 9092 ; \
sleep 1s ; \
echo "123;$(date -Iseconds)" | nc 0.0.0.0 9092

Received> (123,1575641582000)
Received> (123,1575641582000)
Output> (123,1575641582000,0)
Output> (123,1575641582000,0)
... # few s later
Count> (123,2)
... # 10s later
Count> (123,2)
... # few s later
Received> (456,1575641602000)
Output> (456,1575641602000,0)
Received> (123,1575641603000)
Output> (123,1575641603000,2)
... # few s later
Count> (123,3)
Count> (456,1)
... # 10s later
Count> (456,1)
Count> (123,3)
... # 10s later
Count> (456,1)
Count> (123,3)
... # 10s later
Count> (123,3)
Count> (456,1)
...

Flink. Event time. Window processed too late

3 answers: