Question

The textbook example of stream processing is a word count program over timestamped data. Given the following sample data:

mario 10:00
luigi 10:01
mario 11:00
mario 12:00

I have seen word count programs that produce:

A count over the whole data set

mario 3
luigi 1

A count per time-window partition

mario 10:00-11:00 1
luigi 10:00-11:00 1
mario 11:00-12:00 1
mario 12:00-13:00 1

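However, I have not found an example of a word count program over a rolling time window, i.e. I would like a count for every word to be produced every hour, accumulated since the beginning of time: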

mario 10:00-11:00 1
luigi 10:00-11:00 1
mario 11:00-12:00 2
luigi 11:00-12:00 1
mario 12:00-13:00 3
luigi 12:00-13:00 1
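
To make the target concrete, here is roughly what I am after, written as a plain-Java batch sketch with no streaming library involved. The class and method names are made up for illustration, and it assumes every input line has the form "word HH:mm":

```java
import java.time.LocalTime;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class CumulativeHourlyCount {

    /** Emits, for every hour from the first to the last seen, the running total of every word seen so far. */
    static List<String> report(List<String> lines) {
        // hour -> (word -> occurrences within that hour)
        TreeMap<Integer, Map<String, Integer>> perHour = new TreeMap<>();
        for (String line : lines) {
            String[] parts = line.split(" ");
            int hour = LocalTime.parse(parts[1]).getHour();
            perHour.computeIfAbsent(hour, h -> new LinkedHashMap<>())
                   .merge(parts[0], 1, Integer::sum);
        }

        List<String> out = new ArrayList<>();
        if (perHour.isEmpty()) {
            return out;
        }
        // Running totals since the beginning of the data, in first-seen order.
        Map<String, Integer> running = new LinkedHashMap<>();
        for (int hour = perHour.firstKey(); hour <= perHour.lastKey(); hour++) {
            Map<String, Integer> inHour =
                    perHour.getOrDefault(hour, Collections.<String, Integer>emptyMap());
            inHour.forEach((word, n) -> running.merge(word, n, Integer::sum));
            // Every known word is reported every hour, even if it did not occur in it.
            for (Map.Entry<String, Integer> e : running.entrySet()) {
                out.add(String.format("%s %02d:00-%02d:00 %d",
                        e.getKey(), hour, hour + 1, e.getValue()));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "mario 10:00", "luigi 10:01", "mario 11:00", "mario 12:00");
        report(sample).forEach(System.out::println);
    }
}
```

The question is how to get the same cumulative reports out of an unbounded stream.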

Is this possible with Apache Flink or any other stream processing library? Thanks!

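Edit:

So far I have tried a variation of David Anderson's approach, changing only processing time to event time, since the data is timestamped. It does not work as I expected, though. Here is the code, the sample data, the results it produces, and my follow-up questions: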

public static void main(String[] args) throws Exception {
    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment()
            .setParallelism(1)
            .setMaxParallelism(1);

    env.setStreamTimeCharacteristic(EventTime);


    String fileLocation = "full file path here";
    DataStreamSource<String> rawInput = env.readFile(new TextInputFormat(new Path(fileLocation)), fileLocation);

    rawInput.flatMap(parse())
            .assignTimestampsAndWatermarks(new AssignerWithPunctuatedWatermarks<TimestampedWord>() {
                @Nullable
                @Override
                public Watermark checkAndGetNextWatermark(TimestampedWord lastElement, long extractedTimestamp) {
                    return new Watermark(extractedTimestamp - 1);
                }

                @Override
                public long extractTimestamp(TimestampedWord element, long previousElementTimestamp) {
                    return element.getTimestamp();
                }
            })
            .keyBy(TimestampedWord::getWord)
            .process(new KeyedProcessFunction<String, TimestampedWord, Tuple3<String, Long, Long>>() {
                private transient ValueState<Long> count;

                @Override
                public void open(Configuration parameters) throws Exception {
                    count = getRuntimeContext().getState(new ValueStateDescriptor<>("counter", Long.class));
                }

                @Override
                public void processElement(TimestampedWord value, Context ctx, Collector<Tuple3<String, Long, Long>> out) throws Exception {
                    if (count.value() == null) {
                        count.update(0L);
                    }

                    long l = ((value.getTimestamp() / 10) + 1) * 10;
                    ctx.timerService().registerEventTimeTimer(l);

                    count.update(count.value() + 1);
                }

                @Override
                public void onTimer(long timestamp, OnTimerContext ctx, Collector<Tuple3<String, Long, Long>> out) throws Exception {
                    long currentWatermark = ctx.timerService().currentWatermark();
                    out.collect(new Tuple3(ctx.getCurrentKey(), count.value(), currentWatermark));
                }
            })
            .addSink(new PrintlnSink());

    env.execute();
}

private static long fileCounter = 0;

private static FlatMapFunction<String, TimestampedWord> parse() {
    return new FlatMapFunction<String, TimestampedWord>() {
        @Override
        public void flatMap(String value, Collector<TimestampedWord> out) {
            out.collect(new TimestampedWord(value, fileCounter++));
        }
    };
}

private static class TimestampedWord {
    private final String word;
    private final long timestamp;

    private TimestampedWord(String word, long timestamp) {
        this.word = word;
        this.timestamp = timestamp;
    }

    public String getWord() {
        return word;
    }

    public long getTimestamp() {
        return timestamp;
    }
}

private static class PrintlnSink implements org.apache.flink.streaming.api.functions.sink.SinkFunction<Tuple3<String, Long, Long>> {
    @Override
    public void invoke(Tuple3<String, Long, Long> value, Context context) throws Exception {
        System.out.println(value.getField(0) + "=" + value.getField(1) + " at " + value.getField(2));
    }
}

For a file containing the following words, one word per line:

mario, luigi, mario, mario, vilma, fred, bob, bob, mario, dan, dylan, dylan, fred, mario, mario, carl, bambam, summer, anna, anna, edu, anna, anna, anna, anna, anna

it produces the following output:

mario=4 at 10
luigi=1 at 10
dan=1 at 10
bob=2 at 10
fred=1 at 10
vilma=1 at 10
dylan=2 at 20
fred=2 at 20
carl=1 at 20
anna=3 at 20
summer=1 at 20
bambam=1 at 20
mario=6 at 20
anna=7 at 9223372036854775807
edu=1 at 9223372036854775807

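This is obviously wrong. I get a count of 3 for anna at 20, even though the third instance of anna does not appear until position 22. Strangely, edu only shows up in the last snapshot, even though it appears before the third instance of anna. How can I trigger a snapshot every 10 "time units" even when no messages arrive (i.e. the same data should be produced again)?

If anyone could point me in the right direction, I would be very grateful!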

Answer 1

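Yes, this is not only possible with Flink, it is easy. You can do it with a KeyedProcessFunction that keeps a counter in keyed state for the number of times each word/key has appeared in the input stream so far. Then use a timer to trigger the reports.

Here is an example that uses processing-time timers. It prints a report every 10 seconds.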

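import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.streaming.api.functions.source.SocketTextStreamFunction;
import org.apache.flink.util.Collector;

public class DSExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

        env.addSource(new SocketTextStreamFunction("localhost", 9999, "\n", -1))
            .keyBy(x -> x)
            .process(new KeyedProcessFunction<String, String, Tuple3<Long, String, Integer>>() {
                // Per-key running count of how often this word has been seen so far.
                private transient ValueState<Integer> counter;

                @Override
                public void open(Configuration parameters) throws Exception {
                    counter = getRuntimeContext().getState(new ValueStateDescriptor<>("counter", Integer.class));
                }

                @Override
                public void processElement(String s, Context context, Collector<Tuple3<Long, String, Integer>> collector) throws Exception {
                    if (counter.value() == null) {
                        // First occurrence of this key: start its count and schedule
                        // the first report at the next 10-second boundary.
                        counter.update(0);
                        long now = context.timerService().currentProcessingTime();
                        context.timerService().registerProcessingTimeTimer((now + 10000) - (now % 10000));
                    }
                    counter.update(counter.value() + 1);
                }

                @Override
                public void onTimer(long timestamp, OnTimerContext context, Collector<Tuple3<Long, String, Integer>> out) throws Exception {
                    // Re-arm the timer so this key keeps reporting every 10 seconds,
                    // whether or not new elements arrive for it.
                    long now = context.timerService().currentProcessingTime();
                    context.timerService().registerProcessingTimeTimer((now + 10000) - (now % 10000));
                    out.collect(Tuple3.of(now, context.getCurrentKey(), counter.value()));
                }
            })
            .print();

        env.execute();
    }
}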
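
The timer expression used above, (now + 10000) - (now % 10000), rounds the current time up to the next multiple of 10 seconds, which is what keeps the reports aligned to 10-second boundaries. As a standalone sketch of just that arithmetic (the class and method names are mine, not part of the Flink API, and it assumes non-negative times):

```java
final class TimerAlignment {

    /** Returns the smallest multiple of intervalMs that is strictly greater than nowMs. */
    static long nextBoundary(long nowMs, long intervalMs) {
        return (nowMs + intervalMs) - (nowMs % intervalMs);
    }

    public static void main(String[] args) {
        System.out.println(nextBoundary(12_345L, 10_000L)); // 20000
        System.out.println(nextBoundary(20_000L, 10_000L)); // 30000: a time on a boundary moves to the next one
    }
}
```

For non-negative inputs this is the same value as ((now / interval) + 1) * interval.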

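Updated:

It is always better to use event time, but this does add complexity. Most of the added complexity comes from the fact that in real applications you will very likely have to deal with out-of-order events. You have avoided that in your example, so in this case we can get by with a fairly simple implementation.

If you change two things, you will get the results you expect. First, setting the watermark to extractedTimestamp - 1 is why the results are wrong (for example, it is why you see anna=3 at 20). If you set the watermark to extractedTimestamp instead, this problem goes away.

Explanation: it is the arrival of the third anna that creates the watermark that closes the window at time 20. The third anna has a timestamp of 21, so it is followed in the stream by a watermark at 20, which closes the second window and produces the report saying anna=3. And yes, the first edu arrived earlier, but it was the first edu, with a timestamp of 20. At the time edu arrived there were no timers set for edu, and the timer that gets created is correctly set to fire at 30. So we do not hear about edu until a watermark of at least 30 arrives.

The other problem is the timer logic. Flink creates a separate timer for each key, and you need to create a new timer each time a timer fires. Otherwise you will only get reports for words that arrived during the window. You should modify the code to be more like this: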

@Override
public void processElement(TimestampedWord value, Context ctx, Collector<Tuple3<String, Long, Long>> out) throws Exception {
    if (count.value() == null) {
        // First occurrence of this key: initialise the count and register
        // a single timer for the end of the window the element falls into.
        count.update(0L);
        setTimer(ctx.timerService(), value.getTimestamp());
    }

    count.update(count.value() + 1);
}

@Override
public void onTimer(long timestamp, OnTimerContext ctx, Collector<Tuple3<String, Long, Long>> out) throws Exception {
    long currentWatermark = ctx.timerService().currentWatermark();
    out.collect(Tuple3.of(ctx.getCurrentKey(), count.value(), currentWatermark));
    // Re-arm the timer for the next window, unless the final
    // Long.MAX_VALUE watermark has signalled the end of the input.
    if (currentWatermark < Long.MAX_VALUE) {
        setTimer(ctx.timerService(), currentWatermark);
    }
}

// Registers an event-time timer at the next multiple of 10 after t.
private void setTimer(TimerService service, long t) {
    service.registerEventTimeTimer(((t / 10) + 1) * 10);
}

With these changes, I get these results:

mario=4 at 10
luigi=1 at 10
fred=1 at 10
bob=2 at 10
vilma=1 at 10
dan=1 at 10
vilma=1 at 20
luigi=1 at 20
dylan=2 at 20
carl=1 at 20
bambam=1 at 20
mario=6 at 20
summer=1 at 20
anna=2 at 20
bob=2 at 20
fred=2 at 20
dan=1 at 20
fred=2 at 9223372036854775807
dan=1 at 9223372036854775807
carl=1 at 9223372036854775807
dylan=2 at 9223372036854775807
vilma=1 at 9223372036854775807
edu=1 at 9223372036854775807
anna=7 at 9223372036854775807
summer=1 at 9223372036854775807
bambam=1 at 9223372036854775807
luigi=1 at 9223372036854775807
bob=2 at 9223372036854775807
mario=6 at 9223372036854775807
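
If you want to reason about this behaviour without running a Flink job, a small plain-Java simulation can help. The sketch below is not Flink code: it only mimics one event-time timer per key and a watermark that trails each element by a configurable lag. The class name, the run method and the 26-word input (inferred from the counts reported above, with the line index used as the timestamp) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class RollingCountSimulation {
    static final long WINDOW = 10;

    // Assumed input, inferred from the reported output; timestamp = position in the list.
    static final List<String> WORDS = Arrays.asList(
            "mario", "luigi", "mario", "mario", "vilma", "fred", "bob", "bob", "mario",
            "dan", "dylan", "dylan", "fred", "mario", "mario", "carl", "bambam", "summer",
            "anna", "anna", "edu", "anna", "anna", "anna", "anna", "anna");

    /** Replays the words; after each one the watermark becomes (timestamp - watermarkLag). */
    static List<String> run(List<String> words, long watermarkLag) {
        Map<String, Long> counts = new LinkedHashMap<>();
        Map<String, Long> timers = new LinkedHashMap<>(); // one pending timer per key
        List<String> reports = new ArrayList<>();
        for (int ts = 0; ts < words.size(); ts++) {
            String word = words.get(ts);
            if (!counts.containsKey(word)) {
                counts.put(word, 0L);
                timers.put(word, nextBoundary(ts)); // timer set only on first occurrence
            }
            counts.merge(word, 1L, Long::sum);
            advance(ts - watermarkLag, counts, timers, reports);
        }
        advance(Long.MAX_VALUE, counts, timers, reports); // end-of-input watermark
        return reports;
    }

    static long nextBoundary(long t) {
        return (t / WINDOW + 1) * WINDOW;
    }

    /** Fires every timer at or below the watermark, then re-arms it for the next window. */
    static void advance(long watermark, Map<String, Long> counts,
                        Map<String, Long> timers, List<String> reports) {
        Iterator<Map.Entry<String, Long>> it = timers.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> timer = it.next();
            if (timer.getValue() > watermark) {
                continue;
            }
            reports.add(timer.getKey() + "=" + counts.get(timer.getKey()) + " at " + watermark);
            if (watermark < Long.MAX_VALUE) {
                timer.setValue(nextBoundary(watermark));
            } else {
                it.remove();
            }
        }
    }

    public static void main(String[] args) {
        run(WORDS, 0).forEach(System.out::println);
    }
}
```

With a lag of 0 the simulated report at watermark 20 contains anna=2, and with a lag of 1 it contains anna=3, which is the discrepancy discussed above. The order of lines within one watermark will not necessarily match what Flink prints.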

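Now, if you need to actually handle out-of-order events, this gets quite a bit more complex. The watermarks have to lag behind the timestamps by some realistic amount that reflects the actual out-of-orderness present in the stream, which in turn makes it necessary to have more than one window open at a time. Any given event/word might not belong to the window that will close next, and so should not increment its counter. You could, for example, buffer these "early" events in another piece of state (e.g. ListState), or somehow maintain multiple counters (perhaps in a MapState). Furthermore, some events might be late, invalidating earlier reports, and you will want to define some policy for handling that.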
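
As a rough illustration of the "multiple counters" idea, here is a plain-Java sketch for a single key. It is not Flink code: the TreeMap stands in for what might be a MapState keyed by window end, the class and field names are hypothetical, and the policy for late events (just counting them) is only one of several you could choose. It also reports only windows that received at least one event.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class OutOfOrderCounter {
    static final long WINDOW = 10;

    final List<String> reports = new ArrayList<>();
    long lateEvents = 0;

    private final long maxLag;                                       // assumed bound on out-of-orderness
    private final TreeMap<Long, Long> openWindows = new TreeMap<>(); // window end -> events in that window
    private long closedTotal = 0;                                    // cumulative count over closed windows
    private long watermark = Long.MIN_VALUE;

    OutOfOrderCounter(long maxLag) {
        this.maxLag = maxLag;
    }

    void onEvent(long ts) {
        long windowEnd = (ts / WINDOW + 1) * WINDOW;
        if (windowEnd <= watermark) {
            lateEvents++; // its window was already reported; here we only count it
        } else {
            openWindows.merge(windowEnd, 1L, Long::sum);
        }
        advanceWatermark(ts - maxLag);
    }

    void advanceWatermark(long newWatermark) {
        if (newWatermark <= watermark) {
            return; // watermarks never move backwards
        }
        watermark = newWatermark;
        // Close every window whose end has been passed, oldest first.
        while (!openWindows.isEmpty() && openWindows.firstKey() <= watermark) {
            Map.Entry<Long, Long> closed = openWindows.pollFirstEntry();
            closedTotal += closed.getValue();
            reports.add("count=" + closedTotal + " at " + closed.getKey());
        }
    }

    public static void main(String[] args) {
        OutOfOrderCounter counter = new OutOfOrderCounter(3);
        for (long ts : new long[] {1, 8, 12, 9, 15, 25, 7}) {
            counter.onEvent(ts);
        }
        counter.advanceWatermark(Long.MAX_VALUE);
        counter.reports.forEach(System.out::println);
        System.out.println("late events: " + counter.lateEvents);
    }
}
```

In this run the event with timestamp 12 arrives before the one with timestamp 9, yet it is counted in the second window because two windows are open at once. The final event with timestamp 7 arrives after its window has been reported and is treated as late.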

Can Flink produce hourly snapshots of aggregated/rolling/cumulative data?