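Alternatives to GlobalWindow for rolling aggregation

Date: 2018-07-16 22:22:23

Tags: apache-flink flink-streaming

I would like to know whether Flink is a good fit for the following use case. Suppose I have a stream of measurements (device_id, value), for example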

(1,10.2),(2,3.4),(3,9.1),(1,7.0),(3,6.3),(5,17.8)

Every minute, I want to report the latest value for each device_id seen so far.

Given the data:

data:  (1, 10.2), (2, 3.4), (3, 9.1), (1, 7.0), (3, 6.3), (5, 17.8)

time: 0 ----------------- 1min -------------- 2min ------------------ 3min

I want a result like this:

1:{(1,10.2),(2,3.4)}

2:{(1,7.0),(2,3.4),(3,9.1)}

3:{(1,7.0),(2,3.4),(3,6.3),(5,17.8)}
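
To make the requested semantics concrete, here is a minimal plain-Java sketch (no Flink involved) that keeps the latest value per device and takes a snapshot at each minute boundary. The class name LatestPerDeviceSketch, the method snapshots, and the millisecond timestamps are all hypothetical: the question only shows which minute each measurement falls into. The sketch also assumes that events arrive in timestamp order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class LatestPerDeviceSketch {

    /** Returns one "latest value per device" snapshot for each elapsed minute. */
    static List<Map<Integer, Double>> snapshots(
            long[] tsMillis, int[] deviceIds, double[] values, int minutes) {
        Map<Integer, Double> latest = new TreeMap<>();
        List<Map<Integer, Double>> result = new ArrayList<>();
        int i = 0;
        for (int minute = 1; minute <= minutes; minute++) {
            long boundary = minute * 60_000L;
            // Fold in every event that happened before this minute boundary.
            while (i < tsMillis.length && tsMillis[i] < boundary) {
                latest.put(deviceIds[i], values[i]);
                i++;
            }
            // Copy the map so later updates do not change earlier snapshots.
            result.add(new TreeMap<>(latest));
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed timestamps, chosen to match the timeline in the question.
        long[] ts = {5_000, 40_000, 70_000, 100_000, 130_000, 170_000};
        int[] ids = {1, 2, 3, 1, 3, 5};
        double[] vals = {10.2, 3.4, 9.1, 7.0, 6.3, 17.8};
        List<Map<Integer, Double>> snaps = snapshots(ts, ids, vals, 3);
        for (int m = 0; m < snaps.size(); m++) {
            System.out.println((m + 1) + ":" + snaps.get(m));
        }
    }
}
```

Note that the state is one entry per device, not one entry per event, which is the memory behavior the question is after.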

I came up with an implementation based on

.windowAll(GlobalWindows.create()).trigger(CountTrigger.of(1)).apply( ... )

but it does not look good on large datasets (memory-wise). Is there another way?

1 answer:

Answer 0 (score: 0)

You may want to use something like the following as a starting point:

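import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.api.java.tuple.Tuple4;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.AssignerWithPunctuatedWatermarks;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.streaming.api.watermark.Watermark;
import org.apache.flink.util.Collector;

import javax.annotation.Nullable;
import java.io.IOException;
import java.util.concurrent.TimeUnit;

public class StreamingJob {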
  private static final TimeUnit windowTimeUnit = TimeUnit.SECONDS;
  private static final long windowLength = 10;

  // Returns the last millisecond of the window (of the given length) that contains timestamp.
  private static long getNearestRightBoundaryFor(long timestamp, long duration, TimeUnit unit){
    long durationMillis = unit.toMillis(duration);
    long quotient = timestamp / durationMillis;
    return (quotient + 1) * durationMillis - 1;
  }

  public static void main(String[] args) throws Exception {

    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
    env.fromElements(
            Tuple3.of(1000L, 1L, 3.8f), Tuple3.of(2003L, 2L, 82.3f), Tuple3.of(3006L, 1L, 4.2f), // 0 - 09
            Tuple3.of(11120L, 2L, 10f), Tuple3.of(12140L, 2L, 7.15f), Tuple3.of(13150L, 3L, 3.33f), // 10 - 19
            Tuple3.of(21200L, 2L, 1.09f), Tuple3.of(22270L, 1L, 2.22f), Tuple3.of(23280L, 2L, 3.8f), // 20 - 29
            Tuple3.of(31310L, 3L, 3.12f), Tuple3.of(32330L, 2L, 9.2f), Tuple3.of(33390L, 1L, 4.0f) // 30 - 39
    )
    .assignTimestampsAndWatermarks(
            new AssignerWithPunctuatedWatermarks<Tuple3<Long,Long,Float>>() {
                @Nullable
                @Override
                public Watermark checkAndGetNextWatermark(Tuple3<Long, Long, Float> lastElement, long extractedTimestamp) {
                    return new Watermark(extractedTimestamp);
                }

                @Override
                public long extractTimestamp(Tuple3<Long, Long, Float> element, long previousElementTimestamp) {
                    return element.f0;
                }
            })
    .keyBy(new KeySelector<Tuple3<Long,Long,Float>, Long>() {
        @Override
        public Long getKey(Tuple3<Long, Long, Float> value) throws Exception {
            return value.f1;
        }
    })
    .process(new KeyedProcessFunction<Long, Tuple3<Long,Long,Float>, Tuple4<Long, Long, Long, Float>>() {
        private ValueState<Tuple3<Long, Long, Float>> state;

        @Override
        public void open(Configuration parameters) {
            ValueStateDescriptor<Tuple3<Long, Long, Float>> descriptor = new ValueStateDescriptor<>(
                    "state",
                    TypeInformation.of(new TypeHint<Tuple3<Long, Long, Float>>() {
                    }));

            state = getRuntimeContext().getState(descriptor);
        }

        @Override
        public void processElement(Tuple3<Long, Long, Float> value, Context ctx, Collector<Tuple4<Long, Long, Long, Float>> out) throws Exception {
            Tuple3<Long, Long, Float> currentValue = state.value();
            if (currentValue == null) {
                Long ts = getNearestRightBoundaryFor(value.f0, windowLength, windowTimeUnit);
                ctx.timerService().registerEventTimeTimer(ts);
                state.update(value);
            }
            else if (value.f0 > currentValue.f0) { // ignore out-of-order events
                state.update(value);
            }
        }

        @Override
        public void onTimer(long timestamp, OnTimerContext ctx, Collector<Tuple4<Long, Long, Long, Float>> out) throws IOException {
            Tuple3<Long, Long, Float> currentValue = state.value();
            out.collect(Tuple4.of(timestamp, currentValue.f0, currentValue.f1, currentValue.f2));
            Long newTs = timestamp + windowTimeUnit.toMillis(windowLength);
            if (ctx.timerService().currentWatermark() < Long.MAX_VALUE) {
                ctx.timerService().registerEventTimeTimer(newTs);
            }
        }
    })
    .print();
    env.execute("Flink FTW!");
  }
}
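
The timer arithmetic in getNearestRightBoundaryFor is easy to check in isolation. The sketch below copies the formula into a standalone class; BoundarySketch and nearestRightBoundary are hypothetical names used only for this illustration, and the formula is only meaningful for non-negative timestamps because it relies on integer division.

```java
import java.util.concurrent.TimeUnit;

final class BoundarySketch {

    /** Last millisecond of the fixed-length window that contains the timestamp. */
    static long nearestRightBoundary(long timestamp, long duration, TimeUnit unit) {
        long durationMillis = unit.toMillis(duration);
        return (timestamp / durationMillis + 1) * durationMillis - 1;
    }

    public static void main(String[] args) {
        long[] samples = {1000L, 9999L, 10000L, 11120L, 33390L};
        for (long ts : samples) {
            // With 10-second windows: 1000 -> 9999, 10000 -> 19999, 33390 -> 39999.
            System.out.println(ts + " -> " + nearestRightBoundary(ts, 10, TimeUnit.SECONDS));
        }
    }
}
```

A timestamp that is already the last millisecond of a window (9999) maps to itself, and the first millisecond of the next window (10000) maps to 19999, which is why the timers in the job fire at 9999, 19999, 29999 and 39999.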

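A few things to point out:

I would not recommend using windows for this. With GlobalWindows, managing state expiration gets complicated.

I used AssignerWithPunctuatedWatermarks rather than AscendingTimestampExtractor, for three reasons: (1) once you switch to running in parallel, it can be hard to guarantee that events arrive in order; (2) AscendingTimestampExtractor generates watermarks periodically (by default, every 200 ms of wall-clock time), and in this example the application has consumed all of its input before the first watermark would be generated; (3) all that is needed to handle out-of-order events is the simple check in the processElement method. In production, however, it would probably be better to use AscendingTimestampExtractor if the events really are in order, or BoundedOutOfOrdernessTimestampExtractor if they are not.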
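
The idea behind a bounded-out-of-orderness watermark can be sketched in plain Java: the watermark trails the largest timestamp seen so far by a fixed allowance, so it never moves backwards when a late event shows up. This illustrates the concept only and is not Flink's BoundedOutOfOrdernessTimestampExtractor; the class name BoundedLatenessSketch, the method onEvent, and the 2000 ms allowance are assumptions made for the example.

```java
final class BoundedLatenessSketch {
    private final long maxOutOfOrdernessMs;
    private long maxTimestamp = Long.MIN_VALUE;

    BoundedLatenessSketch(long maxOutOfOrdernessMs) {
        this.maxOutOfOrdernessMs = maxOutOfOrdernessMs;
    }

    /** Records an event timestamp and returns the watermark after that event. */
    long onEvent(long timestamp) {
        // Late events do not lower the maximum, so the watermark never regresses.
        maxTimestamp = Math.max(maxTimestamp, timestamp);
        return maxTimestamp - maxOutOfOrdernessMs;
    }

    public static void main(String[] args) {
        BoundedLatenessSketch wm = new BoundedLatenessSketch(2000L);
        long[] events = {1000L, 3006L, 2003L, 11120L}; // 2003 arrives out of order
        for (long ts : events) {
            System.out.println("event " + ts + " -> watermark " + wm.onEvent(ts));
        }
    }
}
```

With a 2000 ms allowance, the event at 11120 only moves the watermark to 9120, so in this model a timer set for 9999 would not fire until a later event arrives.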

The output looks something like this:

(9999,11120,2,10.0)
(19999,21200,2,1.09)
(19999,13150,3,3.33)
(29999,23280,2,3.8)
(29999,31310,3,3.12)
(39999,32330,2,9.2)
(39999,31310,3,3.12)
(9999,3006,1,4.2)
(19999,3006,1,4.2)
(29999,22270,1,2.22)
(39999,33390,1,4.0)

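The timer at 9999 emits (11120,2,10.0) because the arrival of that event, with timestamp 11120, is what pushed the watermark past 9999 and fired the timer. By the time onTimer was called, processElement had already been called for that event and had updated the state.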
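
The firing order can be replayed without Flink. The sketch below is a simplified, single-threaded model of the job, not how Flink implements timers. Its assumptions: the names TimerReplaySketch, KeyState and replay are hypothetical; each key has exactly one pending timer; the watermark advances to each event's timestamp right after that event is processed (as the punctuated assigner does); and a final Long.MAX_VALUE watermark arrives when the input ends.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class TimerReplaySketch {
    static final long WINDOW_MS = 10_000L;

    /** Per-key state: the latest event seen and the single pending timer. */
    static final class KeyState {
        long latestTs;
        float latestValue;
        long timer;
        boolean timerActive;
    }

    static long rightBoundary(long ts) {
        return (ts / WINDOW_MS + 1) * WINDOW_MS - 1;
    }

    /** Fires every pending timer that the new watermark has passed. */
    static void advanceWatermark(long watermark, Map<Long, KeyState> states, List<String> out) {
        for (Map.Entry<Long, KeyState> e : states.entrySet()) {
            KeyState s = e.getValue();
            while (s.timerActive && s.timer <= watermark) {
                out.add("(" + s.timer + "," + s.latestTs + "," + e.getKey() + "," + s.latestValue + ")");
                if (watermark < Long.MAX_VALUE) {
                    s.timer += WINDOW_MS;      // re-arm for the next window
                } else {
                    s.timerActive = false;     // input finished: stop re-arming
                }
            }
        }
    }

    static List<String> replay(long[] ts, long[] keys, float[] values) {
        Map<Long, KeyState> states = new TreeMap<>();
        List<String> out = new ArrayList<>();
        long watermark = Long.MIN_VALUE;
        for (int i = 0; i < ts.length; i++) {
            // Step 1: the element is processed and updates the state.
            KeyState s = states.get(keys[i]);
            if (s == null) {
                s = new KeyState();
                s.timer = rightBoundary(ts[i]);
                s.timerActive = true;
                s.latestTs = ts[i];
                s.latestValue = values[i];
                states.put(keys[i], s);
            } else if (ts[i] > s.latestTs) {   // ignore out-of-order events
                s.latestTs = ts[i];
                s.latestValue = values[i];
            }
            // Step 2: only then does the watermark advance and fire timers.
            watermark = Math.max(watermark, ts[i]);
            advanceWatermark(watermark, states, out);
        }
        advanceWatermark(Long.MAX_VALUE, states, out);
        return out;
    }

    public static void main(String[] args) {
        long[] ts = {1000, 2003, 3006, 11120, 12140, 13150, 21200, 22270, 23280, 31310, 32330, 33390};
        long[] keys = {1, 2, 1, 2, 2, 3, 2, 1, 2, 3, 2, 1};
        float[] vals = {3.8f, 82.3f, 4.2f, 10f, 7.15f, 3.33f, 1.09f, 2.22f, 3.8f, 3.12f, 9.2f, 4.0f};
        replay(ts, keys, vals).forEach(System.out::println);
    }
}
```

Because the state update happens before the watermark advances, the timer at 9999 for key 2 sees the event stamped 11120. The replay yields the same eleven tuples as the output listed above, though not necessarily in the same order.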

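The check of ctx.timerService().currentWatermark() in onTimer keeps the function from registering another timer once the watermark has reached Long.MAX_VALUE, which happens when the finite input has been fully consumed.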