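I would like to know whether Flink is a good fit for the following use case. Suppose I have a stream of measurements (device_id, value), for example
(1,10.2),(2,3.4),(3,9.1),(1,7.0),(3,6.3),(5,17.8)
Every minute, I want to report the latest value of each device_id seen so far.
Given the data: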
data: (1, 10.2), (2, 3.4), (3, 9.1), (1, 7.0), (3, 6.3), (5, 17.8)
time: 0 ----------------- 1min -------------- 2min ------------------ 3min
I would like to get this result:
1:{(1,10.2),(2,3.4)}
2:{(1,7.0),(2,3.4),(3,9.1)}
3:{(1,7.0),(2,3.4),(3,6.3),(5,17.8)}
I came up with an implementation based on
.windowAll(GlobalWindows.create()).trigger(CountTrigger.of(1)).apply( ... )
but it does not look good on large data sets (in terms of memory). Is there another approach?
Answer 0 (score: 0)
You might want to use something like the following as a starting point:
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.api.java.tuple.Tuple4;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.AssignerWithPunctuatedWatermarks;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.streaming.api.watermark.Watermark;
import org.apache.flink.util.Collector;

import javax.annotation.Nullable;
import java.io.IOException;
import java.util.concurrent.TimeUnit;

public class StreamingJob {
    private static final TimeUnit windowTimeUnit = TimeUnit.SECONDS;
    private static final long windowLength = 10;

    // Last millisecond of the reporting interval that contains the given timestamp.
    private static long getNearestRightBoundaryFor(long timestamp, long duration, TimeUnit unit) {
        long durationEpoch = unit.toMillis(duration);
        long quotient = timestamp / durationEpoch;
        return (quotient + 1) * durationEpoch - 1;
    }

    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        // Elements are (timestamp in ms, device id, value).
        env.fromElements(
                Tuple3.of(1000L, 1L, 3.8f), Tuple3.of(2003L, 2L, 82.3f), Tuple3.of(3006L, 1L, 4.2f), // 0 - 09
                Tuple3.of(11120L, 2L, 10f), Tuple3.of(12140L, 2L, 7.15f), Tuple3.of(13150L, 3L, 3.33f), // 10 - 19
                Tuple3.of(21200L, 2L, 1.09f), Tuple3.of(22270L, 1L, 2.22f), Tuple3.of(23280L, 2L, 3.8f), // 20 - 29
                Tuple3.of(31310L, 3L, 3.12f), Tuple3.of(32330L, 2L, 9.2f), Tuple3.of(33390L, 1L, 4.0f) // 30 - 39
            )
            .assignTimestampsAndWatermarks(
                new AssignerWithPunctuatedWatermarks<Tuple3<Long, Long, Float>>() {
                    // Propose a watermark after every element.
                    @Nullable
                    @Override
                    public Watermark checkAndGetNextWatermark(Tuple3<Long, Long, Float> lastElement, long extractedTimestamp) {
                        return new Watermark(extractedTimestamp);
                    }

                    @Override
                    public long extractTimestamp(Tuple3<Long, Long, Float> element, long previousElementTimestamp) {
                        return element.f0;
                    }
                })
            .keyBy(new KeySelector<Tuple3<Long, Long, Float>, Long>() {
                @Override
                public Long getKey(Tuple3<Long, Long, Float> value) throws Exception {
                    return value.f1;
                }
            })
            .process(new KeyedProcessFunction<Long, Tuple3<Long, Long, Float>, Tuple4<Long, Long, Long, Float>>() {
                // Latest element seen for the current key.
                private ValueState<Tuple3<Long, Long, Float>> state;

                @Override
                public void open(Configuration parameters) {
                    ValueStateDescriptor<Tuple3<Long, Long, Float>> descriptor = new ValueStateDescriptor<>(
                        "state",
                        TypeInformation.of(new TypeHint<Tuple3<Long, Long, Float>>() {
                        }));
                    state = getRuntimeContext().getState(descriptor);
                }

                @Override
                public void processElement(Tuple3<Long, Long, Float> value, Context ctx, Collector<Tuple4<Long, Long, Long, Float>> out) throws Exception {
                    Tuple3<Long, Long, Float> currentValue = state.value();
                    if (currentValue == null) {
                        // First element for this key: schedule the first report.
                        long ts = getNearestRightBoundaryFor(value.f0, windowLength, windowTimeUnit);
                        ctx.timerService().registerEventTimeTimer(ts);
                        state.update(value);
                    } else if (value.f0 > currentValue.f0) { // ignore out-of-order events
                        state.update(value);
                    }
                }

                @Override
                public void onTimer(long timestamp, OnTimerContext ctx, Collector<Tuple4<Long, Long, Long, Float>> out) throws IOException {
                    Tuple3<Long, Long, Float> currentValue = state.value();
                    // Output is (timer timestamp, element timestamp, device id, value).
                    out.collect(Tuple4.of(timestamp, currentValue.f0, currentValue.f1, currentValue.f2));
                    long newTs = timestamp + windowTimeUnit.toMillis(windowLength);
                    if (ctx.timerService().currentWatermark() < Long.MAX_VALUE) {
                        ctx.timerService().registerEventTimeTimer(newTs);
                    }
                }
            })
            .print();

        env.execute("Flink FTW!");
    }
}
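To make the timer arithmetic concrete, here is a small standalone sketch of the same calculation as getNearestRightBoundaryFor, with no Flink dependency. The class name WindowBoundaries and the sample timestamps are chosen for illustration only.

```java
import java.util.concurrent.TimeUnit;

class WindowBoundaries {
    // Same arithmetic as getNearestRightBoundaryFor: the last millisecond
    // of the interval of the given length that contains the timestamp.
    static long nearestRightBoundary(long timestamp, long duration, TimeUnit unit) {
        long durationMillis = unit.toMillis(duration);
        return (timestamp / durationMillis + 1) * durationMillis - 1;
    }

    public static void main(String[] args) {
        long[] samples = {1000L, 9999L, 10000L, 13150L, 33390L};
        for (long ts : samples) {
            // Prints e.g. "1000 -> 9999" and "10000 -> 19999"
            System.out.println(ts + " -> " + nearestRightBoundary(ts, 10, TimeUnit.SECONDS));
        }
    }
}
```

With a 10-second interval, the first event of a key at timestamp 13150 schedules its first report for 19999. Every later report for that key is scheduled by onTimer at the previous timer timestamp plus 10000.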
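A few things to point out:
I would not recommend using windows for this. With GlobalWindows, managing state expiry becomes complicated.
I used an AssignerWithPunctuatedWatermarks rather than an AscendingTimestampExtractor. I did this for three reasons: (1) once you switch to running in parallel, it can be hard to ensure that events arrive in order; (2) an AscendingTimestampExtractor generates watermarks periodically (by default, every 200 ms of wall-clock time), and in this example the application has consumed all of its input before the first watermark is generated; (3) all that is needed to handle out-of-order events is the simple check in the processElement method. However, if the events really do arrive in order, it is better to use an AscendingTimestampExtractor or a BoundedOutOfOrdernessTimestampExtractor in production.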
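As a rough illustration of how the watermark choice affects when timers fire, the following plain-Java model computes a watermark after each event as the highest timestamp seen so far minus an allowed out-of-orderness bound. This is a simplified model, not Flink's API: the class WatermarkSketch, the method watermarks, and the 5000 ms bound are all assumptions made for the example, and real periodic assigners emit on a wall-clock interval rather than per event.

```java
import java.util.Arrays;

class WatermarkSketch {
    // Watermark after each event = highest timestamp seen so far minus the bound.
    // A bound of 0 models the per-event watermarks used in the answer.
    static long[] watermarks(long[] timestamps, long maxOutOfOrdernessMillis) {
        long[] result = new long[timestamps.length];
        long maxSeen = Long.MIN_VALUE;
        for (int i = 0; i < timestamps.length; i++) {
            maxSeen = Math.max(maxSeen, timestamps[i]);
            result[i] = maxSeen - maxOutOfOrdernessMillis;
        }
        return result;
    }

    public static void main(String[] args) {
        long[] ts = {1000L, 3006L, 2003L, 11120L}; // 2003 arrives out of order
        System.out.println(Arrays.toString(watermarks(ts, 0L)));    // [1000, 3006, 3006, 11120]
        System.out.println(Arrays.toString(watermarks(ts, 5000L))); // [-4000, -1994, -1994, 6120]
    }
}
```

In this model, with a bound of 0 the event at 11120 moves the watermark past 9999, so a timer registered for 9999 would fire right away. With a 5000 ms bound the watermark only reaches 6120 at that point, so the same timer would fire later.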
The output looks something like this:
(9999,11120,2,10.0)
(19999,21200,2,1.09)
(19999,13150,3,3.33)
(29999,23280,2,3.8)
(29999,31310,3,3.12)
(39999,32330,2,9.2)
(39999,31310,3,3.12)
(9999,3006,1,4.2)
(19999,3006,1,4.2)
(29999,22270,1,2.22)
(39999,33390,1,4.0)
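The reason (11120,2,10.0) is emitted by the timer at 9999 is that the arrival of the event with timestamp 11120 pushes the watermark past 9999, which fires that timer. By the time onTimer is called, processElement has already been called for that event, so the state already holds it.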
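The check on ctx.timerService().currentWatermark() in onTimer skips registering the next timer once the watermark has reached Long.MAX_VALUE, which is the final watermark Flink emits when a bounded input is exhausted.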
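The interplay of processElement, watermarks, and timers can be replayed without Flink. The sketch below is a single-threaded simulation under stated assumptions: a watermark equal to the event timestamp follows every event, a final watermark of Long.MAX_VALUE ends the input, and timers due at the same timestamp fire in key order. The class LatestValueSimulation and its methods are hypothetical names, not Flink APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

class LatestValueSimulation {
    static final long WINDOW_MILLIS = 10_000L;

    static final class Event {
        final long timestamp;
        final long deviceId;
        final float value;

        Event(long timestamp, long deviceId, float value) {
            this.timestamp = timestamp;
            this.deviceId = deviceId;
            this.value = value;
        }
    }

    private final Map<Long, Event> latestByDevice = new HashMap<>();
    // Timer timestamp -> device ids that have a timer at that timestamp.
    private final TreeMap<Long, TreeSet<Long>> timers = new TreeMap<>();
    private final List<String> output = new ArrayList<>();

    private void registerTimer(long timestamp, long deviceId) {
        timers.computeIfAbsent(timestamp, t -> new TreeSet<>()).add(deviceId);
    }

    // Mirrors processElement: keep the newest event per device and
    // schedule the first report when a device is seen for the first time.
    private void processElement(Event e) {
        Event current = latestByDevice.get(e.deviceId);
        if (current == null) {
            registerTimer((e.timestamp / WINDOW_MILLIS + 1) * WINDOW_MILLIS - 1, e.deviceId);
            latestByDevice.put(e.deviceId, e);
        } else if (e.timestamp > current.timestamp) { // ignore out-of-order events
            latestByDevice.put(e.deviceId, e);
        }
    }

    // Mirrors onTimer: fire every timer at or below the watermark and
    // reschedule unless the final watermark has been reached.
    private void advanceWatermark(long watermark) {
        while (!timers.isEmpty() && timers.firstKey() <= watermark) {
            Map.Entry<Long, TreeSet<Long>> due = timers.pollFirstEntry();
            for (long deviceId : due.getValue()) {
                Event latest = latestByDevice.get(deviceId);
                output.add("(" + due.getKey() + "," + latest.timestamp + ","
                        + deviceId + "," + latest.value + ")");
                if (watermark < Long.MAX_VALUE) {
                    registerTimer(due.getKey() + WINDOW_MILLIS, deviceId);
                }
            }
        }
    }

    static List<String> run(List<Event> events) {
        LatestValueSimulation sim = new LatestValueSimulation();
        for (Event e : events) {
            sim.processElement(e);             // state is updated first
            sim.advanceWatermark(e.timestamp); // then the watermark fires due timers
        }
        sim.advanceWatermark(Long.MAX_VALUE);  // end of bounded input
        return sim.output;
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
                new Event(1000L, 1L, 3.8f), new Event(2003L, 2L, 82.3f),
                new Event(3006L, 1L, 4.2f), new Event(11120L, 2L, 10f));
        run(events).forEach(System.out::println);
    }
}
```

For these four events the simulation reports (9999,3006,1,4.2) and (9999,11120,2,10.0) when the event at 11120 arrives, then flushes the timers at 19999 at end of input. Without the Long.MAX_VALUE check, the final flush would keep rescheduling timers that are immediately due, and the loop would never end.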