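The following code receives messages on a socket, counts them over a 1-minute window (sliding by 10 seconds), and then enriches each input with the cached count.
Processing is in event time: the messages I receive contain the timestamp I want to use for processing.
This is close to a training exercise.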
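A minimal sketch of how one "id;timestamp" line can be turned into an event-time timestamp in epoch milliseconds. It is plain Java with no Flink dependency; the class and method names are hypothetical. It assumes the timestamp is parsed with OffsetDateTime so that the offset in the message (for example +01:00) is honored instead of the machine's default zone.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;

final class MessageParsing {
    /** Hypothetical helper: epoch milliseconds of an "id;timestamp" line. */
    static long eventTimeMillis(String line) {
        String[] parts = line.split(";");
        // OffsetDateTime keeps the "+01:00" part, so the result does not depend
        // on the time zone of the machine that runs the job.
        return OffsetDateTime.parse(parts[1], DateTimeFormatter.ISO_OFFSET_DATE_TIME)
                .toInstant()
                .toEpochMilli();
    }

    public static void main(String[] args) {
        // Same instant written with two different offsets.
        System.out.println(eventTimeMillis("123;2019-11-29T16:03:44+01:00"));
        System.out.println(eventTimeMillis("123;2019-11-29T15:03:44Z"));
    }
}
```

Both lines print the same value, because the two strings describe the same instant.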
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
env.setParallelism(1);
// Input
SocketTextStreamFunction source = new SocketTextStreamFunction("localhost", 9092, "\n", 0);
SingleOutputStreamOperator<Tuple2<String, Long>> input = env.addSource(source)
.map(x -> {
// Eg: 123;2019-11-29T16:03:44+01:00
String[] split = x.split(";");
// Parse with OffsetDateTime so the offset in the message is honored
// (LocalDateTime would silently drop it and fall back to the system zone).
long timestamp = OffsetDateTime.parse(split[1], DateTimeFormatter.ISO_OFFSET_DATE_TIME).toInstant().toEpochMilli();
return new Tuple2<>(split[0], timestamp);
});
// Assign timestamp
input = input.assignTimestampsAndWatermarks(
new BoundedOutOfOrdernessTimestampExtractor<Tuple2<String, Long>>(Time.milliseconds(100)) {
@Override
public long extractTimestamp(Tuple2<String, Long> element) {
return element.f1;
}
});
input.print("Received");
// Count the inputs per key over the last minute, sliding by 10 s
SingleOutputStreamOperator<Tuple2<String, Integer>> count = input
.map(x -> new Tuple2<>(x.f0, 1))
.keyBy(0)
.timeWindow(Time.minutes(1), Time.seconds(10))
.sum(1);
count.print("Count");
// Connect the input and the count
SingleOutputStreamOperator<Tuple3<String, Long, Integer>> inputWithCount = input
.keyBy(0)
.connect(count.keyBy(0))
.process(
new CoProcessFunction<Tuple2<String, Long>, Tuple2<String, Integer>, Tuple3<String, Long, Integer>>() {
private ValueState<Integer> countCache;
@Override
public void open(Configuration parameters) throws Exception {
ValueStateDescriptor<Integer> desc = new ValueStateDescriptor<>("count", Integer.class);
countCache = getRuntimeContext().getState(desc);
}
@Override
public void processElement1(Tuple2<String, Long> value, Context ctx, Collector<Tuple3<String, Long, Integer>> out) throws Exception {
Integer cached = countCache.value();
if (cached == null) {
cached = 0;
}
out.collect(new Tuple3<>(value.f0, value.f1, cached));
}
@Override
public void processElement2(Tuple2<String, Integer> value, Context ctx, Collector<Tuple3<String, Long, Integer>> out) throws Exception {
countCache.update(value.f1);
}
});
inputWithCount.print("Output");
env.execute("Test");
// I did not include the import, and I pretty-print the Map function for clarity
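Now, when I send 2 lines, wait 20 seconds, and send another line, I expect the first 2 inputs to have a count of 0 and the third input to have a count of 2. My first expectation holds, but the second does not.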
# Start server:
ncat -lk --broker 9092
# Check what's received:
nc localhost 9092
# I run the Flink app, and use the following command
echo "123;$(date -Iseconds)" | nc 0.0.0.0 9092 ; \
echo "123;$(date -Iseconds)" | nc 0.0.0.0 9092 ; \
sleep 20s ; \
echo "123;$(date -Iseconds)" | nc 0.0.0.0 9092
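I expected the count to have been processed before the 3rd element is output. Did I misunderstand event time, or is something wrong with my code?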
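Answer 0 (score: 2)
(This follows up on David Anderson's explanation and offers an additional solution; please read his post first.)
If your example is close to your real data (long gaps between events), you also have the option of introducing some kind of idle timeout. For some use cases, this is also the recommended way to deal with empty Kafka partitions.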
public static class BoundedOutOfOrdernessWithTimeoutTimestampExtractor
implements AssignerWithPeriodicWatermarks<FakeKafkaRecord> {
private static final long serialVersionUID = 1L;
private final long maxOutOfOrderness;
private final long idle;
private long recordTimestamp;
BoundedOutOfOrdernessWithTimeoutTimestampExtractor(Time maxOutOfOrderness, Time idle) {
this.maxOutOfOrderness = maxOutOfOrderness.toMilliseconds();
this.idle = idle.toMilliseconds();
}
@Nullable
@Override
public Watermark getCurrentWatermark() {
return new Watermark(Math.max(recordTimestamp - maxOutOfOrderness, System.currentTimeMillis() - idle));
}
@Override
public long extractTimestamp(FakeKafkaRecord record, long previousElementTimestamp) {
return recordTimestamp = record.getTimestamp();
}
}
The timestamp assigner is queried according to your watermark interval.
env.getConfig().setAutoWatermarkInterval(100);
If BoundedOutOfOrdernessWithTimeoutTimestampExtractor receives no events for the idle period, it advances the watermark accordingly. You probably want to set idle to the same value as maxOutOfOrderness (100 ms).
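The watermark rule used by the extractor above can be tried out without Flink. The sketch below copies only the max(...) formula into a plain Java class; the class name and the injected clock are assumptions added so the behavior can be checked without sleeping. Note that the rule mixes event time with wall-clock time, so it presumably only makes sense when event timestamps are close to the current time.

```java
import java.util.function.LongSupplier;

final class IdleTimeoutWatermark {
    private final long maxOutOfOrdernessMs;
    private final long idleMs;
    private final LongSupplier clock; // stands in for System.currentTimeMillis()
    private long lastTimestamp;

    IdleTimeoutWatermark(long maxOutOfOrdernessMs, long idleMs, LongSupplier clock) {
        this.maxOutOfOrdernessMs = maxOutOfOrdernessMs;
        this.idleMs = idleMs;
        this.clock = clock;
    }

    /** Remember the timestamp of the latest event, as extractTimestamp does. */
    void onEvent(long timestamp) {
        lastTimestamp = timestamp;
    }

    /** Same formula as getCurrentWatermark in the extractor above. */
    long currentWatermark() {
        return Math.max(lastTimestamp - maxOutOfOrdernessMs, clock.getAsLong() - idleMs);
    }

    public static void main(String[] args) {
        long[] now = {1_575_640_032_000L};
        IdleTimeoutWatermark wm = new IdleTimeoutWatermark(100, 100, () -> now[0]);

        wm.onEvent(now[0]);
        System.out.println(wm.currentWatermark()); // event just arrived: now - 100 ms

        now[0] += 20_000; // 20 s pass with no event at all
        System.out.println(wm.currentWatermark()); // still advances with the clock
    }
}
```

With the plain bounded-out-of-orderness rule, the second value would be identical to the first, which is why the windows never fired while the source was idle.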
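Answer 1 (score: 1)
The problem is that you have done nothing to guarantee that the count has been processed before the 3rd element is emitted. In fact, it almost certainly will not have been.
The reason is that the current watermark cannot advance far enough to trigger the window until the 3rd event arrives. It does not matter that you waited 20 seconds of real time. What matters is that no events passed through the timestamp extractor during that time, so there was no basis for advancing the watermark.
Furthermore, BoundedOutOfOrdernessTimestampExtractor is a periodic watermark generator, and by default it only creates a new watermark every 200 ms. This means your 3rd event is almost certainly processed by the CoProcessFunction before the watermark that triggers the window has been generated.
If you switch to a punctuated watermark generator, you get more deterministic watermarking. However, the watermark will still follow the 3rd event, so this still will not produce the result you expect.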
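The argument can be checked with a little arithmetic. The sketch below is plain Java, not Flink code; the class and method names are hypothetical. It assumes three rules: a 1-minute window sliding by 10 seconds aligned to the epoch, a watermark equal to the largest timestamp seen minus 100 ms, and a window that fires once the watermark reaches its last timestamp (end minus 1 ms).

```java
import java.util.ArrayList;
import java.util.List;

final class WindowFiringSketch {
    static final long SIZE_MS = 60_000;
    static final long SLIDE_MS = 10_000;
    static final long MAX_OUT_OF_ORDERNESS_MS = 100;

    /** Exclusive end timestamps of all sliding windows containing ts, earliest first. */
    static List<Long> windowEnds(long ts) {
        long lastStart = ts - (ts % SLIDE_MS);
        List<Long> ends = new ArrayList<>();
        for (long start = lastStart - SIZE_MS + SLIDE_MS; start <= lastStart; start += SLIDE_MS) {
            ends.add(start + SIZE_MS);
        }
        return ends;
    }

    /** Watermark after the largest timestamp seen so far. */
    static long watermarkAfter(long maxSeenTimestamp) {
        return maxSeenTimestamp - MAX_OUT_OF_ORDERNESS_MS;
    }

    /** How many windows containing eventTs have fired at the given watermark. */
    static long firedWindows(long eventTs, long watermark) {
        return windowEnds(eventTs).stream().filter(end -> watermark >= end - 1).count();
    }

    public static void main(String[] args) {
        long first = 1_575_640_032_000L; // events 1 and 2
        long third = first + 20_000;     // event 3, sent 20 s later

        System.out.println("windows per event: " + windowEnds(first).size());
        System.out.println("fired after events 1 and 2: "
                + firedWindows(first, watermarkAfter(first)));
        System.out.println("fired after event 3: "
                + firedWindows(first, watermarkAfter(third)));
    }
}
```

Under these assumptions no window can fire until the 3rd event has moved the watermark, and at that point two windows fire at once. That matches the two Count> (123,2) lines that appear only after the later event in the follow-up output.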
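Answer 2 (score: 1)
Thank you, David and Arvid!
What I had not understood until now is that watermarks are generated when events enter the system (whereas "processing time" is automatic and follows the server clock), and that if the source goes idle, nothing more happens. This is exactly what is written in the doc, but I missed it.
In the following specific case, I get the output I expected: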
echo "123;$(date -Iseconds)" | nc 0.0.0.0 9092 ; \
echo "123;$(date -Iseconds)" | nc 0.0.0.0 9092 ; \
sleep 20s ; \
echo "456;$(date -Iseconds)" | nc 0.0.0.0 9092 ; \
sleep 1s ; \
echo "123;$(date -Iseconds)" | nc 0.0.0.0 9092
Received> (123,1575640032000)
Received> (123,1575640032000)
Output> (123,1575640032000,0)
Output> (123,1575640032000,0)
... # 20s later
Received> (456,1575640052000)
Output> (456,1575640052000,0)
Count> (123,2)
Count> (123,2)
... # 1s later
Received> (123,1575640053000)
Output> (123,1575640053000,2)
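I see that my output can vary depending on whether or not I receive other events. In my use case I expect continuous input, but I want the behavior to be stable.
With your watermark function, Arvid, I get the behavior I want, thank you. I am still not sure whether I can replay a batch of input. I would think so, but I am still shaky on watermarks and event time.
Since this is not a bundled feature, it makes me wonder whether I am using Flink the wrong way.
For reference, here is the code I ended up with.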
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
env.setParallelism(1);
// Input
SocketTextStreamFunction source = new SocketTextStreamFunction("localhost", 9092, "\n", 0);
SingleOutputStreamOperator<Tuple2<String, Long>> input = env.addSource(source)
.map(new MapFunction<String, Tuple2<String, Long>>() {
@Override
public Tuple2<String, Long> map(String value) throws Exception {
// Eg: 123;2019-11-29T16:03:44+01:00
String[] split = value.split(";");
// Parse with OffsetDateTime so the offset in the message is honored
// (LocalDateTime would silently drop it and fall back to the system zone).
long timestamp = OffsetDateTime.parse(split[1], DateTimeFormatter.ISO_OFFSET_DATE_TIME).toInstant().toEpochMilli();
return new Tuple2<>(split[0], timestamp);
}
});
// Assign timestamp
input = input.assignTimestampsAndWatermarks(new BoundedOutOfOrdernessWithTimeoutTimestampExtractor(Time.milliseconds(10), Time.milliseconds(10)));
input.print("Received");
// Count the inputs per key over the last minute, sliding by 10 s
SingleOutputStreamOperator<Tuple2<String, Integer>> count = input
.map(new MapFunction<Tuple2<String, Long>, Tuple2<String, Integer>>() {
@Override
public Tuple2<String, Integer> map(Tuple2<String, Long> x) throws Exception {
return new Tuple2<>(x.f0, 1);
}
})
.keyBy(0)
.timeWindow(Time.minutes(1), Time.seconds(10))
.sum(1);
count.print("Count");
// Connect the input and the count
SingleOutputStreamOperator<Tuple3<String, Long, Integer>> inputWithCount = input
.keyBy(0)
.connect(count.keyBy(0))
.process(
new CoProcessFunction<Tuple2<String, Long>, Tuple2<String, Integer>, Tuple3<String, Long, Integer>>() {
private ValueState<Integer> countCache;
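// Note: a plain field is shared by every key handled by this operator instance;
// keyed state (e.g. a ValueState<Long>) would be needed to track one timer per key.
private long previousCountTimestamp;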
@Override
public void open(Configuration parameters) throws Exception {
ValueStateDescriptor<Integer> desc = new ValueStateDescriptor<>("count", Integer.class);
countCache = getRuntimeContext().getState(desc);
}
@Override
public void processElement1(Tuple2<String, Long> input, Context ctx,
Collector<Tuple3<String, Long, Integer>> out) throws Exception {
Integer cached = countCache.value();
if (cached == null) {
cached = 0;
}
out.collect(new Tuple3<>(input.f0, input.f1, cached));
}
@Override
public void processElement2(Tuple2<String, Integer> count, Context ctx,
Collector<Tuple3<String, Long, Integer>> out) throws Exception {
countCache.update(count.f1);
ctx.timerService().deleteEventTimeTimer(previousCountTimestamp);
previousCountTimestamp = ctx.timestamp() + Time.minutes(1).toMilliseconds();
ctx.timerService().registerEventTimeTimer(previousCountTimestamp);
}
@Override
public void onTimer(long timestamp, OnTimerContext ctx, Collector<Tuple3<String, Long, Integer>> out)
throws Exception {
System.out.println("Cache expires");
countCache.clear();
}
});
inputWithCount.print("Output");
env.execute("Test");
As a side note, I had to make the cache expire.
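The expiry idea can be sketched outside Flink as a small in-memory model. Everything below is hypothetical: the class, its method names, and the choice of one expiry deadline per key. The per-key deadline is an assumption that differs from the code above, where previousCountTimestamp is a single field shared by all keys of the operator instance. The model also assumes a timer fires once the watermark reaches its timestamp.

```java
import java.util.HashMap;
import java.util.Map;

final class ExpiringCountCache {
    private static final long TTL_MS = 60_000;

    private final Map<String, Integer> counts = new HashMap<>();
    private final Map<String, Long> expiry = new HashMap<>(); // one pending "timer" per key

    /** A new count arrived for this key: cache it and move the key's expiry forward. */
    void updateCount(String key, int count, long eventTimestamp) {
        counts.put(key, count);
        expiry.put(key, eventTimestamp + TTL_MS); // replaces any earlier deadline
    }

    /** Fire every timer whose timestamp the watermark has reached. */
    void advanceWatermark(long watermark) {
        expiry.entrySet().removeIf(entry -> {
            if (entry.getValue() <= watermark) {
                counts.remove(entry.getKey()); // "Cache expires"
                return true;
            }
            return false;
        });
    }

    /** Cached count for the key, or 0 when nothing is cached. */
    int lookup(String key) {
        return counts.getOrDefault(key, 0);
    }

    public static void main(String[] args) {
        ExpiringCountCache cache = new ExpiringCountCache();
        cache.updateCount("123", 2, 1_000);
        cache.updateCount("456", 1, 30_000);

        cache.advanceWatermark(61_000); // deadline of key 123 reached
        System.out.println(cache.lookup("123")); // expired, falls back to 0
        System.out.println(cache.lookup("456")); // still cached until 90_000
    }
}
```

Keeping the deadline per key means a count update for one key cannot delete or postpone the expiry of another key.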
Output:
echo "123;$(date -Iseconds)" | nc 0.0.0.0 9092 ; \
echo "123;$(date -Iseconds)" | nc 0.0.0.0 9092 ; \
sleep 20s ; \
echo "456;$(date -Iseconds)" | nc 0.0.0.0 9092 ; \
sleep 1s ; \
echo "123;$(date -Iseconds)" | nc 0.0.0.0 9092
Received> (123,1575641582000)
Received> (123,1575641582000)
Output> (123,1575641582000,0)
Output> (123,1575641582000,0)
... # few s later
Count> (123,2)
... # 10s later
Count> (123,2)
... # few s later
Received> (456,1575641602000)
Output> (456,1575641602000,0)
Received> (123,1575641603000)
Output> (123,1575641603000,2)
... # few s later
Count> (123,3)
Count> (456,1)
... # 10s later
Count> (456,1)
Count> (123,3)
... # 10s later
Count> (456,1)
Count> (123,3)
... # 10s later
Count> (123,3)
Count> (456,1)
...