We read data from Kafka. Each message can be simplified to a Tuple2, where the String is the key and the Integer is the type (which can be 1, 2, or 3), e.g.
('key001', 1)
('key001', 2)
('key001', 3)
('key001', 3)
('key002', 1)
('key002', 2)
('key003', 1)
('key004', 1)
We want to collect some statistics over a 10-minute period.
I have tried the code below and it seems to work, but it feels convoluted to me. Is this the right approach?
I had to use two time windows here, because a single time window did not give me what I wanted, and I am still not clear about how this actually works. Can anyone explain what happens when multiple time windows are applied to a stream?
SingleOutputStreamOperator<Tuple2<String, Long>> x = ds.keyBy(0)
        .timeWindow(Time.seconds(600))
        .process(new ProcessWindowFunction<Tuple2<String, Integer>, Tuple2<String, Long>, Tuple, TimeWindow>() {
            @Override
            public void process(Tuple key,
                    ProcessWindowFunction<Tuple2<String, Integer>, Tuple2<String, Long>, Tuple, TimeWindow>.Context ctx,
                    Iterable<Tuple2<String, Integer>> elements,
                    Collector<Tuple2<String, Long>> out) throws Exception {
                // Record which types appeared at least once for this key in this window
                boolean hasType1 = false;
                boolean hasType2 = false;
                boolean hasType3 = false;
                for (Tuple2<String, Integer> t2 : elements) {
                    if (t2.f1 == 1) {
                        hasType1 = true;
                    } else if (t2.f1 == 2) {
                        hasType2 = true;
                    } else if (t2.f1 == 3) {
                        hasType3 = true;
                    }
                    // All three types seen; no need to scan the rest of the window
                    if (hasType1 && hasType2 && hasType3) {
                        break;
                    }
                }
                // Emit one marker record per satisfied (cumulative) condition
                if (hasType1) {
                    out.collect(new Tuple2<>("hasType1", 1L));
                    if (hasType2) {
                        out.collect(new Tuple2<>("hasType1_Type2", 1L));
                        if (hasType3) {
                            out.collect(new Tuple2<>("hasType1_Type2_Type3", 1L));
                        }
                    }
                }
            }
        });
// Second window: count the markers per label over the same 10 minutes and write to HDFS
x.keyBy(0)
        .timeWindow(Time.seconds(600))
        .sum(1)
        .map(new MapFunction<Tuple2<String, Long>, String>() {
            @Override
            public String map(Tuple2<String, Long> value) throws Exception {
                return value.f0 + " = " + value.f1;
            }
        })
        .addSink(new BucketingSink<String>("hdfs://..."))
        .setParallelism(1);
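For what it's worth, the per-window emit logic of the first `ProcessWindowFunction` can be isolated into a plain method and checked without Flink. This is only a sketch for reasoning about the question; the class name `TypeFlags` and the method `emittedLabels` are made up and are not part of the job above.

```java
import java.util.ArrayList;
import java.util.List;

public class TypeFlags {
    // Mirrors the window function body: scan the types seen for one key
    // in one window and return the cumulative labels that would be emitted.
    public static List<String> emittedLabels(List<Integer> types) {
        boolean hasType1 = false, hasType2 = false, hasType3 = false;
        for (int t : types) {
            if (t == 1) hasType1 = true;
            else if (t == 2) hasType2 = true;
            else if (t == 3) hasType3 = true;
            // All three types seen; stop scanning early
            if (hasType1 && hasType2 && hasType3) break;
        }
        List<String> out = new ArrayList<>();
        if (hasType1) {
            out.add("hasType1");
            if (hasType2) {
                out.add("hasType1_Type2");
                if (hasType3) out.add("hasType1_Type2_Type3");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // key001's window in the sample data contains types 1, 2, 3, 3
        System.out.println(emittedLabels(List.of(1, 2, 3, 3)));
        // key003's window contains only type 1
        System.out.println(emittedLabels(List.of(1)));
    }
}
```

Note that because the labels are nested, a key that has type 2 or 3 but never type 1 contributes nothing to the output.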