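I've run into a kind of problem that does not arise in batch processing but seems to be non-trivial in the streaming case. Consider the classic word count example: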
lines
  .flatMap(_.split("\\W+"))
  .map(word => (word, 1))
  .keyBy(0)
  .sum(1)
This prints a result for every word in the stream, for example:
input: "foo bar baz foo"
output: (foo, 1) (bar, 1) (baz, 1) (foo, 2)
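To make that per-record behaviour concrete, here is a minimal plain-Java sketch (no Flink involved) that emulates a keyed running sum. The class and method names (`RollingWordCount`, `rollingCounts`) are hypothetical and exist only for this illustration. It only mimics the observable output, not how Flink actually manages keyed state.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java emulation of a keyed running sum: one output per input record.
// Names are hypothetical; this is not Flink code.
class RollingWordCount {

    static List<String> rollingCounts(String line) {
        // Running total per word, standing in for per-key state.
        Map<String, Integer> totals = new HashMap<>();
        List<String> emitted = new ArrayList<>();
        for (String word : line.split("\\W+")) {
            if (word.isEmpty()) {
                continue; // split can yield an empty leading token
            }
            // merge returns the updated total, which is emitted immediately.
            int updated = totals.merge(word, 1, Integer::sum);
            emitted.add("(" + word + ", " + updated + ")");
        }
        return emitted;
    }

    public static void main(String[] args) {
        // prints: (foo, 1) (bar, 1) (baz, 1) (foo, 2)
        System.out.println(String.join(" ", rollingCounts("foo bar baz foo")));
    }
}
```

Every record updates the total and is emitted right away, which is why `foo` shows up twice with different counts.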
What I want instead is to process each line as a whole and only then print the result, i.e. to use a window over each line:
input: "foo bar baz foo"
output: (foo, 2) (bar, 1) (baz, 1)
Obviously, neither time-based nor count-based windows fit here. What is the right way to solve this?
Answer 0: (score: 0)
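Processing words and lines in parallel is not possible even in batch mode, because Flink does not support nested groupBy (or keyBy). However, if what you want is a streaming version of the following batch word count: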
lines
  .flatMap(line => (lineId,word,1))
  .groupBy(0)
  .reduceGroup {aggregateWords}
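where aggregateWords iterates over the words for that particular key and counts them, then you can implement it as follows: for each line, emit its words followed by a special record at the end, then use a GlobalWindow with a custom trigger that fires as soon as the special record arrives.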
A streaming version of the batch job above might look like this:
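import java.util.HashMap;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.FoldFunction;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.GlobalWindows;
import org.apache.flink.streaming.api.windowing.triggers.Trigger;
import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow;
import org.apache.flink.util.Collector;

// The wrapper class name is arbitrary.
public class LineWordCount {
    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromElements("foo bar baz foo", "yes no no yes", "hi hello hi hello")
            .flatMap(new FlatMapFunction<String, Tuple3<Double, String, Integer>>() {
                @Override
                public void flatMap(String s, Collector<Tuple3<Double, String, Integer>> collector) throws Exception {
                    String[] words = s.split("\\W+");
                    // Random id that ties all words of one line to the same key.
                    Double lineId = Math.random();
                    for (String w : words) {
                        collector.collect(Tuple3.of(lineId, w, 1));
                    }
                    // Special end-of-line record that tells the trigger to fire.
                    collector.collect(Tuple3.of(lineId, "\n", 1));
                }
            })
            .keyBy(0)
            .window(GlobalWindows.create())
            .trigger(new Trigger<Tuple3<Double, String, Integer>, GlobalWindow>() {
                @Override
                public TriggerResult onElement(Tuple3<Double, String, Integer> element, long l, GlobalWindow globalWindow, TriggerContext triggerContext) throws Exception {
                    if (element.f1.equals("\n")) {
                        // A GlobalWindow never ends on its own, so purge the state
                        // of a finished line instead of keeping it with a plain FIRE.
                        return TriggerResult.FIRE_AND_PURGE;
                    }
                    return TriggerResult.CONTINUE;
                }
                @Override
                public TriggerResult onProcessingTime(long l, GlobalWindow globalWindow, TriggerContext triggerContext) throws Exception {
                    return TriggerResult.CONTINUE;
                }
                @Override
                public TriggerResult onEventTime(long l, GlobalWindow globalWindow, TriggerContext triggerContext) throws Exception {
                    return TriggerResult.CONTINUE;
                }
                @Override
                public void clear(GlobalWindow globalWindow, TriggerContext triggerContext) throws Exception {
                    // This trigger keeps no state of its own, so there is nothing to clean up.
                }
            })
            .fold(new HashMap<>(), new FoldFunction<Tuple3<Double, String, Integer>, HashMap<String, Integer>>() {
                @Override
                public HashMap<String, Integer> fold(HashMap<String, Integer> hashMap, Tuple3<Double, String, Integer> tuple3) throws Exception {
                    // Count every word of the line, skipping the end-of-line marker.
                    if (!tuple3.f1.equals("\n")) {
                        hashMap.put(tuple3.f1, hashMap.getOrDefault(tuple3.f1, 0) + 1);
                    }
                    return hashMap;
                }
            }).print();
        env.execute("Test");
    }
}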
Output:
{bar=1, foo=2, baz=1}
{no=2, yes=2}
{hi=2, hello=2}
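The marker-and-trigger idea can also be followed without a Flink cluster. The sketch below is a plain-Java emulation under stated assumptions, not Flink code: `EndMarkerWindowSketch`, `Rec`, `onElement` and `run` are hypothetical names, the map of open lines stands in for per-key window state, and sequential line ids replace `Math.random()` so that ids cannot collide.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java emulation of "buffer per line id, fire on the end marker".
// All names are hypothetical; this only mirrors the idea of the Flink job.
class EndMarkerWindowSketch {
    static final String END_OF_LINE = "\n"; // same marker value the answer uses

    // Stand-in for Tuple3<lineId, word, 1>.
    static final class Rec {
        final long lineId;
        final String word;
        Rec(long lineId, String word) { this.lineId = lineId; this.word = word; }
    }

    // Counts of lines whose marker has not arrived yet (the "window state").
    private final Map<Long, Map<String, Integer>> open = new HashMap<>();
    // Results emitted so far, one map per completed line.
    private final List<Map<String, Integer>> fired = new ArrayList<>();

    void onElement(Rec r) {
        if (END_OF_LINE.equals(r.word)) {
            // Marker seen: emit the line's counts and drop its state.
            Map<String, Integer> counts = open.remove(r.lineId);
            if (counts == null) {
                counts = new LinkedHashMap<>(); // line contained no words
            }
            fired.add(counts);
        } else {
            open.computeIfAbsent(r.lineId, id -> new LinkedHashMap<>())
                .merge(r.word, 1, Integer::sum);
        }
    }

    static List<Map<String, Integer>> run(String... lines) {
        EndMarkerWindowSketch sketch = new EndMarkerWindowSketch();
        long nextId = 0; // sequential ids, an assumption of this sketch
        for (String line : lines) {
            long id = nextId++;
            for (String w : line.split("\\W+")) {
                if (!w.isEmpty()) {
                    sketch.onElement(new Rec(id, w));
                }
            }
            sketch.onElement(new Rec(id, END_OF_LINE));
        }
        return sketch.fired;
    }

    public static void main(String[] args) {
        run("foo bar baz foo", "yes no no yes", "hi hello hi hello")
            .forEach(System.out::println);
    }
}
```

Because this sketch uses `LinkedHashMap`, words print in first-seen order (for example `{foo=2, bar=1, baz=1}`), whereas the `HashMap` in the Flink job gives no ordering guarantee. The counts per line are the same.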