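I am implementing the connected components algorithm with Flink's DataStream API, since it has not yet been implemented with this API.
For this algorithm, I split the data into tumbling windows and try to compute the algorithm independently for each window.
My problem comes from the iterative nature of the algorithm. I implemented the iteration step pipeline I want (the step data pipeline); it consists of FlatMaps, 1 Join, 1 ProcessWindow and 1 Filter. However, it seems that the stream I want to feed back is not actually fed back to the beginning of the loop, because the algorithm does not iterate. I suspect that this is not possible when the original iterative data stream is joined with another stream (even if the latter is created by a flatMap of the former).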
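To make the intended per-window computation concrete, here is a minimal plain-Java sketch (no Flink involved) of the min-label propagation I am trying to express: every vertex sends its label to itself and to its neighbors, each vertex keeps the smallest label it receives, and the step repeats until no label changes. The class name `LabelPropagationSketch`, the method `propagate`, the sample graph, and the assumption that every vertex starts with its own id as its label are illustrative only and are not part of my Flink job.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical reference sketch: min-label propagation on one window's graph.
class LabelPropagationSketch {

    /** Repeats the propagation step until no label changes; returns vertex -> label. */
    static Map<Integer, Integer> propagate(Map<Integer, List<Integer>> neighbors) {
        // Assumption: each vertex initially carries its own id as label.
        Map<Integer, Integer> labels = new HashMap<>();
        for (Integer vertex : neighbors.keySet()) {
            labels.put(vertex, vertex);
        }

        boolean converged = false;
        while (!converged) {
            // Step 1: each vertex emits its label to itself and to its neighbors;
            // for every target vertex only the minimum label is kept.
            Map<Integer, Integer> minLabel = new HashMap<>();
            for (Map.Entry<Integer, List<Integer>> entry : neighbors.entrySet()) {
                int label = labels.get(entry.getKey());
                minLabel.merge(entry.getKey(), label, Integer::min);
                for (Integer neighbor : entry.getValue()) {
                    minLabel.merge(neighbor, label, Integer::min);
                }
            }

            // Step 2: update each vertex with the reduced label and track
            // whether anything changed in this iteration.
            converged = true;
            for (Integer vertex : neighbors.keySet()) {
                int candidate = minLabel.get(vertex);
                if (candidate < labels.get(vertex)) {
                    labels.put(vertex, candidate);
                    converged = false;
                }
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        // Sample graph with two components: {1, 2, 3} and {4, 5}.
        Map<Integer, List<Integer>> graph = new HashMap<>();
        graph.put(1, Arrays.asList(2));
        graph.put(2, Arrays.asList(1, 3));
        graph.put(3, Arrays.asList(2));
        graph.put(4, Arrays.asList(5));
        graph.put(5, Arrays.asList(4));

        // TreeMap only to print the result in a stable key order.
        System.out.println(new TreeMap<>(propagate(graph)));
    }
}
```

In the Flink job, the `while` loop is what the feedback edge created by `closeWith` is supposed to provide, and the `converged` flag plays the role of the fourth tuple field.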
The code I am using is as follows:
//neigborsList = DataStream of <Vertex, [List of neighbors], label>
IterativeStream<Tuple3<Integer, ArrayList<Integer>, Integer>> beginning_loop = neigborsList.iterate(maxTimeout);

//Emits tuples Vertices and Labels for every vertex and its neighbors
DataStream<Tuple2<Integer, Integer>> labels = beginning_loop
        //DataStream of <Vertex, label> for every neigborsList.f0 and element in neigborsList.f1
        .flatMap(new EmitVertexLabel())
        .keyBy(0)
        .window(TumblingEventTimeWindows.of(Time.milliseconds(windowSize)))
        .minBy(1);

DataStream<Tuple4<Integer, ArrayList<Integer>, Integer, Integer>> updatedVertex = beginning_loop
        //Update vertex label with the results from the labels reduction
        .join(labels)
        .where("vertex")
        .equalTo("vertex")
        .window(TumblingEventTimeWindows.of(Time.milliseconds(windowSize)))
        .apply(new JoinFunction<Tuple3<Integer, ArrayList<Integer>, Integer>, Tuple2<Integer, Integer>, Tuple4<Integer, ArrayList<Integer>, Integer, Integer>>() {
            @Override
            public Tuple4<Integer, ArrayList<Integer>, Integer, Integer> join(
                    Tuple3<Integer, ArrayList<Integer>, Integer> arg0, Tuple2<Integer, Integer> arg1)
                    throws Exception {
                int hasConverged = 1;
                if (arg1.f1.intValue() < arg0.f2.intValue()) {
                    arg0.f2 = arg1.f1;
                    hasConverged = 0;
                }
                return new Tuple4<>(arg0.f0, arg0.f1, arg0.f2, new Integer(hasConverged));
            }
        })
        //Disseminates the convergence flag if a change was made in the window
        .windowAll(TumblingEventTimeWindows.of(Time.milliseconds(windowSize)))
        .process(new ProcessAllWindowFunction<Tuple4<Integer, ArrayList<Integer>, Integer, Integer>, Tuple4<Integer, ArrayList<Integer>, Integer, Integer>, TimeWindow>() {
            @Override
            public void process(
                    ProcessAllWindowFunction<Tuple4<Integer, ArrayList<Integer>, Integer, Integer>, Tuple4<Integer, ArrayList<Integer>, Integer, Integer>, TimeWindow>.Context ctx,
                    Iterable<Tuple4<Integer, ArrayList<Integer>, Integer, Integer>> values,
                    Collector<Tuple4<Integer, ArrayList<Integer>, Integer, Integer>> out) throws Exception {
                Iterator<Tuple4<Integer, ArrayList<Integer>, Integer, Integer>> iterator = values.iterator();
                Tuple4<Integer, ArrayList<Integer>, Integer, Integer> element;
                int hasConverged = 1;
                while (iterator.hasNext()) {
                    element = iterator.next();
                    if (element.f3.intValue() > 0) {
                        hasConverged = 0;
                        break;
                    }
                }
                //Iterate again and emit the values on the correct output
                iterator = values.iterator();
                Integer converged = new Integer(hasConverged);
                while (iterator.hasNext()) {
                    element = iterator.next();
                    element.f3 = converged;
                    out.collect(element);
                }
            }
        });

DataStream<Tuple3<Integer, ArrayList<Integer>, Integer>> feed_back = updatedVertex
        .filter(new NotConvergedFilter())
        //Remove the finished convergence flag
        //Transforms the Tuples4 to Tuples3 so that it becomes compatible with beginning_loop
        .map(new RemoveConvergeceFlag());

beginning_loop.closeWith(feed_back);

//Selects the windows that have already converged
DataStream<?> convergedWindows = updatedVertex
        .filter(new ConvergedFilter());

convergedWindows.print()
        .setParallelism(1)
        .name("Sink to stdout");
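At the end of the execution, convergedWindows has not received any tuple (since the algorithm could only converge within a single iteration). If I print beginning_loop, I see the initial tuples and the tuples produced by feed_back after the first iteration, but nothing beyond that.
So, to summarize my question: could this be a limitation of Flink? If so, do you know of another way to update the vertex labels after the initial reduction, one that is not based on a join?
PS: I am using Flink 1.3.3.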