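I need to find the counts of certain words as the difference from the previous batch.
For example, I have to compute word counts from streaming data.
The words to count are (test, jack).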
If my first batch contains the words (test 5),(kite 2),(jack 3),(pen 5)
and the second batch contains (test 2),(jack 1),(kite 1),(pen 1),
the output for the second batch should be:
(test 3),(jack 2) //test(5-2) and jack(3-1)
..........
JavaPairRDD<String, Integer> initialRDD = jsc.sparkContext().parallelizePairs(tuples1);
JavaPairDStream<String, Integer> joinedDstream = pairstream.transformToPair(new Function<JavaPairRDD<String, Integer>, JavaPairRDD<String, Integer>>() {
    @Override
    public JavaPairRDD<String, Integer> call(JavaPairRDD<String, Integer> v1) throws Exception {
        // Join the current batch with the previous counts; each value is (current, previous).
        JavaPairRDD<String, Integer> modRDD = v1.join(initialRDD).mapToPair(new PairFunction<Tuple2<String, Tuple2<Integer, Integer>>, String, Integer>() {
            @Override
            public Tuple2<String, Integer> call(Tuple2<String, Tuple2<Integer, Integer>> stringTuple2Tuple2) throws Exception {
                // Previous count minus current count.
                return new Tuple2<>(stringTuple2Tuple2._1(), stringTuple2Tuple2._2()._2() - stringTuple2Tuple2._2()._1());
            }
        });
        // How can I do this? Anonymous classes and lambda expressions only allow captured local variables to be final.
        initialRDD = modRDD;
        return modRDD;
    }
});
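One common way around the final-variable restriction is to capture a final holder object and change what it holds, rather than reassigning the variable itself. Below is a minimal sketch of that idea in plain Java, without Spark, so it can be run on its own. Each batch is modelled as a `Map<String, Integer>`; the class `BatchDiffSketch`, the methods `diff` and `run`, and the `WORDS` filter are hypothetical names introduced only for this illustration. As in the code above, the result of each batch becomes the new "previous" value (mirroring `initialRDD = modRDD`); that choice is an assumption about the intended behaviour.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Function;

class BatchDiffSketch {
    // Words of interest, taken from the question (assumed to be a fixed set).
    static final Set<String> WORDS = Set.of("test", "jack");

    // Previous count minus current count, for tracked words present in both
    // maps (the same inner-join semantics as v1.join(initialRDD)).
    static Map<String, Integer> diff(Map<String, Integer> previous, Map<String, Integer> current) {
        Map<String, Integer> out = new TreeMap<>();
        for (Map.Entry<String, Integer> e : current.entrySet()) {
            Integer prev = previous.get(e.getKey());
            if (prev != null && WORDS.contains(e.getKey())) {
                out.put(e.getKey(), prev - e.getValue());
            }
        }
        return out;
    }

    static List<Map<String, Integer>> run(Map<String, Integer> initial, List<Map<String, Integer>> batches) {
        // The holder is final, so the lambda may capture it; only its content changes.
        final AtomicReference<Map<String, Integer>> previous = new AtomicReference<>(initial);

        Function<Map<String, Integer>, Map<String, Integer>> perBatch = current -> {
            Map<String, Integer> result = diff(previous.get(), current);
            // Mirrors "initialRDD = modRDD". Use previous.set(current) instead
            // if each batch should be compared with the raw previous batch.
            previous.set(result);
            return result;
        };

        List<Map<String, Integer>> outputs = new ArrayList<>();
        for (Map<String, Integer> batch : batches) {
            outputs.add(perBatch.apply(batch));
        }
        return outputs;
    }

    public static void main(String[] args) {
        Map<String, Integer> first = Map.of("test", 5, "kite", 2, "jack", 3, "pen", 5);
        Map<String, Integer> second = Map.of("test", 2, "jack", 1, "kite", 1, "pen", 1);
        System.out.println(run(first, List.of(second))); // prints [{jack=2, test=3}]
    }
}
```

In the Spark code the same pattern would mean wrapping `initialRDD` in a final `AtomicReference<JavaPairRDD<String, Integer>>` and calling `get()` and `set()` inside `call`. That relies on the function passed to `transformToPair` being evaluated on the driver for each batch, and the held state would likely not survive checkpoint recovery. Spark Streaming also offers `updateStateByKey` and `mapWithState` on `JavaPairDStream` for carrying per-key state across batches, which may be a better fit.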