I am trying to add the max and min values to every tuple of each RDD in a Spark DStream. I wrote the code below, but I cannot figure out how to pass the min and max parameters along. Can anyone suggest a way to do this transformation? I tried the following:
JavaPairDStream<Tuple2<Long, Integer>, Tuple3<Integer,Long,Long>> sortedtsStream = transformedMaxMintsStream.transformToPair(new Sort2());
class MinMax implements Function<JavaPairRDD<Tuple2<Long, Integer>, Integer>, JavaPairRDD<Tuple2<Long, Integer>, Tuple3<Integer, Long, Long>>> {
    Long max;
    Long min;

    @Override
    public JavaPairRDD<Tuple2<Long, Integer>, Tuple3<Integer, Long, Long>> call(JavaPairRDD<Tuple2<Long, Integer>, Integer> input) throws Exception {
        JavaPairRDD<Tuple2<Long, Integer>, Tuple3<Integer, Long, Long>> output;
        max = input.max(new CMP1())._1._1;
        min = input.min(new CMP1())._1._1;
        output = input.mapToPair(new maptoMinMax());
        return output;
    }

    class maptoMinMax implements PairFunction<Tuple2<Tuple2<Long, Integer>, Integer>, Tuple2<Long, Integer>, Tuple3<Integer, Long, Long>> {
        @Override
        public Tuple2<Tuple2<Long, Integer>, Tuple3<Integer, Long, Long>> call(Tuple2<Tuple2<Long, Integer>, Integer> tuple2IntegerTuple2) throws Exception {
            return new Tuple2<Tuple2<Long, Integer>, Tuple3<Integer, Long, Long>>(new Tuple2<Long, Integer>(tuple2IntegerTuple2._1._1, tuple2IntegerTuple2._1._2), new Tuple3<Integer, Long, Long>(tuple2IntegerTuple2._2, max, min));
        }
    }
}
I get the following error; essentially it seems the min and max functions cannot be found on JavaPairRDD:
15/06/18 11:05:06 INFO BlockManagerInfo: Added input-0-1434639906000 in memory on localhost:42829 (size: 464.0 KB, free: 264.9 MB)
15/06/18 11:05:06 INFO BlockGenerator: Pushed block input-0-1434639906000
Exception in thread "JobGenerator" java.lang.NoSuchMethodError: org.apache.spark.api.java.JavaPairRDD.max(Ljava/util/Comparator;)Lscala/Tuple2;
at org.necla.ngla.spark_streaming.MinMax.call(Type4ViolationChecker.java:346)
at org.necla.ngla.spark_streaming.MinMax.call(Type4ViolationChecker.java:340)
at org.apache.spark.streaming.api.java.JavaDStreamLike$class.scalaTransform$3(JavaDStreamLike.scala:360)
at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$transformToPair$1.apply(JavaDStreamLike.scala:361)
at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$transformToPair$1.apply(JavaDStreamLike.scala:361)
at org.apache.spark.streaming.dstream.DStream$$anonfun$transform$1$$anonfun$apply$21.apply(DStream.scala:654)
at org.apache.spark.streaming.dstream.DStream$$anonfun$transform$1$$anonfun$apply$21.apply(DStream.scala:654)
at org.apache.spark.streaming.dstream.DStream$$anonfun$transform$2$$anonfun$5.apply(DStream.scala:668)
at org.apache.spark.streaming.dstream.DStream$$anonfun$transform$2$$anonfun$5.apply(DStream.scala:666)
at org.apache.spark.streaming.dstream.TransformedDStream.compute(TransformedDStream.scala:41)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStrea
Answer 0 (score: 1):
We can use transform on the DStream to apply several operations to the same underlying RDD and obtain the result for each batch interval. We then add this result to every tuple (per the question's spec):
data.transform { rdd =>
  // Pair each element with itself, then reduce in a single pass,
  // keeping the running min on the left and the running max on the right.
  val mx = rdd.map(x => (x, x)).reduce { case ((x1, x2), (y1, y2)) => ((x1 min y1), (x2 max y2)) }
  // Attach the batch-wide (min, max) pair to every element of the RDD.
  rdd.map(elem => (elem, mx))
}
For each batch interval, this produces an RDD like the following (over random numbers between 1 and 999):
(258,(0,998)) (591,(0,998)) ...
The Java version is semantically identical, just more verbose because of all those Tuple<...> objects.
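For reference, a minimal Java sketch of the same pattern. The class name MinMaxExample, the method name withMinMax, and the Integer element type are all illustrative, not taken from the original code:

import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import scala.Tuple2;

public class MinMaxExample {
    // Tag each element of the stream with the (min, max) of its batch.
    static JavaPairDStream<Integer, Tuple2<Integer, Integer>> withMinMax(JavaDStream<Integer> data) {
        return data.transformToPair(rdd -> {
            // reduce is an action: it runs once per batch inside the transform
            // function, so mx becomes a plain local value that the closure
            // below simply captures -- no mutable fields needed.
            // (Assumes a non-empty batch; reduce throws on an empty RDD.)
            Tuple2<Integer, Integer> mx = rdd
                .map(x -> new Tuple2<Integer, Integer>(x, x))
                .reduce((a, b) -> new Tuple2<Integer, Integer>(
                    Math.min(a._1(), b._1()), Math.max(a._2(), b._2())));
            // Attach the batch-wide (min, max) pair to every element.
            return rdd.mapToPair(elem ->
                new Tuple2<Integer, Tuple2<Integer, Integer>>(elem, mx));
        });
    }
}

Because the (min, max) pair is computed on the driver for each batch and then captured by the per-element closure, this also sidesteps the "how do I pass min and max" problem that the mutable fields in the question's MinMax class ran into.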