I am using JavaPairRDD's mapPartitionsToPair function, as follows:
JavaPairRDD<MyKeyClass, MyValueClass> myRDD;
JavaPairRDD<Integer, Double> myResult = myRDD.mapPartitionsToPair(
    new PairFlatMapFunction<Iterator<Tuple2<MyKeyClass, MyValueClass>>, Integer, Double>() {
        public Iterable<Tuple2<Integer, Double>> call(Iterator<Tuple2<MyKeyClass, MyValueClass>> arg0) throws Exception {
            Tuple2<MyKeyClass, MyValueClass> temp = arg0.next(); // The error is coming here...
            TreeMap<Integer, Double> dic = new TreeMap<Integer, Double>();
            do {
                ........
                // Some code to compute newIntegerValue and newDoubleValue from temp
                ........
                dic.put(newIntegerValue, newDoubleValue);
                temp = arg0.next();
            } while (arg0.hasNext());
        }
    });
I can run this in Apache Spark pseudo-distributed mode, but I cannot run the same code on my cluster. I get the following error:
java.util.NoSuchElementException: next on empty iterator
at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:64)
at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
at scala.collection.convert.Wrappers$IteratorWrapper.next(Wrappers.scala:30)
at IncrementalGraph$6.call(MySparkJob.java:584)
at IncrementalGraph$6.call(MySparkJob.java:573)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$9$1.apply(JavaRDDLike.scala:186)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$9$1.apply(JavaRDDLike.scala:186)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
I am using Spark 1.2.0 over Hadoop 2.2.0.
Can anyone help me solve this problem?
Update: hasNext() returns true before next() is called on the iterator.

Answer 0 (score: 1)
I found the answer.
I had set myRDD's storage level to MEMORY_ONLY. Before the mapPartitionsToPair transformation, my code had the following line:
myRDD.persist(StorageLevel.MEMORY_ONLY());
I removed it, and that fixed the program.
I don't know why removing it fixed the problem. An explanation from anyone would be highly appreciated.
Answer 1 (score: 0)
Your code assumes that every iterator passed in has some elements, but that is not always the case. Some partitions can be empty (especially with small test datasets). A very common pattern is to check whether the iterator is empty at the start of your mapPartitions code and, if it is, simply return an empty iterator. Hope that helps :)
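For illustration, here is a minimal sketch of that pattern, keeping the Iterable-returning PairFlatMapFunction from the question's Spark 1.2 API. MyKeyClass, MyValueClass, myRDD, and the value computation are the question's placeholders, not a working implementation:

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.PairFlatMapFunction;
import scala.Tuple2;

JavaPairRDD<Integer, Double> myResult = myRDD.mapPartitionsToPair(
    new PairFlatMapFunction<Iterator<Tuple2<MyKeyClass, MyValueClass>>, Integer, Double>() {
        public Iterable<Tuple2<Integer, Double>> call(Iterator<Tuple2<MyKeyClass, MyValueClass>> it) throws Exception {
            List<Tuple2<Integer, Double>> result = new ArrayList<Tuple2<Integer, Double>>();
            // Empty partition: return an empty Iterable instead of calling next().
            if (!it.hasNext()) {
                return result;
            }
            TreeMap<Integer, Double> dic = new TreeMap<Integer, Double>();
            // A while loop checks hasNext() before every next() call, unlike the
            // original do-while, which calls next() first and can hit an empty iterator.
            while (it.hasNext()) {
                Tuple2<MyKeyClass, MyValueClass> temp = it.next();
                // ... compute newIntegerValue and newDoubleValue from temp, then dic.put(...) ...
            }
            for (Map.Entry<Integer, Double> e : dic.entrySet()) {
                result.add(new Tuple2<Integer, Double>(e.getKey(), e.getValue()));
            }
            return result;
        }
    });

Strictly speaking, the while loop alone already guards against the empty partition; the explicit hasNext() check at the top just makes the empty case obvious.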