collect() does not return when called on an RDD produced by combineByKey

Date: 2016-11-10 13:35:51

Tags: java apache-spark

Disclaimer: I am new to Spark.

I have an RDD that looks like:

[(T,[Tina, Thomas]), (T,[Tolis]), (C,[Cory, Christine]), (J,[Joseph, Jimmy, James, Jackeline, Juan]), (J,[Jimbo, Jina])]

I call combineByKey and get back a JavaPairRDD<Character, Integer>.

The call itself seems to work fine (control flow continues past this point and, in the debugger, foo appears to hold some kind of value):

JavaPairRDD<Character, Integer> foo = rdd.combineByKey(createAcc, addAndCount, combine);
System.out.println(foo.collect());

My problem is that the program never returns after the call to foo.collect(). Do you have any ideas? I tried stepping through it with the Eclipse debugger, but I had no luck at all.

I am using Spark version 2.0.0 and Java 8.

EDIT: The code of the functions passed to combineByKey follows (it is obviously dummy code, since I am new to Spark). My goal is to use combineByKey to find the total length of the lists of Strings that each key maps to:

    // createCombiner: meant to count the elements of the first value seen for a key
    Function<Iterable<String>, Integer> createAcc =
        new Function<Iterable<String>, Integer>() {
            public Integer call(Iterable<String> x) {
                int counter = 0;
                Iterator<String> it = x.iterator();
                while (it.hasNext()) {
                    counter++;
                }
                return counter;
            }
        };

    // mergeValue: meant to add the size of a further value to an existing accumulator
    Function2<Integer, Iterable<String>, Integer> addAndCount =
        new Function2<Integer, Iterable<String>, Integer>() {
            public Integer call(Integer acc, Iterable<String> x) {
                int counter = 0;
                Iterator<String> it = x.iterator();
                while (it.hasNext()) {
                    counter++;
                }
                return counter + acc;
            }
        };

    // mergeCombiners: adds two partial counts together
    Function2<Integer, Integer, Integer> combine =
        new Function2<Integer, Integer, Integer>() {
            public Integer call(Integer x, Integer y) {
                return x + y;
            }
        };

UPDATE 2: The requested logs follow:

    16/11/11 17:21:32 INFO SparkContext: Starting job: count at Foo.java:265
    16/11/11 17:21:32 INFO DAGScheduler: Got job 9 (count at Foo.java:265) with 3 output partitions
    16/11/11 17:21:32 INFO DAGScheduler: Final stage: ResultStage 20 (count at Foo.java:265)
    16/11/11 17:21:32 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 19, ShuffleMapStage 18)
    16/11/11 17:21:32 INFO DAGScheduler: Missing parents: List()
    16/11/11 17:21:32 INFO DAGScheduler: Submitting ResultStage 20 (MapPartitionsRDD[24] at combineByKey at Foo.java:264), which has no missing parents
    16/11/11 17:21:32 INFO MemoryStore: Block broadcast_12 stored as values in memory (estimated size 6.7 KB, free 1946.0 MB)
    16/11/11 17:21:32 INFO MemoryStore: Block broadcast_12_piece0 stored as bytes in memory (estimated size 3.4 KB, free 1946.0 MB)
    16/11/11 17:21:32 INFO BlockManagerInfo: Added broadcast_12_piece0 in memory on xxx.xxx.xx.xx:55712 (size: 3.4 KB, free: 1946.1 MB)
    16/11/11 17:21:32 INFO SparkContext: Created broadcast 12 from broadcast at DAGScheduler.scala:1012
    16/11/11 17:21:32 INFO DAGScheduler: Submitting 3 missing tasks from ResultStage 20 (MapPartitionsRDD[24] at combineByKey at Foo.java:264)
    16/11/11 17:21:32 INFO TaskSchedulerImpl: Adding task set 20.0 with 3 tasks
    16/11/11 17:21:32 INFO TaskSetManager: Starting task 0.0 in stage 20.0 (TID 30, localhost, partition 0, ANY, 5288 bytes)
    16/11/11 17:21:32 INFO Executor: Running task 0.0 in stage 20.0 (TID 30)
    16/11/11 17:21:32 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 3 blocks
    16/11/11 17:21:32 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms

1 Answer:

Answer 0 (score: 2):

This is a plain Java problem: your while loop never calls it.next(), so it never terminates.

Change it to:

    while (it.hasNext()) {
      it.next();   // advance the iterator; otherwise hasNext() stays true forever
      counter++;
    }
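
For reference, here is a minimal self-contained sketch of the corrected pipeline. It is a reconstruction under stated assumptions, not the asker's actual program: the class name CombineByKeyExample, the local[3] master, and the use of Java 8 lambdas (equivalent shorthand for the anonymous classes above, since the asker is on Java 8) are all illustrative choices. It rebuilds the grouped sample data from the question with parallelizePairs, applies the three functions with the fix, and prints the per-key totals.

    import java.util.Arrays;
    import java.util.Iterator;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.Function;
    import org.apache.spark.api.java.function.Function2;

    import scala.Tuple2;

    public class CombineByKeyExample {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("combineByKey-demo").setMaster("local[3]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Rebuild the grouped RDD shown in the question: (key, list of names).
            JavaPairRDD<Character, Iterable<String>> grouped = sc.parallelizePairs(Arrays.asList(
                    new Tuple2<Character, Iterable<String>>('T', Arrays.asList("Tina", "Thomas")),
                    new Tuple2<Character, Iterable<String>>('T', Arrays.asList("Tolis")),
                    new Tuple2<Character, Iterable<String>>('C', Arrays.asList("Cory", "Christine")),
                    new Tuple2<Character, Iterable<String>>('J', Arrays.asList("Joseph", "Jimmy", "James", "Jackeline", "Juan")),
                    new Tuple2<Character, Iterable<String>>('J', Arrays.asList("Jimbo", "Jina"))));

            // createCombiner: count the elements of the first value seen for a key.
            Function<Iterable<String>, Integer> createAcc = x -> {
                int counter = 0;
                Iterator<String> it = x.iterator();
                while (it.hasNext()) {
                    it.next();   // the fix from the answer: advance, or the loop never ends
                    counter++;
                }
                return counter;
            };

            // mergeValue: add the size of a further value to an existing accumulator.
            Function2<Integer, Iterable<String>, Integer> addAndCount = (acc, x) -> {
                int counter = 0;
                for (String ignored : x) {   // for-each advances the iterator implicitly
                    counter++;
                }
                return counter + acc;
            };

            // mergeCombiners: combine two partial counts for the same key.
            Function2<Integer, Integer, Integer> combine = (x, y) -> x + y;

            JavaPairRDD<Character, Integer> foo = grouped.combineByKey(createAcc, addAndCount, combine);
            System.out.println(foo.collect());   // e.g. [(T,3), (C,2), (J,7)] (ordering may vary)

            sc.stop();
        }
    }

Note the for-each loop in addAndCount: it sidesteps the missed it.next() entirely, which makes this class of hang impossible in the first place.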