NullPointerException when running a Spark job

Asked: 2015-01-15 12:26:50

Tags: apache-spark google-hadoop

I am running Spark in standalone mode, version 1.2.0.

The first operation I perform takes an RDD of folder paths and produces an RDD of file names, consisting of the files inside each folder:

JavaRDD<String> filePaths = paths.mapPartitions(new FoldersToFiles()).repartition(defaultPartitions);

The internal implementation of the FoldersToFiles class is:

@Override
public Iterable<String> call(Iterator<String> pathsIterator) throws Exception {
    List<String> filesPath = new ArrayList<String>();
    if (pathsIterator != null) {
        while (pathsIterator.hasNext()) {
            try {
                // List the files under each folder in this partition and
                // collect their paths as strings.
                String currFolder = pathsIterator.next();
                Path currPath = new Path(currFolder);
                FileSystem fs = FileSystem.get(currPath.toUri(), new Configuration(true));
                FileStatus[] files = fs.listStatus(currPath);
                List<FileStatus> filesList = Arrays.asList(files);
                List<String> filesPathsStr = new Utils().convertFileStatusToPath(filesList);
                filesPath.addAll(filesPathsStr);
            } catch (Exception e) {
                log.error("Error during file names extraction: " + e.getMessage());
            }
        }
    }
    if (filesPath.isEmpty()) {
        // filesPath is initialized above, so it can only be empty here, never null;
        // plain concatenation keeps this log line safe even if pathsIterator is null.
        log.error("Warning: files path list is empty! Given path iterator is: " + pathsIterator);
    }
    return filesPath;
}
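
(The Utils.convertFileStatusToPath helper is not shown here; presumably it does little more than map each FileStatus to its fully qualified path string, along these lines:)

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;

// Assumed implementation of the helper used above: extract the path
// string from each FileStatus entry.
public List<String> convertFileStatusToPath(List<FileStatus> statuses) {
    List<String> paths = new ArrayList<String>();
    for (FileStatus status : statuses) {
        paths.add(status.getPath().toString());
    }
    return paths;
}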

When I run the job on the cluster, I get the following error:

520983 [task-result-getter-1] WARN org.apache.spark.scheduler.TaskSetManager  - Lost task 33.0 in stage 1.0 (TID 1033, hadoop-w-8.c.taboola-qa-01.internal): java.lang.NullPointerException
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$4$1.apply(JavaRDDLike.scala:140)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$4$1.apply(JavaRDDLike.scala:140)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

So the error is not directly in my code. However, looking at the relevant lines of the Spark source:

  /**
   * Return a new RDD by applying a function to each partition of this RDD.
   */
  def mapPartitions[U](f: FlatMapFunction[java.util.Iterator[T], U]): JavaRDD[U] = {
    def fn = (x: Iterator[T]) => asScalaIterator(f.call(asJavaIterator(x)).iterator())
    JavaRDD.fromRDD(rdd.mapPartitions(fn)(fakeClassTag[U]))(fakeClassTag[U])
  }

(line 140, where the exception is thrown, is the first line of the method body)

It is probably related to the code quoted above (this is in fact the first mapPartitions in my job, so that would make sense), but I cannot understand why.
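
Reading that snippet, the anonymous function at line 140 evaluates f.call(asJavaIterator(x)).iterator(), so a NullPointerException at exactly that line suggests that .iterator() was invoked on a null Iterable, i.e. that call() returned null. A minimal illustration of that failure mode (the NullSafeFunction class is hypothetical, not part of my job):

import java.util.Collections;
import java.util.Iterator;

import org.apache.spark.api.java.function.FlatMapFunction;

// Hypothetical example: in Spark 1.x the wrapper quoted above calls
// .iterator() on whatever call() returns, so a null return value raises
// a NullPointerException at JavaRDDLike.scala line 140.
public class NullSafeFunction implements FlatMapFunction<Iterator<String>, String> {
    @Override
    public Iterable<String> call(Iterator<String> input) throws Exception {
        // return null;                  // would reproduce the NPE at line 140
        return Collections.emptyList();  // safe: .iterator() works on an empty list
    }
}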

1 Answer:

Answer 0 (score: 0):

Just a hunch: maybe the FoldersToFiles class needs to be declared static (if it is a private inner class)?
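
A non-static inner class keeps an implicit reference to its enclosing instance, and Spark has to serialize that enclosing instance along with the function when it ships the task to the executors; declaring the class static removes that hidden capture. A minimal sketch of the suggestion (the enclosing class name MyJob is hypothetical):

import java.util.Collections;
import java.util.Iterator;

import org.apache.spark.api.java.function.FlatMapFunction;

public class MyJob {
    // Declared static: the function carries no hidden reference to a MyJob
    // instance, so only the function object itself is serialized and sent
    // to the executors.
    static class FoldersToFiles implements FlatMapFunction<Iterator<String>, String> {
        @Override
        public Iterable<String> call(Iterator<String> pathsIterator) throws Exception {
            // ... folder-listing logic from the question goes here ...
            return Collections.emptyList();
        }
    }
}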