Spark / Java serialization problem - org.apache.spark.SparkException: Task not serializable

Asked: 2017-11-30 17:31:26

Tags: java apache-spark

While writing a Spark application in Java, I am running into a problem with the following code:

public class BatchLayerDefaultJob implements Serializable {

    // Declaration assumed; the original post only shows executor.submit() being called.
    private static ExecutorService executor;

    private static Function<BatchLayerProcessor, Future> batchFunction = new Function<BatchLayerProcessor, Future>() {
        @Override
        public Future call(BatchLayerProcessor s) {
            return executor.submit(s);
        }
    };

    public void applicationRunner(BatchParameters batchParameters) {

        SparkConf sparkConf = new SparkConf().setAppName("Platform Engine - Batch Job");
        sparkConf.set("spark.driver.allowMultipleContexts", "true");
        sparkConf.set("spark.cores.max", "1");
        JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);
        List<BatchLayerProcessor> batchListforRDD = new ArrayList<BatchLayerProcessor>();

        // populate List here.... Then attempt to process below

        JavaRDD<BatchLayerProcessor> distData = sparkContext.parallelize(batchListforRDD, batchListforRDD.size());
        JavaRDD<Future> result = distData.map(batchFunction);
        result.collect(); // <-- Produces an object not serializable exception here
    }
}

I have tried a number of things to no avail, including extracting batchFunction into a separate class outside of the main class, and I have also tried using mapPartitions instead of map. I am more or less out of ideas; any help is appreciated.
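For reference, here is a minimal sketch of what the mapPartitions attempt might have looked like (my assumption of the wiring, not the original code, written against the Spark 2.x Java API where FlatMapFunction.call returns an Iterator). It fails the same way, because the stack trace below shows Spark failing to serialize the ParallelCollectionPartition data, i.e. the BatchLayerProcessor elements themselves, not the mapping function:

JavaRDD<Future> result = distData.mapPartitions(
        new FlatMapFunction<Iterator<BatchLayerProcessor>, Future>() {
            @Override
            public Iterator<Future> call(Iterator<BatchLayerProcessor> partition) {
                // Submitting to a local ExecutorService from inside a Spark task
                // still requires the partition elements to reach the executor first.
                List<Future> futures = new ArrayList<Future>();
                while (partition.hasNext()) {
                    futures.add(executor.submit(partition.next()));
                }
                return futures.iterator();
            }
        });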

Stack trace below:

17/11/30 17:11:28 INFO DAGScheduler: Job 0 failed: collect at BatchLayerDefaultJob.java:122, took 23.406561 s
Exception in thread "Thread-8" org.apache.spark.SparkException: Job aborted due to stage failure: Failed to serialize task 0, not attempting to retry it. Exception during serialization: 
java.io.NotSerializableException: xxxx.BatchLayerProcessor
Serialization stack:
- object not serializable (class: xxxx.BatchLayerProcessor, value: xxxx.BatchLayerProcessor@3e745097)
- element of array (index: 0)
- array (class [Ljava.lang.Object;, size 1)
- field (class: scala.collection.mutable.WrappedArray$ofRef, name: array, type: class [Ljava.lang.Object;)
- object (class scala.collection.mutable.WrappedArray$ofRef, WrappedArray(xxxx.BatchLayerProcessor@3e745097))
- writeObject data (class: org.apache.spark.rdd.ParallelCollectionPartition)
- object (class org.apache.spark.rdd.ParallelCollectionPartition, org.apache.spark.rdd.ParallelCollectionPartition@691)
- field (class: org.apache.spark.scheduler.ResultTask, name: partition, type: interface org.apache.spark.Partition)
- object (class org.apache.spark.scheduler.ResultTask, ResultTask(0, 0))
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)

Cheers.

Edit: added BatchLayerProcessor as requested - slightly truncated:

public class BatchLayerProcessor implements Runnable, Serializable {
    private int interval, backMinutes;
    private String scoreVal, batchjobid;
    private static CountDownLatch countDownLatch;

    public void run() {
        /* Get a reference to the ApplicationContextReader, a singleton */
        ApplicationContextReader applicationContextReaderCopy = ApplicationContextReader.getInstance();

        synchronized (BatchLayerProcessor.class) /* Protect singleton member variable from multithreaded access. */ {
            if (applicationContextReader == null) /* If local reference is null... */
                applicationContextReader = applicationContextReaderCopy; /* ...set it to the singleton */
        }

        if (getxScoreVal().equals("")) {
            applicationContextReader.getScoreService().calculateScores(applicationContextReader.getFunctions(), getInterval(), getBackMinutes(), getScoreVal(), true, getTimeInterval(), getIncludes(), getExcludes());
        }
        else {
            applicationContextReader.getScoreService().calculateScores(applicationContextReader.getFunctions(), getInterval(), getBackMinutes(), getScoreVal(), true, getTimeInterval(), getIncludes(), getExcludes());
        }

        countDownLatch.countDown();
    }
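A side note on the singleton reference used in run(): the applicationContextReader field itself is not shown in the truncated listing. If it is an instance field, it has to be either serializable or transient for the class to serialize cleanly; a minimal sketch of the transient variant (an assumption, not the poster's actual declaration):

// Assumed declaration, omitted from the truncated listing above. transient keeps
// the non-serializable singleton out of Java serialization; run() re-resolves it
// on the executor via ApplicationContextReader.getInstance() when it is null.
private transient ApplicationContextReader applicationContextReader;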

1 Answer:

Answer 0 (score: 0):

I decided to change BatchLayerProcessor so that it is no longer Runnable, and instead rely on Spark to do that work for me.
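A minimal sketch of what that refactor could look like (the process() method name, the VoidFunction wiring, and the dropped executor/CountDownLatch are my assumptions, not the poster's actual code): the class stays Serializable, stops implementing Runnable, and Spark's own task scheduling replaces the hand-rolled thread pool.

public class BatchLayerProcessor implements Serializable {
    private int interval, backMinutes;
    private String scoreVal, batchjobid;

    // Plain method instead of Runnable.run(); each Spark task invokes it directly,
    // so no ExecutorService, Future or CountDownLatch is needed.
    public void process() {
        ApplicationContextReader context = ApplicationContextReader.getInstance();
        context.getScoreService().calculateScores(context.getFunctions(), getInterval(),
                getBackMinutes(), getScoreVal(), true, getTimeInterval(),
                getIncludes(), getExcludes());
    }
    // getters omitted, as in the truncated listing above
}

// Driver side: let Spark distribute and execute the work itself.
JavaRDD<BatchLayerProcessor> distData =
        sparkContext.parallelize(batchListforRDD, batchListforRDD.size());
distData.foreach(new VoidFunction<BatchLayerProcessor>() {
    @Override
    public void call(BatchLayerProcessor processor) {
        processor.process();
    }
});

The RDD elements still have to ship to the executors, so every field kept on the class (and anything it references transitively, other than what is resolved locally via the singleton) must remain serializable.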