Serialization issues when running a Spark Streaming job

Asked: 2016-11-18 20:03:04

Tags: scala serialization apache-spark

I cannot get around the serialization issue below, which is triggered by filtered.foreachPartition(iter => {. I thought foreachPartition would solve the serialization problem, but apparently it does not. So how can I use redisPool?

EDIT (I updated the code to make it clearer):

val redis_host = "localhost"
val redis_port = 6379
messages.foreachRDD(rdd => {
  rdd.foreachPartition(iter => {
    val redisPool = new Pool(new JedisPool(new JedisPoolConfig(), redis_host, redis_port, 2000))
    iter.foreach({ msg =>
      println(msg.mkString(","))
    })
  })
})

I assume the variables redis_host and redis_port are not serializable, but how do I serialize them properly so that the code works on a cluster and not only locally?

The code shown above throws this error:


org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2055)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:919)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:918)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:918)
    at org.test.manager.service.consumer.kafka.KafkaDecisionsConsumer$$anonfun$run$1.apply(KafkaDecisionsConsumer.scala:135)
    at org.test.manager.service.consumer.kafka.KafkaDecisionsConsumer$$anonfun$run$1.apply(KafkaDecisionsConsumer.scala:134)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:661)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:661)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:50)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
    at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:426)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:49)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
    at scala.util.Try$.apply(Try.scala:161)
    at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)
    at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:224)
    at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:224)
    at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:224)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
    at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:223)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.NotSerializableException: org.test.manager.service.consumer.kafka.KafkaDecisionsConsumer
Serialization stack:
    - object not serializable (class: org.test.manager.service.consumer.kafka.KafkaDecisionsConsumer, value: org.test.manager.service.consumer.kafka.KafkaDecisionsConsumer@3fba5c74)
    - field (class: org.test.manager.service.consumer.kafka.KafkaDecisionsConsumer$$anonfun$run$1, name: $outer, type: class org.test.manager.service.consumer.kafka.KafkaDecisionsConsumer)
    - object (class org.test.manager.service.consumer.kafka.KafkaDecisionsConsumer$$anonfun$run$1, <function1>)
    - field (class: org.test.manager.service.consumer.kafka.KafkaDecisionsConsumer$$anonfun$run$1$$anonfun$apply$1, name: $outer, type: class org.test.manager.service.consumer.kafka.KafkaDecisionsConsumer$$anonfun$run$1)
    - object (class org.test.manager.service.consumer.kafka.KafkaDecisionsConsumer$$anonfun$run$1$$anonfun$apply$1, <function1>)
    at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
    ... 30 more
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
    [the identical stack trace and serialization stack are printed a second time]

1 Answer:

Answer 0 (score: 0)

The stack trace shows what is actually being serialized: the closure captures its enclosing instance (the $outer field of type KafkaDecisionsConsumer), and that class is not serializable. Plain values like redis_host and redis_port are not the problem. The solution is to initialize the pool lazily inside the function, so it is created where the code runs instead of being shipped with the closure. In Java (here against the Spark 1.x API, where foreachRDD takes a Function<JavaRDD<T>, Void>) you can do it like this:

messages.foreachRDD(new RedisFunction(redis_host, redis_port));
messages.count();

class RedisFunction implements Function<JavaRDD<String>, Void> {
    // transient: the pool is never serialized with the closure; each JVM
    // that runs call() builds its own (String elements assumed for illustration)
    private transient Pool pool = null;
    private final String redis_host;
    private final int redis_port;

    RedisFunction(String redis_host, int redis_port) {
        this.redis_host = redis_host;
        this.redis_port = redis_port;
        // note: no initPool() here -- the pool is created lazily in call()
    }

    private void initPool() {
        this.pool = new Pool(new JedisPool(new JedisPoolConfig(), redis_host, redis_port, 2000));
    }

    public Void call(JavaRDD<String> rdd) {
        if (this.pool == null) {
            initPool();
        }
        rdd = rdd.map(....);  /* your rdd transformations go here */
        rdd.count();   // spark action
        return null;
    }
}

The Java example above should help you fix the serialization issue.
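
Since the question is in Scala, here is a minimal sketch of the same idea in Scala, assuming the question's surrounding definitions (the messages stream and the Pool wrapper). Copying the settings into local vals means the closure captures only those serializable values rather than the enclosing class (the $outer reference named in the stack trace), and constructing the pool inside foreachPartition means it is built on the executor and never serialized:

// Local copies of the settings; capturing these instead of
// class fields keeps the enclosing instance out of the closure.
val host = redis_host
val port = redis_port

messages.foreachRDD { rdd =>
  rdd.foreachPartition { iter =>
    // Created on the executor for each partition, so nothing here is serialized.
    val redisPool = new Pool(new JedisPool(new JedisPoolConfig(), host, port, 2000))
    iter.foreach { msg =>
      println(msg.mkString(","))
    }
  }
}

If building a pool per partition is too expensive, a common variant is to hold it in a lazy val inside a top-level object, so each executor JVM initializes it once on first use.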