I can't solve the serialization problem below, which is triggered by filtered.foreachPartition(iter => {. I thought foreachPartition would solve the serialization problem, but it doesn't. So how can I use redisPool?
Edit (I updated the code to make it clearer):
val redis_host = "localhost"
val redis_port = 6379

messages.foreachRDD(rdd => {
  rdd.foreachPartition(iter => {
    val redisPool = new Pool(new JedisPool(new JedisPoolConfig(), redis_host, redis_port, 2000))
    iter.foreach({ msg =>
      println(msg.mkString(","))
    })
  })
})
I assume the variables redis_host and redis_port are not serializable, but how do I serialize them properly so that the code works on a cluster, not just locally?
The code shown above throws this error:
org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2055)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:919)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:918)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:918)
    at org.test.manager.service.consumer.kafka.KafkaDecisionsConsumer$$anonfun$run$1.apply(KafkaDecisionsConsumer.scala:135)
    at org.test.manager.service.consumer.kafka.KafkaDecisionsConsumer$$anonfun$run$1.apply(KafkaDecisionsConsumer.scala:134)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:661)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:661)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:50)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
    at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:426)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:49)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
    at scala.util.Try$.apply(Try.scala:161)
    at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)
    at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:224)
    at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:224)
    at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:224)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
    at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:223)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.NotSerializableException: org.test.manager.service.consumer.kafka.KafkaDecisionsConsumer
Serialization stack:
    - object not serializable (class: org.test.manager.service.consumer.kafka.KafkaDecisionsConsumer, value: org.test.manager.service.consumer.kafka.KafkaDecisionsConsumer@3fba5c74)
    - field (class: org.test.manager.service.consumer.kafka.KafkaDecisionsConsumer$$anonfun$run$1, name: $outer, type: class org.test.manager.service.consumer.kafka.KafkaDecisionsConsumer)
    - object (class org.test.manager.service.consumer.kafka.KafkaDecisionsConsumer$$anonfun$run$1, )
    - field (class: org.test.manager.service.consumer.kafka.KafkaDecisionsConsumer$$anonfun$run$1$$anonfun$apply$1, name: $outer, type: class org.test.manager.service.consumer.kafka.KafkaDecisionsConsumer$$anonfun$run$1)
    - object (class org.test.manager.service.consumer.kafka.KafkaDecisionsConsumer$$anonfun$run$1$$anonfun$apply$1, )
    at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
    ... 30 more
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
    (same stack trace and serialization stack repeated)
Answer 0 (score: 0):
The solution is to initialize the pool lazily inside the anonymous function. In Java you can do it like this:
messages.foreachRDD(new RedisFunction(redis_host, redis_port));
messages.count();

class RedisFunction implements Function<JavaRDD<String>, Void> {

    // transient: the pool must never be serialized with the function;
    // it is created lazily on the executor the first time call() runs
    private transient Pool pool = null;
    private final String redis_host;
    private final int redis_port;

    RedisFunction(String redis_host, int redis_port) {
        this.redis_host = redis_host;
        this.redis_port = redis_port;
    }

    private void initPool() {
        this.pool = new Pool(new JedisPool(new JedisPoolConfig(), redis_host, redis_port, 2000));
    }

    public Void call(JavaRDD<String> rdd) {
        if (this.pool == null) {
            initPool();
        }
        rdd = rdd.map(....); /* your rdd transformations go here */
        rdd.count(); // spark action
        return null;
    }
}
The Java example above should help you fix the serialization problem.
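The same lazy-initialization idea carries over to the Scala code in the question. Below is a minimal sketch under two assumptions: the Pool wrapper is the one from the question, and the elements of messages support mkString. The serialization stack shows the real culprit is the $outer field, i.e. the closure capturing the enclosing non-serializable KafkaDecisionsConsumer, so the sketch copies the host and port into local vals (plain values the closure can capture safely) and builds the pool inside foreachPartition, so it is constructed on each executor rather than shipped from the driver:

val host = redis_host // local copies: the closure captures plain values,
val port = redis_port // not the enclosing (non-serializable) class

messages.foreachRDD { rdd =>
  rdd.foreachPartition { iter =>
    // Created on the executor, so it is never serialized.
    val jedisPool = new JedisPool(new JedisPoolConfig(), host, port, 2000)
    val redisPool = new Pool(jedisPool)
    try {
      iter.foreach(msg => println(msg.mkString(",")))
    } finally {
      jedisPool.destroy() // release connections when the partition is done
    }
  }
}

Creating one pool per partition per batch is simple but not free; if that overhead matters, keep the pool in a lazily initialized Scala object instead, since an object is initialized at most once per executor JVM and is never serialized.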