Is it possible to create broadcast variables inside a Spark Streaming transformation function?

Time: 2018-10-18 02:57:17

Tags: apache-spark spark-streaming

I am trying to build a recoverable Spark Streaming job that uses some parameters fetched from a database. But then I ran into a problem: whenever I try to restart the job from the checkpoint, a serialization error is thrown.
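For context, the job is built the usual checkpoint-recovery way, with the whole DStream graph wired up inside a creating function passed to StreamingContext.getOrCreate. A rough sketch of that skeleton (the object name, checkpoint path, and batch interval below are simplified placeholders, not my real code):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object AlarmMainSketch {
      val checkpointDir = "hdfs:///tmp/alert-job-checkpoint" // placeholder path

      // Runs only when no checkpoint exists yet. Everything defined here
      // is serialized into the checkpoint, so any state captured by the
      // closures below must be serializable for recovery to work.
      def createSsc(): StreamingContext = {
        val conf = new SparkConf().setAppName("alert-job")
        val ssc = new StreamingContext(conf, Seconds(10))
        ssc.checkpoint(checkpointDir)
        // ... define the Kafka stream and its transformations here ...
        ssc
      }

      def main(args: Array[String]): Unit = {
        // On restart, getOrCreate rebuilds the DStream graph from the
        // checkpoint instead of calling createSsc() again.
        val ssc = StreamingContext.getOrCreate(checkpointDir, createSsc _)
        ssc.start()
        ssc.awaitTermination()
      }
    }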


  18/10/18 09:54:33 ERROR Executor: Exception in task 1.0 in stage 56.0 (TID 132)
  java.lang.ClassCastException: org.apache.spark.util.SerializableConfiguration cannot be cast to scala.collection.MapLike
      at com.ptnj.streaming.alertJob.InputDataParser$.kafka_stream_handle(InputDataParser.scala:37)
      at com.ptnj.streaming.alertJob.InstanceAlertJob$$anonfun$1.apply(InstanceAlertJob.scala:38)
      at com.ptnj.streaming.alertJob.InstanceAlertJob$$anonfun$1.apply(InstanceAlertJob.scala:38)
      at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
      at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
      at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
      at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
      at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
      at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
      at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
      at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
      at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
      at org.apache.spark.scheduler.Task.run(Task.scala:99)
      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      at java.lang.Thread.run(Thread.java:748)

In this existing SO question I followed Maxim G's advice, and it seemed to help.

But now there is another exception. Because of that issue, I have to create the broadcast variables while doing the stream transformation, e.g.

 val kafka_data_streaming = stream.map(x => DstreamHandle.kafka_stream_handle(url, x.value(), sc))

so the SparkContext has to be passed as a parameter into the transformation function, and then this happens:


  Exception in thread "main" org.apache.spark.SparkException: Task not serializable
      at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
      at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
      at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
      at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
      at org.apache.spark.streaming.dstream.DStream$$anonfun$map$1.apply(DStream.scala:546)
      at org.apache.spark.streaming.dstream.DStream$$anonfun$map$1.apply(DStream.scala:546)
      at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
      at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
      at org.apache.spark.SparkContext.withScope(SparkContext.scala:701)
      at org.apache.spark.streaming.StreamingContext.withScope(StreamingContext.scala:264)
      at org.apache.spark.streaming.dstream.DStream.map(DStream.scala:545)
      at com.ptnj.streaming.alertJob.InstanceAlertJob$.streaming_main(InstanceAlertJob.scala:38)
      at com.ptnj.streaming.AlarmMain$.create_ssc(AlarmMain.scala:36)
      at com.ptnj.streaming.AlarmMain$.main(AlarmMain.scala:14)
      at com.ptnj.streaming.AlarmMain.main(AlarmMain.scala)
  Caused by: java.io.NotSerializableException: org.apache.spark.SparkContext
  Serialization stack:
      - object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext@5fb7183b)
      - field (class: com.ptnj.streaming.alertJob.InstanceAlertJob$$anonfun$1, name: sc$1, type: class org.apache.spark.SparkContext)
      - object (class com.ptnj.streaming.alertJob.InstanceAlertJob$$anonfun$1, <function1>)
      at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
      at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
      at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
      at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
      ... 14 more

I have never seen this situation before. Every example shows broadcast variables being created in an output-operation function, not in a transformation function. So, is that even possible?
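For reference, the closest thing I have found to this situation is the lazily-instantiated singleton broadcast pattern from Spark's RecoverableNetworkWordCount example: the broadcast is created (or re-created after a restart) on the driver inside transform, from the RDD's own SparkContext, so no SparkContext ends up captured in a closure. A minimal sketch, where DbParams and loadParamsFromDb are hypothetical stand-ins for my real parameter loading:

    import org.apache.spark.SparkContext
    import org.apache.spark.broadcast.Broadcast
    import org.apache.spark.streaming.dstream.DStream

    // Singleton objects are not serialized into the checkpoint, so after
    // a restart the broadcast is simply rebuilt on first access.
    object DbParams {
      @volatile private var instance: Broadcast[Map[String, String]] = _

      def getInstance(sc: SparkContext): Broadcast[Map[String, String]] = {
        if (instance == null) {
          synchronized {
            if (instance == null) {
              instance = sc.broadcast(loadParamsFromDb()) // hypothetical DB call
            }
          }
        }
        instance
      }

      private def loadParamsFromDb(): Map[String, String] = Map.empty // placeholder
    }

    // transform runs on the driver once per batch, so rdd.sparkContext is
    // available there; only the (serializable) broadcast handle is captured
    // by the inner map closure that runs on the executors.
    def parse(stream: DStream[String]): DStream[String] =
      stream.transform { rdd =>
        val params = DbParams.getInstance(rdd.sparkContext)
        rdd.map(line => line + params.value.getOrElse("suffix", "")) // stand-in for kafka_stream_handle
      }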

0 Answers:

No answers.