We are using the receiver-based approach in Spark Streaming, and we have just enabled checkpointing to get rid of the data-loss issue.
Spark version is 1.6.1, and we are receiving messages from a Kafka topic.
I'm using the foreachRDD method of the DStream inside ssc, and it throws a Not Serializable exception.
I tried extending the Serializable class, but still the same error. It happens only when we enable checkpointing.
import kafka.serializer.StringDecoder
import org.apache.kafka.clients.CommonClientConfigs
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

def main(args: Array[String]): Unit = {
  val checkPointLocation = "/path/to/wal"
  // Recover the context from the checkpoint, or build a fresh one on the first run.
  val ssc = StreamingContext.getOrCreate(checkPointLocation, () => createContext(checkPointLocation))
  ssc.start()
  ssc.awaitTermination()
}

def createContext(checkPointLocation: String): StreamingContext = {
  val sparkConf = new SparkConf().setAppName("Test")
  sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
  val ssc = new StreamingContext(sparkConf, Seconds(40))
  ssc.checkpoint(checkPointLocation)

  val sc = ssc.sparkContext
  val sqlContext: SQLContext = new HiveContext(sc)

  // groupId, sasl, brokerList, zookeeperURL, topicMap and schema are defined elsewhere.
  val kafkaParams = Map(
    "group.id" -> groupId,
    CommonClientConfigs.SECURITY_PROTOCOL_CONFIG -> sasl,
    ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer",
    ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer",
    "metadata.broker.list" -> brokerList,
    "zookeeper.connect" -> zookeeperURL)

  val dStream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, topicMap, StorageLevel.MEMORY_AND_DISK_SER).map(_._2)

  dStream.foreachRDD(rdd => {
    // Using the driver-side sparkContext / sqlContext for any operation here throws the error.
    // Convert RDD[String] to RDD[Row], create a schema for the RDD, then:
    sqlContext.createDataFrame(rdd, schema)
  })

  ssc
}
Error log:
2017-02-08 22:53:53,250 ERROR [Driver] streaming.StreamingContext: Error starting the context, marking it as stopped
java.io.NotSerializableException: DStream checkpointing has been enabled but the DStreams with their functions are not serializable
org.apache.spark.SparkContext
Serialization stack:
	- object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext@1c5e3677)
	- field (class: com.x.payments.RemedyDriver$$anonfun$main$1, name: sc$1, type: class org.apache.spark.SparkContext)
	- object (class com.x.payments.RemedyDriver$$anonfun$main$1, <function1>)
	- field (class: org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3, name: cleanedF$1, type: interface scala.Function1)
	- object (class org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3, <function1>)
	- writeObject data (class: org.apache.spark.streaming.dstream.DStream)
	- object (class org.apache.spark.streaming.dstream.ForEachDStream, org.apache.spark.streaming.dstream.ForEachDStream@68866c5)
	- element of array (index: 0)
	- array (class [Ljava.lang.Object;, size 16)
	- field (class: scala.collection.mutable.ArrayBuffer, name: array, type: class [Ljava.lang.Object;)
	- object (class scala.collection.mutable.ArrayBuffer, ArrayBuffer(org.apache.spark.streaming.dstream.ForEachDStream@68866c5))
	- writeObject data (class: org.apache.spark.streaming.dstream.DStreamCheckpointData)
	- object (class org.apache.spark.streaming.dstream.DStreamCheckpointData, [0 checkpoint files])
	- writeObject data (class: org.apache.spark.streaming.dstream.DStream)
	- object (class org.apache.spark.streaming.kafka.KafkaInputDStream, org.apache.spark.streaming.kafka.KafkaInputDStream@acd8e32)
	- element of array (index: 0)
	- array (class [Ljava.lang.Object;, size 16)
	- field (class: scala.collection.mutable.ArrayBuffer, name: array, type: class [Ljava.lang.Object;)
	- object (class scala.collection.mutable.ArrayBuffer, ArrayBuffer(org.apache.spark.streaming.kafka.KafkaInputDStream@acd8e32))
	- writeObject data (class: org.apache.spark.streaming.DStreamGraph)
	- object (class org.apache.spark.streaming.DStreamGraph, org.apache.spark.streaming.DStreamGraph@6935641e)
	- field (class: org.apache.spark.streaming.Checkpoint, name: graph, type: class org.apache.spark.streaming.DStreamGraph)
	- object (class org.apache.spark.streaming.Checkpoint, org.apache.spark.streaming.Checkpoint@484bf033)
	at org.apache.spark.streaming.StreamingContext.validate(StreamingContext.scala:557)
	at org.apache.spark.streaming.StreamingContext.liftedTree1$1(StreamingContext.scala:601)
	at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:600)
	at com.x.payments.RemedyDriver$.main(RemedyDriver.scala:104)
	at com.x.payments.RemedyDriver.main(RemedyDriver.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:559)
2017-02-08 22:53:53,250 ERROR [Driver] payments.RemedyDriver$: DStream checkpointing has been enabled but the DStreams with their functions are not serializable
org.apache.spark.SparkContext
Serialization stack: [same serialization stack as above]
2017-02-08 22:53:53,255 INFO [Driver] yarn.ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0
Update
Basically, what we are trying to do is convert the rdd to a DF [inside the foreachRDD method of the DStream], then apply DF APIs on top of it, and finally store the data in Cassandra. So we used sqlContext to convert the rdd to a DF, and that is where it throws the error.
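For reference, here is a minimal sketch of that intended pipeline with the SQLContext derived from the RDD inside the closure (the fix suggested in the answer below). The single-column schema, the keyspace/table names, and writing through the spark-cassandra-connector DataFrame source are assumptions for illustration, not something from the original post:

import org.apache.spark.sql.{Row, SaveMode}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical single-column schema for the incoming message strings.
val schema = StructType(Seq(StructField("payload", StringType)))

dStream.foreachRDD { rdd =>
  // Build the SQLContext from the RDD itself, so no driver-side
  // context is captured and serialized with the checkpointed DStream graph.
  val sqlContext = new HiveContext(rdd.sparkContext)
  val df = sqlContext.createDataFrame(rdd.map(Row(_)), schema)  // RDD[String] -> DF
  df.write
    .format("org.apache.spark.sql.cassandra")                   // spark-cassandra-connector, assumed on the classpath
    .options(Map("keyspace" -> "ks", "table" -> "events"))      // hypothetical names
    .mode(SaveMode.Append)
    .save()
}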
Answer 0 (score: 3)
If you want access to the SparkContext, do so through the rdd value:
dStream.foreachRDD(rdd => {
  // Derive the SQLContext from the RDD's own context inside the closure,
  // instead of capturing the driver-side sqlContext / sc.
  val sqlContext = new HiveContext(rdd.context)
  val dataFrameSchema = sqlContext.createDataFrame(rdd, schema)
})
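If constructing a HiveContext on every batch is a concern, a common variant (a sketch, not from the original answer; it assumes Spark 1.6, where SQLContext.getOrCreate returns a lazily-created singleton) is:

import org.apache.spark.sql.{Row, SQLContext}

dStream.foreachRDD { rdd =>
  // The singleton SQLContext is built on the first batch and reused afterwards.
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  val df = sqlContext.createDataFrame(rdd.map(Row(_)), schema)
}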
Whereas this:
dStream.foreachRDD(rdd => {
  // Using the driver-side sparkContext / sqlContext here throws the error,
  // because `sc` from the enclosing scope is pulled into the closure.
  val numRDD = sc.parallelize(1 to 10, 2)
  log.info("NUM RDD COUNT: " + numRDD.count())
})
causes the SparkContext to be serialized as part of the closure, which fails because SparkContext itself is not serializable.
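For completeness, the same batch-local work succeeds when the context is taken from the rdd parameter instead of the enclosing scope (a sketch; log stands for whatever driver-side logger the snippet above uses):

dStream.foreachRDD(rdd => {
  // rdd is a parameter of this closure, so referencing rdd.sparkContext
  // captures nothing from the driver scope when the DStream graph is checkpointed.
  val numRDD = rdd.sparkContext.parallelize(1 to 10, 2)
  log.info("NUM RDD COUNT: " + numRDD.count())
})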