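I am trying to save streaming data from Kafka into Cassandra. I can read and parse the data, but when I call the lines below to save it, I get a Task not serializable exception. My class extends Serializable, so I am not sure why I am seeing this error. Three hours of googling did not turn up much help. Can somebody give me any pointers?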
val collection = sc.parallelize(Seq((obj.id, obj.data)))
collection.saveToCassandra("testKS", "testTable ", SomeColumns("id", "data"))
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SaveMode
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.KafkaUtils
import com.datastax.spark.connector._
import kafka.serializer.StringDecoder
import org.apache.spark.rdd.RDD
import com.datastax.spark.connector.SomeColumns
import java.util.Formatter.DateTime
object StreamProcessor extends Serializable {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("StreamProcessor")
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(sparkConf)
    val ssc = new StreamingContext(sc, Seconds(2))
    val sqlContext = new SQLContext(sc)
    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
    val topics = args.toSet
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)
    stream.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        try {
          rdd.foreachPartition { iter =>
            iter.foreach {
              case (key, msg) =>
                val obj = msgParseMaster(msg)
                val collection = sc.parallelize(Seq((obj.id, obj.data)))
                collection.saveToCassandra("testKS", "testTable ", SomeColumns("id", "data"))
            }
          }
        }
      }
    }
    ssc.start()
    ssc.awaitTermination()
  }
  import org.json4s._
  import org.json4s.native.JsonMethods._
  case class wordCount(id: Long, data: String) extends serializable
  implicit val formats = DefaultFormats
  def msgParseMaster(msg: String): wordCount = {
    val m = parse(msg).extract[wordCount]
    return m
  }
}
I am getting
org.apache.spark.SparkException: Task not serializable
Below is the full log
16/08/06 10:24:52 ERROR JobScheduler: Error running job streaming job 1470504292000 ms.0
org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2055)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:919)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:918)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:918)
    at
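Answer 0 (score: 2)
SparkContext isn't serializable, so you can't use it inside the foreachPartition closure nested in your foreachRDD, and from the look of your processing graph you don't need it. Instead, you can simply map over the stream, parse out the relevant data, and save each resulting RDD to Cassandra: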
stream
  .map {
    case (_, msg) =>
      val result = msgParseMaster(msg)
      (result.id, result.data)
  }
  .foreachRDD(rdd => if (!rdd.isEmpty)
    rdd.saveToCassandra("testKS",
      "testTable",
      SomeColumns("id", "data")))
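As a possible variation, the Spark Cassandra Connector also ships a streaming package whose implicits add saveToCassandra to a DStream, so the foreachRDD wrapper may not be needed at all. The sketch below assumes the connector's streaming module is on the classpath; the object name SaveStream and the parameter name parsed are hypothetical, and parsed stands for the (id, data) stream produced by the map step above. Unlike the snippet above, it does not skip empty batches explicitly.

```scala
import com.datastax.spark.connector.SomeColumns
import com.datastax.spark.connector.streaming._ // adds saveToCassandra to DStream
import org.apache.spark.streaming.dstream.DStream

// Hypothetical helper, not part of the original answer.
object SaveStream {
  // `parsed` is assumed to hold (id, data) pairs, matching the
  // columns of testKS.testTable used in the question.
  def save(parsed: DStream[(Long, String)]): Unit =
    parsed.saveToCassandra("testKS", "testTable", SomeColumns("id", "data"))
}
```

Either form keeps SparkContext out of the code that runs on the executors, which is the point of the fix.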
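Answer 1 (score: 1)
You can't call sc.parallelize inside the function passed to foreachPartition. That function has to be serialized and sent to each executor, and SparkContext is (intentionally) not serializable: it should only exist in the driver application, not in the executors.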
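The capture problem described above can be reproduced without Spark. The following is a minimal sketch using plain Java serialization, which is roughly what Spark checks before shipping a task; DriverOnlyContext is a hypothetical stand-in for SparkContext, and the other names are also invented for illustration.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object ClosureCaptureDemo {
  // Hypothetical stand-in for SparkContext: deliberately NOT Serializable.
  class DriverOnlyContext {
    def describe(x: Int): String = s"value $x"
  }

  // Returns true if the object survives Java serialization, false if a
  // non-serializable object is reachable from it.
  def canSerialize(obj: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      try out.writeObject(obj) finally out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }

  // Builds one closure that captures the context and one that does not,
  // and reports whether each can be serialized.
  def demo(): (Boolean, Boolean) = {
    val ctx = new DriverOnlyContext
    val usesContext: Int => String = x => ctx.describe(x) // captures ctx
    val selfContained: Int => String = x => s"value $x"   // captures nothing
    (canSerialize(usesContext), canSerialize(selfContained))
  }

  def main(args: Array[String]): Unit =
    println(demo())
}
```

The first closure drags ctx along with it, so serializing it fails, while the second has nothing extra to carry. Marking the enclosing object as Serializable does not help, because the failure comes from the captured value itself.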