Scala: Task not serializable

Time: 2018-04-10 06:26:27

Tags: scala apache-spark exception serialization spark-cassandra-connector

I have the following code:

import java.text.SimpleDateFormat
import java.util.Calendar

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

case class event(imei: String, date: String, gpsdt: String, entrygpsdt: String, lastgpsdt: String)

object recalculate extends Serializable {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("RecalculateOdo")
      .set("spark.cassandra.connection.host", "192.168.0.78")
      .set("spark.cassandra.connection.keep_alive_ms", "20000")

    val sc = SparkContext.getOrCreate(conf)

    val rdd = sc.cassandraTable("db", "table")
      .select("imei", "date", "gpsdt")
      .where("imei=? and date=? and gpsdt>? and gpsdt<?", entry(0), entry(1), entry(2), entry(3))
    var lastgpsdt = "2018-04-06 10:10:10"
    rdd.foreach(f => {
      val imei = f.get[String]("imei")
      val date = f.get[String]("date")
      val gpsdt = f.get[String]("gpsdt")
      val now = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(Calendar.getInstance().getTime())
      val collection = sc.parallelize(Seq(event(imei, date, gpsdt, now, lastgpsdt)))
      collection.saveToCassandra("db", "table", SomeColumns("imei", "date", "gpsdt", "entrygpsdt", "lastgpsdt"))
      lastgpsdt = gpsdt
    })
  }
}

Whenever I try to run the code, I get a Task not serializable error:

Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)

Any suggestions, please? Thanks.

1 answer:

Answer 0 (score: 2)

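SparkContext is not serializable. The function you pass to rdd.foreach is shipped to the executors, so it cannot capture sc; you should only access sc from the driver itself. Instead of rdd.foreach, use rdd.map, return event(imei, date, gpsdt, now) from it, and then save the resulting RDD to Cassandra. Something like: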

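// Build the events inside the closure; nothing here refers to sc.
val eventsRdd = rdd.map { f =>
  val imei = f.get[String]("imei")
  val date = f.get[String]("date")
  val gpsdt = f.get[String]("gpsdt")
  val now = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(Calendar.getInstance().getTime())
  // Note: the question's event case class also has a lastgpsdt field; this call
  // assumes that field has been dropped or given a default value.
  event(imei, date, gpsdt, now)
}
// Save the whole RDD in one call issued from the driver.
eventsRdd.saveToCassandra("db", "table", SomeColumns("imei", "date", "gpsdt", "entrygpsdt"))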
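To see why the original foreach fails, the error can be reproduced without Spark. As far as I understand it, the ClosureCleaner.ensureSerializable frame in the stack trace is Spark trying to serialize the closure before shipping it, and that fails as soon as the closure holds a reference to something that is not Serializable. The sketch below is only an illustration with plain Java serialization: FakeContext is a hypothetical stand-in for SparkContext, and ClosureCheck, isSerializable, capturing and selfContained are names made up for this example. It assumes Scala 2.12 or later, where function literals are themselves serializable.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for SparkContext: a class that does not extend Serializable.
class FakeContext {
  def describe(s: String): String = s"ctx:$s"
}

object ClosureCheck {
  // Returns true if obj can be written with Java serialization, false otherwise.
  def isSerializable(obj: AnyRef): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try {
      out.writeObject(obj)
      true
    } catch {
      case _: NotSerializableException => false
    } finally {
      out.close()
    }
  }

  // Captures ctx, like the foreach body in the question that calls sc.parallelize.
  def capturing(ctx: FakeContext): String => String = s => ctx.describe(s)

  // Uses only its own argument, like the map body in the answer.
  val selfContained: String => String = s => s.toUpperCase

  def main(args: Array[String]): Unit = {
    println(s"capturing closure serializable: ${isSerializable(capturing(new FakeContext))}")
    println(s"self-contained closure serializable: ${isSerializable(selfContained)}")
  }
}
```

The closure that drags FakeContext along is the one that cannot be serialized, which mirrors what happens when sc is referenced inside rdd.foreach.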

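Also note that if you have a lot of events, I would avoid creating a date formatter and computing the current time for every single event. You can do this once before the computation starts (or at least once per partition; see mapPartitions).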
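A minimal sketch of the once-per-partition idea follows. It is written as a plain function over an Iterator so that it runs without a cluster; StampedEvent, PerPartition and stamp are hypothetical names for this example, and the sketch assumes each row has already been reduced to an (imei, date, gpsdt) tuple.

```scala
import java.text.SimpleDateFormat
import java.util.Date

// Hypothetical four-field record mirroring the columns saved in the answer.
case class StampedEvent(imei: String, date: String, gpsdt: String, entrygpsdt: String)

object PerPartition {
  // Handles one partition: the formatter and the timestamp are created once,
  // then reused for every row the iterator yields.
  def stamp(rows: Iterator[(String, String, String)]): Iterator[StampedEvent] = {
    val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
    val now = fmt.format(new Date())
    rows.map { case (imei, date, gpsdt) => StampedEvent(imei, date, gpsdt, now) }
  }

  def main(args: Array[String]): Unit = {
    // A plain iterator standing in for the rows of one partition.
    val partition = Iterator(
      ("imei-1", "2018-04-06", "2018-04-06 10:10:10"),
      ("imei-1", "2018-04-06", "2018-04-06 10:10:20")
    )
    stamp(partition).foreach(println)
  }
}
```

With Spark, a function of this shape is what RDD.mapPartitions expects (Iterator[T] => Iterator[U]), so it could be applied as rdd.map(f => (f.get[String]("imei"), f.get[String]("date"), f.get[String]("gpsdt"))).mapPartitions(PerPartition.stamp). Creating the SimpleDateFormat inside the function also keeps it local to the task, which sidesteps the fact that SimpleDateFormat is not thread-safe.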