I have the following code:
import java.text.SimpleDateFormat
import java.util.Calendar

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

case class event(imei: String, date: String, gpsdt: String, entrygpsdt: String, lastgpsdt: String)

object recalculate extends Serializable {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("RecalculateOdo")
      .set("spark.cassandra.connection.host", "192.168.0.78")
      .set("spark.cassandra.connection.keep_alive_ms", "20000")
    val sc = SparkContext.getOrCreate(conf)
    // entry holds the query parameters; it is defined elsewhere
    val rdd = sc.cassandraTable("db", "table")
      .select("imei", "date", "gpsdt")
      .where("imei=? and date=? and gpsdt>? and gpsdt<?", entry(0), entry(1), entry(2), entry(3))
    var lastgpsdt = "2018-04-06 10:10:10"
    rdd.foreach { f =>
      val imei = f.get[String]("imei")
      val date = f.get[String]("date")
      val gpsdt = f.get[String]("gpsdt")
      val now = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(Calendar.getInstance().getTime())
      val collection = sc.parallelize(Seq(event(imei, date, gpsdt, now, lastgpsdt)))
      collection.saveToCassandra("db", "table", SomeColumns("imei", "date", "gpsdt", "entrygpsdt", "lastgpsdt"))
      lastgpsdt = gpsdt
    }
  }
}
Whenever I try to run the code, I get a Task not serializable error:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
Any suggestions please, thanks.
Answer (score: 2)
SparkContext is not serializable; you should only use it from the driver itself. Instead of rdd.foreach, use rdd.map and return event(imei, date, gpsdt, now), then save that result to Cassandra. Something like this:
val eventsRdd = rdd.map { f =>
  val imei = f.get[String]("imei")
  val date = f.get[String]("date")
  val gpsdt = f.get[String]("gpsdt")
  val now = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(Calendar.getInstance().getTime())
  // Note: lastgpsdt is dropped here (mutating driver-side state from a
  // distributed closure does not work), so the case class would only need
  // these four fields.
  event(imei, date, gpsdt, now)
}
eventsRdd.saveToCassandra("db", "table", SomeColumns("imei", "date", "gpsdt", "entrygpsdt"))
Also note that if you have a lot of events, I would consider not creating the date formatter and computing the current time for every single event. You can do that once before the computation starts (or at least once per partition; see mapPartitions), as in the sketch below.
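For illustration, a minimal per-partition sketch, assuming the same rdd and the four-field event case class from the answer above; mapPartitions is the only part the answer names, the rest is illustrative:

import java.text.SimpleDateFormat
import java.util.Calendar

val eventsRdd = rdd.mapPartitions { rows =>
  // Formatter and timestamp are created once per partition, not once per row.
  val formatter = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
  val now = formatter.format(Calendar.getInstance().getTime())
  rows.map { f =>
    event(f.get[String]("imei"), f.get[String]("date"), f.get[String]("gpsdt"), now)
  }
}
eventsRdd.saveToCassandra("db", "table", SomeColumns("imei", "date", "gpsdt", "entrygpsdt"))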