org.apache.spark.SparkException: Task not serializable in Spark Scala

Date: 2016-12-14 10:41:18

Tags: scala apache-spark cassandra

I want to get the employeeId from employee_table and use this id to query the employee_address table to get the address.

There is no problem with the tables. But when I run the following code I get org.apache.spark.SparkException: Task not serializable.

I think I know what the problem is: the sparkContext lives on the master, not on the workers. But I don't know how to solve it.

val employeeRDDRdd = sc.cassandraTable("local_keyspace", "employee_table")


try {

  val data = employeeRDDRdd
    .map(row => {
      row.getStringOption("employeeID") match {
        case Some(s) if (s != null) && s.nonEmpty => s
        case None => ""
      }
    })

    //create tuple of employee id and address. Filtering out cases where the address for an employee is empty.

  val id = data
    .map(s => (s, getID(s)))
    .filter(tups => tups._2.nonEmpty)

    //printing out total size of rdd.
    println(id.count())




} catch {
  case e: Exception => e.printStackTrace()
}

def getID(employeeID: String): String = {
  val addressRDD = sc.cassandraTable("local_keyspace", "employee_address")
  val data = addressRDD.map(row => row.getStringOption("address") match {
    case Some(s) if s != null && s.nonEmpty => s
    case _ => ""
  })
  data.collect()(0)
}

Exception ==>

org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2039)
    at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:366)
    at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:365)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
    at org.apache.spark.rdd.RDD.map(RDD.scala:365)

1 Answer:

Answer 0 (score: 3)

Serialization Error Caused by the SparkContext Being Captured in a Lambda

The serialization issue is caused by
val addressRDD = sc.cassandraTable("local_keyspace", "employee_address")

This portion is then used inside a serialized lambda:

val id = data
  .map(s => (s,getID(s)))

All RDD transformations represent code that is executed remotely, which means their entire contents must be serializable.

The SparkContext is not serializable, but it is required for getID to work, so an exception is thrown. The basic rule is that you cannot touch the SparkContext inside any RDD transformation.
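
As a minimal illustration of that rule (using a throwaway RDD of integers, not the code from the question): any transformation whose closure references sc fails the same way, because the closure has to be serialized and shipped to the executors, while the SparkContext only exists on the driver.

val numbers = sc.parallelize(1 to 10)

// Fine: the closure only captures serializable values.
val doubled = numbers.map(_ * 2)

// Fails with Task not serializable: the closure captures sc itself.
val broken = numbers.map(n => sc.parallelize(Seq(n)).count())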

If you are actually trying to join data in Cassandra, you have a few options.

If you are just pulling rows based on the partition key, use joinWithCassandraTable:

https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md#using-joinwithcassandratable
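
A rough sketch of that option, assuming employee_address is partitioned by an employeeID column and exposes an address column (column names are taken from the question, not from a verified schema):

import com.datastax.spark.connector._

// Extract the ids with ordinary transformations; no SparkContext inside any lambda.
val ids = employeeRDDRdd
  .flatMap(_.getStringOption("employeeID"))
  .filter(_.nonEmpty)
  .map(Tuple1(_))    // tuple fields are matched against the target table's partition key

// The connector looks up the matching employee_address rows for each key on the Cassandra side.
val idToAddress = ids
  .joinWithCassandraTable("local_keyspace", "employee_address")
  .map { case (Tuple1(id), row) => (id, row.getStringOption("address").getOrElse("")) }
  .filter { case (_, address) => address.nonEmpty }

println(idToAddress.count())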

If you are trying to join on some other field,

load the two RDDs separately and perform a Spark join:

// Key both RDDs by the shared join column before calling join; the column name "id" is just a placeholder.
val leftRdd = sc.cassandraTable("test", "table1").keyBy(row => row.getString("id"))
val rightRdd = sc.cassandraTable("test", "table2").keyBy(row => row.getString("id"))
leftRdd.join(rightRdd)
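
Applied to the question's tables, a hedged version of that sketch could look like the following, assuming both tables carry an employeeID column (again, not verified against the real schema):

val employees = sc.cassandraTable("local_keyspace", "employee_table")
  .keyBy(row => row.getString("employeeID"))

val addresses = sc.cassandraTable("local_keyspace", "employee_address")
  .keyBy(row => row.getString("employeeID"))

// Plain Spark join on the chosen key; the SparkContext is never referenced inside a lambda.
val idToAddress = employees.join(addresses)
  .mapValues { case (_, addressRow) => addressRow.getStringOption("address").getOrElse("") }
  .filter { case (_, address) => address.nonEmpty }

println(idToAddress.count())

Unlike joinWithCassandraTable, this scans both tables in full before joining, so it is the better fit when the join key is not the partition key.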