I want to get the employeeId from employee_table and then use that id to query the employee_address table for the address.
The tables themselves are fine, but when I run the code below I get org.apache.spark.SparkException: Task not serializable.
I think I understand the problem: the sparkContext only exists on the master, not on the workers. What I don't know is how to fix it.
val employeeRDDRdd = sc.cassandraTable("local_keyspace", "employee_table")

try {
  val data = employeeRDDRdd
    .map(row => {
      row.getStringOption("employeeID") match {
        case Some(s) if (s != null) && s.nonEmpty => s
        case _ => ""
      }
    })
  // Create (employee id, address) tuples, filtering out employees whose address is empty.
  val id = data
    .map(s => (s, getID(s)))
    .filter(tups => tups._2.nonEmpty)
  // Print the total size of the RDD.
  println(id.count())
} catch {
  case e: Exception => e.printStackTrace()
}

def getID(employeeID: String): String = {
  val addressRDD = sc.cassandraTable("local_keyspace", "employee_address")
  val data = addressRDD.map(row => row.getStringOption("address") match {
    case Some(s) if (s != null) && s.nonEmpty => s
    case _ => ""
  })
  data.collect()(0)
}
Exception ==>

org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2039)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:366)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:365)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.RDD.map(RDD.scala:365)
Answer 0 (score: 3)
The serialization problem is caused by

val addressRDD = sc.cassandraTable("local_keyspace", "employee_address")

This part is used inside the serialized lambda:

val id = data
  .map(s => (s, getID(s)))
All RDD transformations represent code that is executed remotely, which means their entire contents must be serializable. The SparkContext is not serializable, but it is needed for getID to work, so an exception is thrown. The basic rule is that you cannot touch the SparkContext, or any other RDD, from inside an RDD transformation.
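As a minimal sketch of that rule (assuming employee_address is small enough to collect on the driver and, purely for illustration, that it also has an employeeID column), the lookup can be done against a plain Map captured by the closure instead of the SparkContext:

// Sketch only: collect the small address table to the driver first.
// The "employeeID" column on employee_address is an assumption for illustration.
val addressById: Map[String, String] = sc
  .cassandraTable("local_keyspace", "employee_address")
  .flatMap(row => for {
    id   <- row.getStringOption("employeeID")
    addr <- row.getStringOption("address")
  } yield (id, addr))
  .collect()
  .toMap

val idWithAddress = employeeRDDRdd
  .flatMap(_.getStringOption("employeeID"))       // keep only rows that have an id
  .map(id => (id, addressById.getOrElse(id, ""))) // local Map lookup, no SparkContext in the lambda
  .filter { case (_, addr) => addr.nonEmpty }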
If what you are actually trying to do is join data in Cassandra, you have a few options.

Load both RDDs separately and perform a Spark join
// Key both RDDs by the join column (assumed here to be "employeeID") so join is available.
val leftRdd = sc.cassandraTable("test", "table1").keyBy(row => row.getString("employeeID"))
val rightRdd = sc.cassandraTable("test", "table2").keyBy(row => row.getString("employeeID"))
leftRdd.join(rightRdd)
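For example, a sketch of pulling the final (employeeId, address) pairs out of the joined RDD, using the question's column names as an assumption:

val idToAddress = leftRdd.join(rightRdd)
  .map { case (id, (_, addressRow)) => (id, addressRow.getStringOption("address").getOrElse("")) }
  .filter { case (_, address) => address.nonEmpty }
println(idToAddress.count())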