Environment: Spark 1.6.0, using Scala. The program compiles with sbt, but when I submit it, it fails with an error. The full error output is:
17/01/21 18:32:24 INFO net.NetworkTopology: Adding a new node: /YH11070029/10.39.0.213:50010
17/01/21 18:32:24 INFO storage.BlockManagerMasterEndpoint: Registering block manager 10.39.0.44:41961 with 2.7 GB RAM, BlockManagerId(349, 10.39.0.44, 41961)
17/01/21 18:32:24 INFO storage.BlockManagerMasterEndpoint: Registering block manager 10.39.2.178:48591 with 2.7 GB RAM, BlockManagerId(518, 10.39.2.178, 48591)
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2055)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKeyWithClassTag$1.apply(PairRDDFunctions.scala:93)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKeyWithClassTag$1.apply(PairRDDFunctions.scala:82)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.PairRDDFunctions.combineByKeyWithClassTag(PairRDDFunctions.scala:82)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$aggregateByKey$1.apply(PairRDDFunctions.scala:177)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$aggregateByKey$1.apply(PairRDDFunctions.scala:166)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.PairRDDFunctions.aggregateByKey(PairRDDFunctions.scala:166)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$aggregateByKey$3.apply(PairRDDFunctions.scala:206)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$aggregateByKey$3.apply(PairRDDFunctions.scala:206)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.PairRDDFunctions.aggregateByKey(PairRDDFunctions.scala:205)
at com.sina.adalgo.feature.ETL$$anonfun$13.apply(ETL.scala:190)
at com.sina.adalgo.feature.ETL$$anonfun$13.apply(ETL.scala:102)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
The purpose of the code is to count the frequencies of the categorical features. The main code is as follows:
object ETL extends Serializable {
  ... ...

  // Pair each categorical feature value with its index: (featureIndex, featureValue)
  val cateList = featureData.map {
    case (psid: String, label: String, cate_features: ParArray[String], media_features: String) =>
      val pair_feature = cate_features.zipWithIndex.map(x => (x._2, x._1))
      pair_feature
  }.flatMap(_.toList)

  // Sequence op: count one occurrence of feature value s in the per-key frequency map
  def seqop(m: HashMap[String, Int], s: String): HashMap[String, Int] = {
    var x = m.getOrElse(s, 0)
    x += 1
    m += s -> x
    m
  }

  // Combine op: merge two per-key frequency maps
  def combop(m: HashMap[String, Int], n: HashMap[String, Int]): HashMap[String, Int] = {
    for (k <- n) {
      var x = m.getOrElse(k._1, 0)
      x += k._2
      m += k._1 -> x
    }
    m
  }

  val hash = HashMap[String, Int]()
  // Result rows are (i, HashMap[String, Int]), where i is the categorical feature index
  val feaFreq = cateList.aggregateByKey(hash)(seqop, combop)
The ETL object already extends Serializable. Why does this exception still occur? Could you help me?
Answer 0 (score: 0)
In my experience, this problem typically happens in Spark when the closure used as an aggregation function unintentionally captures some unneeded objects, and/or when it is simply a function defined inside the main class of your Spark driver code.

I suspect that may be the case here, since your stack trace points at org.apache.spark.util.ClosureCleaner as the top-level culprit.

This is problematic because, in that situation, when Spark tries to ship the function to the workers so they can perform the actual aggregation, it ends up serializing more than you actually intended: the function itself plus its enclosing class.
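As a rough illustration of what "serializing more than intended" means (this is a made-up minimal example, not your code; the class and field names are hypothetical), a plain method defined on a non-serializable driver class compiles to a reference through this, so using it in an RDD operation drags the whole enclosing instance into serialization:

import org.apache.spark.rdd.RDD

class DriverJob {                      // not Serializable, and holds non-serializable state
  val logHandle = new java.io.FileWriter("/tmp/job.log")

  // A plain method: using it in an RDD closure references `this`
  def increment(x: Int): Int = x + 1

  def run(rdd: RDD[Int]): RDD[Int] =
    rdd.map(increment)                 // Task not serializable: the closure captures `this`, including logHandle
}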
See also this post by Erik Erlandson, which explains some corner cases of closure serialization in detail, as well as the Spark 1.6 notes on closures.
A quick fix might be to move the definitions of the functions you pass to aggregateByKey into a separate object, completely decoupled from the rest of your code.
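For instance, a minimal sketch of that refactoring (the name FreqAggregators is mine, not from the original code), keeping the same seqop/combop signatures as in the question:

import scala.collection.mutable.HashMap

// Standalone object holding only the aggregation functions, so shipping them
// to the executors does not pull in the rest of the driver-side ETL object.
object FreqAggregators extends Serializable {
  def seqop(m: HashMap[String, Int], s: String): HashMap[String, Int] = {
    m += s -> (m.getOrElse(s, 0) + 1)
    m
  }

  def combop(m: HashMap[String, Int], n: HashMap[String, Int]): HashMap[String, Int] = {
    for ((k, v) <- n) m += k -> (m.getOrElse(k, 0) + v)
    m
  }
}

// In the driver code, the call site stays essentially the same:
// val feaFreq = cateList.aggregateByKey(HashMap[String, Int]())(FreqAggregators.seqop, FreqAggregators.combop)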