Spark broadcast / serialization error

Asked: 2014-06-26 19:20:34

Tags: scala apache-spark

I have created a Spark version of a Mahout job called "item similarity", with several tests that all pass against local[4] standalone Spark. The code can even read from and write to the cluster's HDFS. But switching to the clustered Spark master produces a problem that seems related to broadcast and/or serialization.

The code uses HashBiMap, a Guava Java class. Two of these are created for each Mahout drm (distributed matrix), for bi-directional lookup of row and column IDs. They are created once and then broadcast so they are accessible everywhere and not recalculated in every task.
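For context, here is a minimal sketch (with made-up IDs) of the bi-directional lookup a HashBiMap provides, which is why one map can replace a forward/reverse pair of HashMaps:

    import com.google.common.collect.HashBiMap

    // Illustrative only: external string IDs mapped to Mahout integer indices.
    val dict: HashBiMap[String, Integer] = HashBiMap.create[String, Integer]()
    dict.put("user-42", 0)
    dict.put("user-99", 1)

    dict.get("user-42")      // forward lookup: external ID -> index (returns 0)
    dict.inverse().get(1)    // reverse lookup: index -> external ID (returns "user-99")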

When I run on the clustered Spark I get the error below. At one point we used plain HashMaps and they seemed to work on the cluster, so I suspect the HashBiMap is causing the problem. I also suspect it may have something to do with serializing it for the broadcast. Here is the code and the error.

 // create BiMaps for bi-directional lookup of ID by either Mahout ID or external ID
 // broadcast them for access in distributed processes, so they are not recalculated in every task.
 // rowIDDictionary is a HashBiMap[String, Int]
 val rowIDDictionary = asOrderedDictionary(rowIDs) // this creates the HashBiMap in a non-distributed manner
 val rowIDDictionary_bcast = mc.broadcast(rowIDDictionary)

 val columnIDDictionary = asOrderedDictionary(columnIDs) // this creates the HashBiMap in a non-distributed manner
 val columnIDDictionary_bcast = mc.broadcast(columnIDDictionary)

 val indexedInteractions =
   interactions.map { case (rowID, columnID) =>   //<<<<<<<<<<< this is the stage being submitted before the error
     val rowIndex = rowIDDictionary_bcast.value.get(rowID).get
     val columnIndex = columnIDDictionary_bcast.value.get(columnID).get

     rowIndex -> columnIndex
   }

The error seems to occur when interactions.map executes, at the point where the _bcast vals are accessed. Any idea where to start looking?

14/06/26 11:23:36 INFO scheduler.DAGScheduler: Submitting Stage 9 (MappedRDD[17] at map at TextDelimitedReaderWriter.scala:83), which has no missing parents
14/06/26 11:23:36 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from Stage 9 (MappedRDD[17] at map at TextDelimitedReaderWriter.scala:83)
14/06/26 11:23:36 INFO scheduler.TaskSchedulerImpl: Adding task set 9.0 with 2 tasks
14/06/26 11:23:36 INFO scheduler.TaskSetManager: Starting task 9.0:0 as TID 16 on executor 0: occam4 (PROCESS_LOCAL)
14/06/26 11:23:36 INFO scheduler.TaskSetManager: Serialized task 9.0:0 as 2418 bytes in 0 ms
14/06/26 11:23:36 INFO scheduler.TaskSetManager: Starting task 9.0:1 as TID 17 on executor 0: occam4 (PROCESS_LOCAL)
14/06/26 11:23:36 INFO scheduler.TaskSetManager: Serialized task 9.0:1 as 2440 bytes in 0 ms
14/06/26 11:23:36 WARN scheduler.TaskSetManager: Lost TID 16 (task 9.0:0)
14/06/26 11:23:36 WARN scheduler.TaskSetManager: Loss was due to java.lang.NullPointerException
java.lang.NullPointerException
    at com.google.common.collect.HashBiMap.seekByKey(HashBiMap.java:180)
    at com.google.common.collect.HashBiMap.put(HashBiMap.java:230)
    at com.google.common.collect.HashBiMap.put(HashBiMap.java:218)
    at com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:135)
    at com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:17)
    at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
    at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:102)
    at org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:165)
    at org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:969)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1871)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1775)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1327)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1969)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1775)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1327)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1969)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1775)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1327)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:349)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
    at org.apache.spark.scheduler.ShuffleMapTask$.deserializeInfo(ShuffleMapTask.scala:69)
    at org.apache.spark.scheduler.ShuffleMapTask.readExternal(ShuffleMapTask.scala:138)
    at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1814)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1773)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1327)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:349)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
    at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:42)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:41)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:41)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
    at java.lang.Thread.run(Thread.java:662)

1 Answer:

Answer 0 (score: 2):

It looks like you are using Kryo serialization. Are you also using it in your local tests? If Kryo serialization of HashBiMap is not working, you may want to explicitly register that class with Kryo so that it uses Java serialization instead.
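A sketch of that workaround, assuming Spark's standard KryoRegistrator hook and Kryo's bundled JavaSerializer (the class and package names below are illustrative, not from the original code):

    import com.esotericsoftware.kryo.Kryo
    import com.esotericsoftware.kryo.serializers.JavaSerializer
    import com.google.common.collect.HashBiMap
    import org.apache.spark.serializer.KryoRegistrator

    // Route HashBiMap through Java serialization rather than Kryo's generic
    // MapSerializer. MapSerializer rebuilds a map by calling put() on an
    // instance whose internal tables were never initialized, which matches
    // the NullPointerException in HashBiMap.seekByKey in the trace above.
    class MyKryoRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit = {
        kryo.register(classOf[HashBiMap[_, _]], new JavaSerializer())
      }
    }

Then point Spark at the registrator when building the context:

    val conf = new org.apache.spark.SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "mypackage.MyKryoRegistrator") // hypothetical package

This keeps Kryo for everything else while letting the one problematic class fall back to Java serialization.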