Spark Task Not Serializable / Class Not Defined for Third-Party Jar

Date: 2018-11-25 15:16:16

Tags: scala apache-spark mapreduce

I have been searching Google and Stack Overflow for a week, but still cannot find a good answer.

I have a dataset of chemical compounds, and I need a third-party Jar to read these compounds in SDF (a JSON-like data format). Then I have to compute the similarity between different compounds. The reading and the computation involve very complicated chemistry details, so I cannot reproduce the functionality myself. In other words, I have to run a function from this third-party Jar inside a map function on Spark. The Jar file is called JCompoundMapper. It reads atom bonds iteratively using a DFS algorithm, which appears quite complicated. In any case, this thread is not about reading chemical compounds; it is about how to map a third-party jar on Spark. When I tried to do this, I ran into a task-not-serializable problem:

import de.zbit.jcmapper.distance.DistanceTanimoto
import de.zbit.jcmapper.distance.IDistanceMeasure
import de.zbit.jcmapper.fingerprinters.EncodingFingerprint
import de.zbit.jcmapper.fingerprinters.features.FeatureMap
import de.zbit.jcmapper.fingerprinters.features.IFeature
import de.zbit.jcmapper.fingerprinters.topological.Encoding2DAllShortestPath
import de.zbit.jcmapper.fingerprinters.topological.Encoding2DCATS
import de.zbit.jcmapper.fingerprinters.topological.Encoding2DECFP
import de.zbit.jcmapper.io.reader.RandomAccessMDLReader
import de.zbit.jcmapper.io.writer.ExporterFullFingerprintCSV
import de.zbit.jcmapper.io.writer.ExporterFullFingerprintTABUnfolded
import de.zbit.jcmapper.io.writer.ExporterLinear
import de.zbit.jcmapper.io.writer.ExporterSDFProperty
import java.io.FileWriter
import java.util.{ArrayList, List}
import java.io.File

val reader: RandomAccessMDLReader = new RandomAccessMDLReader(new File("datasets/internal.sdf"))
val similarity: IDistanceMeasure = new DistanceTanimoto()
val fingerprinter: Encoding2DAllShortestPath = new Encoding2DAllShortestPath()
val rawFeatures2: List[IFeature] = fingerprinter.getFingerprint(reader.getMol(0))
val rawFeatures: List[IFeature] = fingerprinter.getFingerprint(reader.getMol(1))
def getSimilarity( id1:Int, id2:Int ) : Double = {
    val featureMaps: List[FeatureMap] = new ArrayList[FeatureMap]()
    featureMaps.add(new FeatureMap(rawFeatures))
    featureMaps.add(new FeatureMap(rawFeatures2))
    val temp: Double = similarity.getSimilarity(featureMaps.get(0), featureMaps.get(1))
    return temp
}

val func = combinations.map(x => {
    getSimilarity(0, 1)
}).take(5)

Name: org.apache.spark.SparkException
Message: Task not serializable
StackTrace:   at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:345)
  at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:335)
  at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159)
  at org.apache.spark.SparkContext.clean(SparkContext.scala:2292)
  at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:371)
  at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:370)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
  at org.apache.spark.rdd.RDD.map(RDD.scala:370)
  ... 48 elided
Caused by: java.io.NotSerializableException: de.zbit.jcmapper.io.reader.RandomAccessMDLReader

I read other threads and understand that I have to put the variables and functions inside an object to make them serializable. However, when I do that, I get a null pointer exception instead:

object Holder {
    val reader:RandomAccessMDLReader = new RandomAccessMDLReader(new File("datasets/internal.sdf"))
    val similarity: IDistanceMeasure = new DistanceTanimoto()
    val fingerprinter: Encoding2DAllShortestPath = new Encoding2DAllShortestPath()
    val rawFeatures2: List[IFeature] = fingerprinter.getFingerprint(reader.getMol(0))
    val rawFeatures: List[IFeature] = fingerprinter.getFingerprint(reader.getMol(1))
    def getSimilarity( id1:Int, id2:Int ) : Double = {
        val featureMaps: List[FeatureMap] = new ArrayList[FeatureMap]()
        featureMaps.add(new FeatureMap(rawFeatures))
        featureMaps.add(new FeatureMap(rawFeatures2))
        val temp: Double = similarity.getSimilarity(featureMaps.get(0), featureMaps.get(1))
        return temp
    }
}


val func = combinations.map(x => {
    Holder.getSimilarity(0, 1)
}).take(5)


Name: org.apache.spark.SparkException
Message: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, ip-10-245-2-223.ec2.internal, executor 1): java.lang.NullPointerException
    at de.zbit.jcmapper.io.reader.RandomAccessMDLReader.setRanges(Unknown Source)
    at de.zbit.jcmapper.io.reader.RandomAccessMDLReader.<init>(Unknown Source)
    at $line49.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$Holder$.<init>(<console>:78)
    at $line49.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.Holder$lzycompute(<console>:77)
    at $line49.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.Holder(<console>:77)
    at $line57.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:85)
    at $line57.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:84)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$$anon$10.next(Iterator.scala:393)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
    at scala.collection.AbstractIterator.to(Iterator.scala:1336)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)

For the reading part, I could use a huge LinkedHashMap and store all the compounds there. However, I still have to call the getSimilarity() function from the third-party jar to compute the similarity. So even if I only use the getSimilarity() function, I get the null pointer exception when I put it inside an object, and the task-not-serializable problem when I put it outside one. I therefore have a few questions I cannot find good answers to:

  1. Does Spark support shipping a third-party Jar to every executor? Regarding the reader's file: does Spark distribute the reader class to each executor and have each one read the file separately, or does it read the file once as a whole and then distribute smaller pieces of it to each executor?
  2. Why do I get the null pointer exception? It seems the object did solve the serialization problem, but not the null pointer exception.
  3. I am a new data engineer, not yet a Spark expert, but I am willing to learn the best practice for mapping a third-party jar onto Spark and running its functions in a distributed way.

Thank you very much for any answers! I really appreciate your help!

Best, Michael

1 Answer:

Answer 0 (score: 2):

I think the problem is in this line:

val reader:RandomAccessMDLReader = new RandomAccessMDLReader(new File("datasets/internal.sdf"))

By putting this code in an object, every JVM your Spark job runs in has to initialize it. So, effectively, this code tries to read the file datasets/internal.sdf from the local filesystem of whichever machine in your Spark cluster it happens to run on. Is that file available everywhere?
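
One way to make the file available on every executor is Spark's own file-distribution mechanism. The following is a minimal sketch, not from the original answer: it assumes sc.addFile/SparkFiles (standard Spark APIs, with sc being the SparkContext) and reuses the JCompoundMapper calls from the question; the path datasets/internal.sdf is the one from the question.

import java.io.File
import java.util.{ArrayList, List}
import org.apache.spark.SparkFiles
import de.zbit.jcmapper.distance.{DistanceTanimoto, IDistanceMeasure}
import de.zbit.jcmapper.fingerprinters.features.{FeatureMap, IFeature}
import de.zbit.jcmapper.fingerprinters.topological.Encoding2DAllShortestPath
import de.zbit.jcmapper.io.reader.RandomAccessMDLReader

// Driver side: ship the SDF file to the working directory of every executor.
sc.addFile("datasets/internal.sdf")

object Holder {
    // A Scala object is initialized lazily, once per JVM. By the time an
    // executor first touches Holder, SparkFiles.get resolves the executor's
    // local copy of the file, so the constructor no longer opens a path
    // that exists only on the driver.
    lazy val reader: RandomAccessMDLReader =
        new RandomAccessMDLReader(new File(SparkFiles.get("internal.sdf")))
    lazy val similarity: IDistanceMeasure = new DistanceTanimoto()
    lazy val fingerprinter: Encoding2DAllShortestPath = new Encoding2DAllShortestPath()

    def getSimilarity(id1: Int, id2: Int): Double = {
        // Unlike the snippet in the question, the ids are actually used here.
        val featureMaps: List[FeatureMap] = new ArrayList[FeatureMap]()
        featureMaps.add(new FeatureMap(fingerprinter.getFingerprint(reader.getMol(id1))))
        featureMaps.add(new FeatureMap(fingerprinter.getFingerprint(reader.getMol(id2))))
        similarity.getSimilarity(featureMaps.get(0), featureMaps.get(1))
    }
}

val func = combinations.map(x => Holder.getSimilarity(0, 1)).take(5)

Note that the closure now only references Holder's method, so nothing unserializable is captured. One caveat: all tasks in an executor share the single reader, and whether RandomAccessMDLReader is thread-safe is something to check in the library's documentation.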

If you are not ready to put the file everywhere, you can try putting it on the classpath and reading it as a resource, as sketched below.
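
A minimal sketch of that classpath approach (the resource name /internal.sdf and packaging at the jar root are assumptions; since RandomAccessMDLReader takes a File, the resource is first copied to a local temp file once per JVM):

import java.io.File
import java.nio.file.{Files, StandardCopyOption}
import de.zbit.jcmapper.io.reader.RandomAccessMDLReader

object Holder {
    lazy val reader: RandomAccessMDLReader = {
        // The application jar (and therefore the bundled resource) is shipped
        // to every executor, so this initializer works on any node. The reader
        // needs a real File, so the resource stream is copied to a temp file.
        val in = getClass.getResourceAsStream("/internal.sdf")
        val tmp = File.createTempFile("internal", ".sdf")
        tmp.deleteOnExit()
        Files.copy(in, tmp.toPath, StandardCopyOption.REPLACE_EXISTING)
        in.close()
        new RandomAccessMDLReader(tmp)
    }
}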