Apache Spark, NotSerializableException: org.apache.hadoop.io.Text

Date: 2014-01-12 03:59:14

Tags: apache-spark notserializableexception

Here is my code:

  val bg = imageBundleRDD.first()    //bg:[Text, BundleWritable]
  val res= imageBundleRDD.map(data => {
                                val desBundle = colorToGray(bg._2)        //lineA:NotSerializableException: org.apache.hadoop.io.Text
                                //val desBundle = colorToGray(data._2)    //lineB:everything is ok
                                (data._1, desBundle)
                             })
  println(res.count)

lineB works fine, but lineA throws: org.apache.spark.SparkException: Job aborted: Task not serializable: java.io.NotSerializableException: org.apache.hadoop.io.Text

I tried using Kryo to work around the problem, but nothing seems to change:

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

class MyRegistrator extends KryoRegistrator {
    override def registerClasses(kryo: Kryo) {
       kryo.register(classOf[Text])
       kryo.register(classOf[BundleWritable])
  }
}

System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
System.setProperty("spark.kryo.registrator", "hequn.spark.reconstruction.MyRegistrator")
val sc = new SparkContext(...

Thanks!!!

3 Answers:

Answer 0 (score: 1)

I ran into a similar problem when my Java code was reading sequence files containing Text keys. I found this post helpful:

http://apache-spark-user-list.1001560.n3.nabble.com/How-to-solve-java-io-NotSerializableException-org-apache-hadoop-io-Text-td2650.html

In my case, I converted the Text to a String using map:
JavaPairRDD<String, VideoRecording> mapped = videos.map(new PairFunction<Tuple2<Text,VideoRecording>,String,VideoRecording>() {
    @Override
    public Tuple2<String, VideoRecording> call(
            Tuple2<Text, VideoRecording> kv) throws Exception {
        // Necessary to copy value as Hadoop chooses to reuse objects
        VideoRecording vr = new VideoRecording(kv._2);
        return new Tuple2(kv._1.toString(), vr);
    }
});

Note this comment in the API documentation for the sequenceFile method of JavaSparkContext:

Note: Because Hadoop's RecordReader class re-uses the same Writable object for each record, directly caching the returned RDD will create many references to the same object. If you plan to directly cache Hadoop writable objects, you should first copy them using a map function.
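
As a minimal Scala sketch of that copy-before-cache advice (the path and the value type here are hypothetical, and sc is assumed to be an existing SparkContext):

import org.apache.hadoop.io.{LongWritable, Text}

// toString and get copy the data out of the reused Writable instances,
// so the cached RDD holds independent (String, Long) pairs.
val safeToCache = sc.sequenceFile("hdfs:///path/to/records", classOf[Text], classOf[LongWritable])
  .map { case (k, v) => (k.toString, v.get) }
  .cache()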

Answer 1 (score: 0)

The reason your code runs into the serialization problem is that your Kryo setup is close, but not quite right:

Change:

System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
System.setProperty("spark.kryo.registrator", "hequn.spark.reconstruction.MyRegistrator")
val sc = new SparkContext(...

to:

val sparkConf = new SparkConf()
  // ... set master, appname, etc, then:
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "hequn.spark.reconstruction.MyRegistrator")

val sc = new SparkContext(sparkConf)
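
Note that these properties are read when the SparkContext is constructed, so they have to be on the SparkConf before new SparkContext(sparkConf) is called. On newer Spark releases (1.2+) you can also skip the custom registrator and register the classes directly on the conf; a rough sketch, reusing the classes from the question and a made-up app name:

import org.apache.hadoop.io.Text
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: register the Writable classes with Kryo directly on the conf
// instead of going through a KryoRegistrator subclass (Spark 1.2+).
// BundleWritable is the asker's own class and needs its own import.
val sparkConf = new SparkConf()
  .setAppName("reconstruction")   // hypothetical app name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Text], classOf[BundleWritable]))

val sc = new SparkContext(sparkConf)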

Answer 2 (score: 0)

When working with sequence files in Apache Spark, we have to follow these techniques:

 -- Use the Java-equivalent data types in place of the Hadoop data types.
 -- Spark automatically converts the Writables into their Java-equivalent types.

Example: suppose we have a sequence file "xyz" where the key type is Text and the value type is LongWritable. When we use this file to create an RDD, we need to use their Java-equivalent data types, i.e. String and Long respectively.

 val mydata = sc.sequenceFile[String, Long]("path/to/xyz")
 mydata.collect
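
As a rough follow-on sketch (continuing the snippet above): because the RDD now holds plain (String, Long) pairs, a value taken from it can be referenced inside a closure, much like lineA in the question tried to do, without triggering the NotSerializableException:

 // Continuing the snippet: mydata holds plain (String, Long) pairs,
 // so a value pulled out with first() is safe to capture in a closure.
 val first = mydata.first()
 val shifted = mydata.map { case (k, v) => (k, v + first._2) }
 println(shifted.count())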