Here is my code:
val bg = imageBundleRDD.first()          // bg: [Text, BundleWritable]
val res = imageBundleRDD.map(data => {
  val desBundle = colorToGray(bg._2)     // lineA: NotSerializableException: org.apache.hadoop.io.Text
  //val desBundle = colorToGray(data._2) // lineB: everything is ok
  (data._1, desBundle)
})
println(res.count)
lineB works fine, but lineA throws: org.apache.spark.SparkException: Job aborted: Task not serializable: java.io.NotSerializableException: org.apache.hadoop.io.Text
I tried to use Kryo to solve the problem, but nothing seems to change:
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[Text])
    kryo.register(classOf[BundleWritable])
  }
}

System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
System.setProperty("spark.kryo.registrator", "hequn.spark.reconstruction.MyRegistrator")
val sc = new SparkContext(...
Thanks!!!
Answer 0 (score: 1)
I ran into a similar problem when my Java code was reading sequence files containing Text keys. I found this post helpful.
In my case, I used map to convert the Text into a String:
JavaPairRDD<String, VideoRecording> mapped = videos.map(new PairFunction<Tuple2<Text, VideoRecording>, String, VideoRecording>() {
    @Override
    public Tuple2<String, VideoRecording> call(Tuple2<Text, VideoRecording> kv) throws Exception {
        // Necessary to copy the value, as Hadoop chooses to reuse objects
        VideoRecording vr = new VideoRecording(kv._2);
        return new Tuple2<String, VideoRecording>(kv._1.toString(), vr);
    }
});
Note this comment in the API docs for the sequenceFile method of JavaSparkContext:
Note: Because Hadoop's RecordReader class re-uses the same Writable object for each record, directly caching the returned RDD will create many references to the same object. If you plan to directly cache Hadoop writable objects, you should first copy them using a map function.
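The same caution applies when reading the file from Scala. Below is a minimal sketch in the spirit of the question's types; the path is made up, and it assumes BundleWritable has a copy constructor, which the question does not show:

import org.apache.hadoop.io.Text

// Read the sequence file with the raw Hadoop types.
val raw = sc.sequenceFile("path/to/images", classOf[Text], classOf[BundleWritable])

// Copy each record before caching or collecting it, because Hadoop's
// RecordReader reuses the same Writable instances. The Text key becomes a
// plain String; copying the value assumes a BundleWritable copy constructor.
val converted = raw.map { case (key, value) => (key.toString, new BundleWritable(value)) }
converted.cache()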
Answer 1 (score: 0)
The reason your code runs into the serialization problem is that your Kryo setup, while close, isn't quite right:
Change:
System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
System.setProperty("spark.kryo.registrator", "hequn.spark.reconstruction.MyRegistrator")
val sc = new SparkContext(...
to:
val sparkConf = new SparkConf()
  // ... set master, appname, etc., then:
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "hequn.spark.reconstruction.MyRegistrator")
val sc = new SparkContext(sparkConf)
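Note that with this version the serializer and registrator are set on the SparkConf before the SparkContext is constructed, so the Kryo settings are already in place when the context starts up.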
Answer 2 (score: 0)
When processing sequence files in Apache Spark, we have to follow these techniques:
-- Use the Java-equivalent data types in place of the Hadoop data types.
-- Spark automatically converts the Writables into their Java-equivalent types.

For example, suppose we have a sequence file "xyz" whose key type is Text and whose value type is LongWritable. When we use this file to create an RDD, we need to use their Java-equivalent data types, i.e. String and Long respectively:

val mydata = sc.sequenceFile[String, Long]("path/to/xyz")
mydata.collect