How can I get Spark to serialize an object using Kryo?

Asked: 2015-02-17 03:28:30

Tags: serialization apache-spark kryo

I want to pass an object from the driver node to the other nodes where an RDD resides, so that each partition of the RDD can access that object, as in the snippet below.

package xt

import com.esotericsoftware.kryo.{Kryo, Serializer}
import com.esotericsoftware.kryo.io.{Input, Output}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.serializer.KryoRegistrator

object HelloSpark {
    def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
                .setAppName("Testing HelloSpark")
                .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                .set("spark.kryo.registrator", "xt.HelloKryoRegistrator")

        val sc = new SparkContext(conf)
        val rdd = sc.parallelize(1 to 20, 4)
        val bytes = new ImmutableBytesWritable(Bytes.toBytes("This is a test"))

        rdd.map(x => x.toString + "-" + Bytes.toString(bytes.get) + " !")
            .collect()
            .foreach(println)

        sc.stop()
    }
}

// My registrator
class HelloKryoRegistrator extends KryoRegistrator {
    override def registerClasses(kryo: Kryo) = {
        kryo.register(classOf[ImmutableBytesWritable], new HelloSerializer())
    }
}

// My serializer
class HelloSerializer extends Serializer[ImmutableBytesWritable] {
    override def write(kryo: Kryo, output: Output, obj: ImmutableBytesWritable): Unit = {
        output.writeInt(obj.getLength)
        output.writeInt(obj.getOffset)
        output.writeBytes(obj.get(), obj.getOffset, obj.getLength)
    }

    override def read(kryo: Kryo, input: Input, t: Class[ImmutableBytesWritable]): ImmutableBytesWritable = {
        val length = input.readInt()
        val offset = input.readInt()   // consumed but unused: the copy below starts at 0
        val bytes  = new Array[Byte](length)
        // Fill the fresh array from position 0; passing `offset` here would
        // overflow, since it refers to the original backing array.
        input.read(bytes, 0, length)

        new ImmutableBytesWritable(bytes)
    }
}

In the snippet above I am trying to serialize an ImmutableBytesWritable with Kryo in Spark, so I did the following:

  1. Configure the SparkConf instance passed to the Spark context, i.e., set "spark.serializer" to "org.apache.spark.serializer.KryoSerializer" and "spark.kryo.registrator" to "xt.HelloKryoRegistrator";
  2. Write a custom Kryo registrator class in which the class ImmutableBytesWritable is registered;
  3. Write a custom serializer for ImmutableBytesWritable.
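
As an aside (not part of the original question): if a class only needs Kryo's default serialization, i.e. no custom Serializer like HelloSerializer above, Spark 1.2+ can also register it directly on the SparkConf without writing a registrator. A minimal sketch:

val conf = new SparkConf()
        .setAppName("Testing HelloSpark")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .registerKryoClasses(Array(classOf[ImmutableBytesWritable]))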

However, when I submit the Spark application in yarn-client mode, the following exception was thrown:

    Exception in thread "main" org.apache.spark.SparkException: Task not serializable
        at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
        at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
        at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
        at org.apache.spark.rdd.RDD.map(RDD.scala:270)
        at xt.HelloSpark$.main(HelloSpark.scala:23)
        at xt.HelloSpark.main(HelloSpark.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:325)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
    Caused by: java.io.NotSerializableException: org.apache.hadoop.hbase.io.ImmutableBytesWritable
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
        at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
        at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
        at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
        at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
        at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
        at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
        at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
        ... 12 more

It seems that Kryo cannot serialize ImmutableBytesWritable. So what is the correct way to let Spark serialize an object with Kryo? Can Kryo serialize any type?

1 Answer:

Answer 0 (score: 1):

This is happening because you are using the ImmutableBytesWritable in your closure. Spark doesn't support closure serialization with Kryo yet (only objects inside RDDs). You can solve your problem with the help of this:

Spark - Task not serializable: How to work with complex map closures that call outside classes/objects?
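
To make that distinction concrete, here is a minimal illustrative sketch (mine, not from the linked answer), reusing the sc and the imports from the question:

// Records *inside* the RDD go through Kryo (with the custom registrator),
// so an RDD[ImmutableBytesWritable] can be shuffled or cached just fine:
val writables = sc.parallelize(1 to 20, 4)
    .map(x => new ImmutableBytesWritable(Bytes.toBytes(x.toString)))

// The closure passed to map, however, is serialized with Java serialization.
// Capturing a non-Serializable object in it fails in the ClosureCleaner
// before any task is even shipped:
val captured = new ImmutableBytesWritable(Bytes.toBytes("This is a test"))
sc.parallelize(1 to 20, 4)
    .map(x => x.toString + Bytes.toString(captured.get)) // Task not serializable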

You just have to serialize the object before passing it through the closure, and deserialize it afterwards. This approach works even if your classes aren't Serializable, because it uses Kryo behind the scenes. All you need is some currying. ;)

Here is a sample sketch:

def genMapper(kryoWrapper: KryoSerializationWrapper[(Foo => Bar)])
             (foo: Foo): Bar = {
    kryoWrapper.value.apply(foo)
}

// Wrap the function object (which carries the non-serializable value)
// in a KryoSerializationWrapper *before* it gets captured by the closure.
val mapper = genMapper(KryoSerializationWrapper(
    new MyFunction(new ImmutableBytesWritable(Bytes.toBytes("This is a test"))))) _
rdd.flatMap(mapper).collectAsMap()

// The real work lives in a Function1, so the whole thing can be wrapped.
class MyFunction(bytes: ImmutableBytesWritable) extends (Foo => Bar) {
    def apply(foo: Foo): Bar = ??? // this is where the real function goes
}
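
Note that KryoSerializationWrapper is not a Spark built-in; it comes from the answer linked above (it originated in the Shark project). For completeness, here is a minimal sketch of what such a wrapper can look like, assuming Spark's KryoSerializer and assuming that new SparkConf() picks up the configured registrator from the submitted properties:

import java.nio.ByteBuffer

import scala.reflect.ClassTag

import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoSerializer

// Hypothetical minimal wrapper: Kryo-serializes the wrapped value eagerly
// on the driver, ships only the resulting byte array through Java
// serialization, and lazily deserializes it again on the executor.
class KryoSerializationWrapper[T: ClassTag](@transient private val obj: T)
        extends Serializable {

    // Computed on the driver; a plain Array[Byte] is Java-serializable.
    private val valueBytes: Array[Byte] = {
        val buf = new KryoSerializer(new SparkConf()).newInstance().serialize(obj)
        val arr = new Array[Byte](buf.remaining())
        buf.get(arr)
        arr
    }

    // Rebuilt at most once per JVM the wrapper is deserialized into.
    @transient private lazy val restored: T =
        new KryoSerializer(new SparkConf()).newInstance()
            .deserialize[T](ByteBuffer.wrap(valueBytes))

    // On the driver obj is still set; on executors it is null (transient).
    def value: T = if (obj != null) obj else restored
}

object KryoSerializationWrapper {
    def apply[T: ClassTag](obj: T): KryoSerializationWrapper[T] =
        new KryoSerializationWrapper(obj)
}

With a wrapper like this, the function object travels as a plain byte array, so the closure built by genMapper only ever captures Serializable state.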