Question

I am hitting an error when using Java protocol buffer classes as the object model for RDDs in Spark jobs.

In my application, the .proto file has properties that are repeated strings. For example:

message OntologyHumanName 
{ 
repeated string family = 1;
}

From this, the 2.5.0 protoc compiler generates Java code like:

private com.google.protobuf.LazyStringList family_ = com.google.protobuf.LazyStringArrayList.EMPTY;

If I run a Scala Spark job that uses the Kryo serializer, the job fails with an error.

The same code works fine with spark.serializer = org.apache.spark.serializer.JavaSerializer.

My environment is CDH QuickStart 5.5 with JDK 1.8.0_60.

Answer 1

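Try registering the Lazy class with Kryo:

Kryo kryo = new Kryo();

kryo.register(com.google.protobuf.LazyStringArrayList.class);

For custom Protobuf messages, also take a look at the solution in the linked answer for registering custom/nested classes generated by protoc.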
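In a Spark job, the Kryo instance is created by Spark itself, so the registration usually has to go through a registrator class rather than a standalone `new Kryo()`. The following is a minimal sketch of that wiring, assuming Spark's `KryoRegistrator` trait and the `spark.kryo.registrator` setting; the names `ProtoKryoRegistrator` and `KryoProtoJob` are hypothetical, and whether registration alone is enough for this particular error has not been verified here.

```scala
import com.esotericsoftware.kryo.Kryo
import com.google.protobuf.LazyStringArrayList
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.serializer.KryoRegistrator

// Hypothetical registrator; Spark instantiates it wherever it creates a Kryo instance.
class ProtoKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    // The list type that protoc 2.5.0 uses for `repeated string` fields.
    kryo.register(classOf[LazyStringArrayList])
  }
}

object KryoProtoJob {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("kryo-proto-example") // placeholder name; master comes from spark-submit
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", classOf[ProtoKryoRegistrator].getName)
    val sc = new SparkContext(conf)
    // ... build and shuffle RDDs containing OntologyHumanName here ...
    sc.stop()
  }
}
```

Generated message classes such as OntologyHumanName could be registered in the same `registerClasses` method if needed.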

Answer 2

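I think your RDD's type contains the class OntologyHumanName, for example RDD[(String, OntologyHumanName)], and that RDD happens to go through a shuffle stage. See https://github.com/EsotericSoftware/kryo#kryoserializable. As far as I know, Kryo cannot serialize abstract classes.

Read the Spark documentation: http://spark.apache.org/docs/latest/tuning.html#data-serialization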

val conf = new SparkConf().setMaster(...).setAppName(...)
conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
val sc = new SparkContext(conf)

From the Kryo documentation:

public class SomeClass implements KryoSerializable {
   // ...

   public void write (Kryo kryo, Output output) {
      // ...
   }

   public void read (Kryo kryo, Input input) {
      // ...
   }
}

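But the class OntologyHumanName is generated automatically by protobuf, so I don't think implementing KryoSerializable on it is a good approach.

Try wrapping OntologyHumanName in a case class to avoid serializing OntologyHumanName directly. I have not tried this, so it may not work.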
```
case class OntologyHumanNameScalaCaseClass(humanNames: OntologyHumanName)
```

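An uglier way: convert the protobuf class into plain Scala values. This may also fail. For example:

import scala.collection.JavaConverters._

val humanNameObj: OntologyHumanName = ...
// asScala returns a mutable Buffer, so copy it into an immutable List
val families: List[String] = humanNameObj.getFamilyList.asScala.toList  // use this in place of humanNameObj

I hope this resolves the problem above.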

Error using Spark's Kryo serializer with Java protocol buffers that have arrays of strings