Unable to deserialize Protobuf (2.6.1) data with elephant-bird and Hive on AWS

Date: 2017-03-31 12:07:33

Tags: amazon-web-services hadoop hive protocol-buffers elephantbird

I am unable to deserialize protobuf data containing repeated strings using elephant-bird 4.14 with Hive. This seems to be because the repeated string feature only works with Protobuf 2.6 and not with Protobuf 2.5. When my Hive query runs on an AWS EMR cluster, it uses the Protobuf 2.5 that is bundled with AWS Hive. Even after explicitly adding the Protobuf 2.6 jar, I cannot get rid of this error. I would like to know how to make Hive use the Protobuf 2.6 jar that I explicitly added.

Here is the Hive query that was used:

    add jar s3://gam.test/hive-jars/protobuf-java-2.6.1.jar;
    add jar s3://gam.test/hive-jars/GAMDataModel-1.0.jar;
    add jar s3://gam.test/hive-jars/GAMCoreModel-1.0.jar;
    add jar s3://gam.test/hive-jars/GAMAccessLayer-1.1.jar;
    add jar s3://gam.test/hive-jars/RodbHiveStorageHandler-0.12.0-jarjar-final.jar;
    add jar s3://gam.test/hive-jars/elephant-bird-core-4.14.jar;
    add jar s3://gam.test/hive-jars/elephant-bird-hive-4.14.jar;
    add jar s3://gam.test/hive-jars/elephant-bird-hadoop-compat-4.14.jar;
    add jar s3://gam.test/hive-jars/protobuf-java-2.6.1.jar;
    add jar s3://gam.test/hive-jars/GamProtoBufHiveDeserializer-1.0-jarjar.jar;
    drop table GamRelationRodb;

    CREATE EXTERNAL TABLE GamRelationRodb
    row format serde "com.amazon.hive.serde.GamProtobufDeserializer"
    with serdeproperties("serialization.class"= 
 "com.amazon.gam.rodb.model.RepeatedRelationshipWrapperProto$RepeatedRelationshipWrapper")
    STORED BY 'com.amazon.rodb.hadoop.hive.RodbHiveStorageHandler' TBLPROPERTIES 
    ("file.name" = 'GAM_Relationship',"file.path" ='s3://pathtofile/');

    select * from GamRelationRodb limit 10;

Here is the format of the Protobuf file:

    message RepeatedRelationshipWrapper {
        repeated relationship.Relationship relationships = 1;
    }

    message Relationship {
        required RelationshipType type = 1;
        repeated string ids = 2;
    }

    enum RelationshipType {
        UKNOWN_RELATIONSHIP_TYPE = 0;
        PARENT = 1;
        CHILD = 2;
    }

Here is the runtime exception thrown when running the query:

    Exception in thread "main" java.lang.NoSuchMethodError: com.google.protobuf.LazyStringList.getUnmodifiableView()Lcom/google/protobuf/LazyStringList;
    at com.amazon.gam.model.RelationshipProto$Relationship.<init>(RelationshipProto.java:215)
    at com.amazon.gam.model.RelationshipProto$Relationship.<init>(RelationshipProto.java:137)
    at com.amazon.gam.model.RelationshipProto$Relationship$1.parsePartialFrom(RelationshipProto.java:239)
    at com.amazon.gam.model.RelationshipProto$Relationship$1.parsePartialFrom(RelationshipProto.java:234)
    at com.google.protobuf.CodedInputStream.readMessage(CodedInputStream.java:309)
    at com.amazon.gam.rodb.model.RepeatedRelationshipWrapperProto$RepeatedRelationshipWrapper.<init>(RepeatedRelationshipWrapperProto.java:126)
    at com.amazon.gam.rodb.model.RepeatedRelationshipWrapperProto$RepeatedRelationshipWrapper.<init>(RepeatedRelationshipWrapperProto.java:72)
    at com.amazon.gam.rodb.model.RepeatedRelationshipWrapperProto$RepeatedRelationshipWrapper$1.parsePartialFrom(RepeatedRelationshipWrapperProto.java:162)
    at com.amazon.gam.rodb.model.RepeatedRelationshipWrapperProto$RepeatedRelationshipWrapper$1.parsePartialFrom(RepeatedRelationshipWrapperProto.java:157)
    at com.amazon.gam.rodb.model.RepeatedRelationshipWrapperProto$RepeatedRelationshipWrapper$Builder.mergeFrom(RepeatedRelationshipWrapperProto.java:495)
    at com.amazon.gam.rodb.model.RepeatedRelationshipWrapperProto$RepeatedRelationshipWrapper$Builder.mergeFrom(RepeatedRelationshipWrapperProto.java:355)
    at com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:337)
    at com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:267)
    at com.google.protobuf.AbstractMessageLite$Builder.mergeFrom(AbstractMessageLite.java:170)
    at com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:882)
    at com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:267)
    at com.twitter.elephantbird.mapreduce.io.ProtobufConverter.fromBytes(ProtobufConverter.java:66)
    at com.twitter.elephantbird.hive.serde.ProtobufDeserializer.deserialize(ProtobufDeserializer.java:59)
    at com.amazon.hive.serde.GamProtobufDeserializer.deserialize(GamProtobufDeserializer.java:63)
    at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:502)
    at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:428)
    at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:146)
    at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:2098)
    at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:252)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:183)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:399)
    at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:776)
    at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:714)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:641)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
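
One way to confirm which protobuf jar the failing class is actually being resolved from is a small reflection probe run on the same classpath as the Hive CLI. The following is only a minimal, hypothetical sketch (the class and method names come from the stack trace above; everything else, including the probe class name, is illustrative):

    import java.lang.reflect.Method;

    // Minimal classpath probe: run it with the same classpath as the Hive CLI
    // to see which protobuf jar is on it and whether the 2.6-only method exists.
    public class ProtobufVersionProbe {
        public static void main(String[] args) throws Exception {
            Class<?> lazyStringList = Class.forName("com.google.protobuf.LazyStringList");

            // The jar the class was loaded from.
            System.out.println("Loaded from: "
                    + lazyStringList.getProtectionDomain().getCodeSource().getLocation());

            // getUnmodifiableView() exists in protobuf-java 2.6.x but not in 2.5.0;
            // it is exactly the method the NoSuchMethodError above complains about.
            try {
                Method m = lazyStringList.getMethod("getUnmodifiableView");
                System.out.println("Found " + m + " -> runtime looks like 2.6+");
            } catch (NoSuchMethodException e) {
                System.out.println("getUnmodifiableView() missing -> runtime is 2.5 or older");
            }
        }
    }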

1 Answer:

Answer 0 (score: 0):

Protobuf is a brittle library. It may be wire-format compatible between 2.x versions, but the classes generated by protoc will only link against a protobuf JAR whose version exactly matches the protoc compiler that generated them.
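
For this particular failure, that linkage problem is visible in the stack trace: code generated by protoc 2.6 freezes repeated string fields by calling LazyStringList.getUnmodifiableView(), a method that exists in the protobuf-java 2.6 runtime but not in 2.5.0. Here is a stripped-down sketch of that kind of call (purely illustrative, not the actual generated code):

    import com.google.protobuf.LazyStringArrayList;
    import com.google.protobuf.LazyStringList;

    // Illustrative only: roughly the call that 2.6-generated message constructors
    // make when freezing a `repeated string` field. This compiles against
    // protobuf-java 2.6.1 but throws NoSuchMethodError when protobuf-java 2.5.0
    // is the runtime on the classpath.
    public class RepeatedStringFreeze {
        public static void main(String[] args) {
            LazyStringList ids = new LazyStringArrayList();
            ids.add("parent-1");
            ids.add("child-2");

            // Present in 2.6.x, absent from 2.5.0.
            LazyStringList frozen = ids.getUnmodifiableView();
            System.out.println(frozen.size());
        }
    }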

Fundamentally, this means you cannot update protobuf unless you choreograph it across all of your dependencies. The Great Protobuf upgrade in 2013 was when Hadoop, HBase, Hive etc. upgraded; after that, everyone froze at v2.5, probably for the entire life of the Hadoop 2.x codeline, unless it all gets shaded or Java 9 hides the problem.

We fear protobuf updates more than upgrades of Guava and Jackson, since the latter only break every library, not the wire format.

On the topic of a 2.x upgrade, watch HADOOP-13363; for the question of moving to protobuf 3 in Hadoop trunk, see HDFS-11010. That one is troublesome because it does change the wire format, protobuf-JSON marshalling breaks, and so on.

It is best simply to conclude that "binary compatibility of protobuf code is found to be lacking" and stick with protobuf 2.5. Sorry.

You could take the whole stack of libraries you want to use and rebuild them with a newer protoc compiler, a matching protobuf.jar, and whatever other patches you need to apply. I would only recommend this to the bold, but I am curious about the outcome. If you do try it, let us know how it works out.

Further reading: fear of dependencies