I cannot deserialize protobuf data containing repeated strings using elephant-bird 4.14 with Hive. This appears to be because the repeated-string feature only works with Protobuf 2.6, not with Protobuf 2.5. When I run my Hive query on an AWS EMR cluster, it uses the Protobuf 2.5 bundled with AWS Hive. Even after explicitly adding the Protobuf 2.6 jar, I cannot get rid of this error. I would like to know how to make Hive use the Protobuf 2.6 jar that I explicitly added.
Here is the Hive query used:
add jar s3://gam.test/hive-jars/protobuf-java-2.6.1.jar;
add jar s3://gam.test/hive-jars/GAMDataModel-1.0.jar;
add jar s3://gam.test/hive-jars/GAMCoreModel-1.0.jar;
add jar s3://gam.test/hive-jars/GAMAccessLayer-1.1.jar;
add jar s3://gam.test/hive-jars/RodbHiveStorageHandler-0.12.0-jarjar-final.jar;
add jar s3://gam.test/hive-jars/elephant-bird-core-4.14.jar;
add jar s3://gam.test/hive-jars/elephant-bird-hive-4.14.jar;
add jar s3://gam.test/hive-jars/elephant-bird-hadoop-compat-4.14.jar;
add jar s3://gam.test/hive-jars/protobuf-java-2.6.1.jar;
add jar s3://gam.test/hive-jars/GamProtoBufHiveDeserializer-1.0-jarjar.jar;
drop table GamRelationRodb;
CREATE EXTERNAL TABLE GamRelationRodb
row format serde "com.amazon.hive.serde.GamProtobufDeserializer"
with serdeproperties("serialization.class"=
"com.amazon.gam.rodb.model.RepeatedRelationshipWrapperProto$RepeatedRelationshipWrapper")
STORED BY 'com.amazon.rodb.hadoop.hive.RodbHiveStorageHandler' TBLPROPERTIES
("file.name" = 'GAM_Relationship',"file.path" ='s3://pathtofile/');
select * from GamRelationRodb limit 10;
Here is the format of the Protobuf file:
message RepeatedRelationshipWrapper {
repeated relationship.Relationship relationships = 1;
}
message Relationship {
required RelationshipType type = 1;
repeated string ids = 2;
}
enum RelationshipType {
UKNOWN_RELATIONSHIP_TYPE = 0;
PARENT = 1;
CHILD = 2;
}
Here is the runtime exception thrown when the query runs:
Exception in thread "main" java.lang.NoSuchMethodError: com.google.protobuf.LazyStringList.getUnmodifiableView()Lcom/google/protobuf/LazyStringList;
at com.amazon.gam.model.RelationshipProto$Relationship.<init>(RelationshipProto.java:215)
at com.amazon.gam.model.RelationshipProto$Relationship.<init>(RelationshipProto.java:137)
at com.amazon.gam.model.RelationshipProto$Relationship$1.parsePartialFrom(RelationshipProto.java:239)
at com.amazon.gam.model.RelationshipProto$Relationship$1.parsePartialFrom(RelationshipProto.java:234)
at com.google.protobuf.CodedInputStream.readMessage(CodedInputStream.java:309)
at com.amazon.gam.rodb.model.RepeatedRelationshipWrapperProto$RepeatedRelationshipWrapper.<init>(RepeatedRelationshipWrapperProto.java:126)
at com.amazon.gam.rodb.model.RepeatedRelationshipWrapperProto$RepeatedRelationshipWrapper.<init>(RepeatedRelationshipWrapperProto.java:72)
at com.amazon.gam.rodb.model.RepeatedRelationshipWrapperProto$RepeatedRelationshipWrapper$1.parsePartialFrom(RepeatedRelationshipWrapperProto.java:162)
at com.amazon.gam.rodb.model.RepeatedRelationshipWrapperProto$RepeatedRelationshipWrapper$1.parsePartialFrom(RepeatedRelationshipWrapperProto.java:157)
at com.amazon.gam.rodb.model.RepeatedRelationshipWrapperProto$RepeatedRelationshipWrapper$Builder.mergeFrom(RepeatedRelationshipWrapperProto.java:495)
at com.amazon.gam.rodb.model.RepeatedRelationshipWrapperProto$RepeatedRelationshipWrapper$Builder.mergeFrom(RepeatedRelationshipWrapperProto.java:355)
at com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:337)
at com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:267)
at com.google.protobuf.AbstractMessageLite$Builder.mergeFrom(AbstractMessageLite.java:170)
at com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:882)
at com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:267)
at com.twitter.elephantbird.mapreduce.io.ProtobufConverter.fromBytes(ProtobufConverter.java:66)
at com.twitter.elephantbird.hive.serde.ProtobufDeserializer.deserialize(ProtobufDeserializer.java:59)
at com.amazon.hive.serde.GamProtobufDeserializer.deserialize(GamProtobufDeserializer.java:63)
at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:502)
at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:428)
at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:146)
at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:2098)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:252)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:183)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:399)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:776)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:714)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:641)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Answer 0 (score: 0)
Protobuf is a brittle library. It may be wire-format compatible between 2.x versions, but the classes generated by protoc will only link against a protobuf JAR of exactly the same version as the protoc compiler that generated them.
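One way to confirm which protobuf jar Hive actually resolved is to ask the classloader where a protobuf class came from. This is a diagnostic sketch of my own, not part of the original answer; the `ClasspathProbe`/`whereLoaded` names are illustrative:

```java
// Diagnostic sketch: report which jar (or classpath entry) a class was loaded from.
// Run inside the same JVM as Hive (e.g. from a UDF) to see whether the bundled
// protobuf 2.5 jar or the explicitly added 2.6 jar won the classpath race.
import java.security.CodeSource;

public class ClasspathProbe {
    /** Returns the jar/directory a class was loaded from, or "(bootstrap)" for JDK classes. */
    public static String whereLoaded(Class<?> clazz) {
        CodeSource src = clazz.getProtectionDomain().getCodeSource();
        return src == null ? "(bootstrap)" : src.getLocation().toString();
    }

    public static void main(String[] args) throws Exception {
        // In a real Hive session you would probe the class from the stack trace:
        // System.out.println(whereLoaded(Class.forName("com.google.protobuf.LazyStringList")));
        System.out.println(whereLoaded(ClasspathProbe.class));
    }
}
```

If the printed location is Hadoop's lib directory rather than the jar you passed to `add jar`, the bundled 2.5 classes are shadowing yours.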
Fundamentally, this means you cannot update protobuf except by orchestrating the change across all of your dependencies. That happened once, in the great protobuf upgrade of 2013, when Hadoop, HBase, Hive &c all moved up together; afterwards, everyone froze on version 2.5, probably for the entire life of the Hadoop 2.x codeline, unless everything gets shaded or Java 9 hides the problem.
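Shading is the usual escape hatch mentioned above: relocate protobuf into your own package namespace so your 2.6 copy cannot collide with the 2.5 classes on EMR's classpath. A minimal maven-shade-plugin sketch, under the assumption that your generated classes and elephant-bird are bundled in the same shaded jar (the `shadedPattern` prefix is illustrative):

```xml
<!-- Sketch: relocate protobuf 2.6 inside your own jar so it cannot clash
     with the protobuf 2.5 that EMR's Hadoop/Hive classpath provides. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.2.4</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>com.google.protobuf</pattern>
            <shadedPattern>com.amazon.gam.shaded.com.google.protobuf</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```

The relocation only helps if every consumer of the protobuf classes (the protoc-generated code, elephant-bird, the SerDe) is shaded in the same pass, so their bytecode references are rewritten to the relocated package.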
We are more scared of protobuf updates than of upgrades to Guava and Jackson, because the latter only break every single library, rather than the wire format.
On the topic of a 2.x upgrade, watch HADOOP-13363; for the question of moving to protobuf 3 in Hadoop trunk, see HDFS-11010. The latter is trouble because it really does change the wire format, protobuf-JSON marshalling breaks, and so on.
It is probably best to just conclude that "the binary compatibility of protobuf code was found lacking" and stick with protobuf 2.5. A shame.
You could take the entire stack of libraries you want to use and rebuild them with an updated protoc compiler, a matching protobuf.jar, and whatever other patches you need to apply. I would only recommend that to the bold, but I would be curious about the outcome. If you do try it, let us know how it turns out.
Further reading: fear of dependencies.