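I'm trying to query some sample protobuf data with elephant-bird. I'm using the AddressBook example: I serialized some fake AddressBooks to files and put them in HDFS under /user/foo/data/elephant-bird/addressbooks/. The query returns no results.
I set up the table and query like this: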
    add jar /home/foo/downloads/elephant-bird/hadoop-compat/target/elephant-bird-hadoop-compat-4.6-SNAPSHOT.jar;
    add jar /home/foo/downloads/elephant-bird/core/target/elephant-bird-core-4.6-SNAPSHOT.jar;
    add jar /home/foo/downloads/elephant-bird/hive/target/elephant-bird-hive-4.6-SNAPSHOT.jar;

    create external table addresses
      row format serde "com.twitter.elephantbird.hive.serde.ProtobufDeserializer"
      with serdeproperties (
        "serialization.class"="com.twitter.data.proto.tutorial.AddressBookProtos$AddressBook")
      STORED AS
        -- elephant-bird provides an input format for use with hive
        INPUTFORMAT "com.twitter.elephantbird.mapred.input.DeprecatedRawMultiInputFormat"
        -- placeholder as we will not be writing to this table
        OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
      LOCATION '/user/foo/data/elephant-bird/addressbooks/';

    describe formatted addresses;

    OK
    # col_name              data_type               comment

    person                  array<struct<name:string,id:int,email:string,phone:array<struct<number:string,type:string>>>>    from deserializer
    byteData                binary                  from deserializer

    # Detailed Table Information
    Database:               default
    Owner:                  foo
    CreateTime:             Tue Oct 28 13:49:53 PDT 2014
    LastAccessTime:         UNKNOWN
    Protect Mode:           None
    Retention:              0
    Location:               hdfs://foo:8020/user/foo/data/elephant-bird/addressbooks
    Table Type:             EXTERNAL_TABLE
    Table Parameters:
        EXTERNAL                TRUE
        transient_lastDdlTime   1414529393

    # Storage Information
    SerDe Library:          com.twitter.elephantbird.hive.serde.ProtobufDeserializer
    InputFormat:            com.twitter.elephantbird.mapred.input.DeprecatedRawMultiInputFormat
    OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
    Compressed:             No
    Num Buckets:            -1
    Bucket Columns:         []
    Sort Columns:           []
    Storage Desc Params:
        serialization.class     com.twitter.data.proto.tutorial.AddressBookProtos$AddressBook
        serialization.format    1
    Time taken: 0.421 seconds, Fetched: 29 row(s)
When I try to select data, it returns no results (it doesn't appear to read any rows):
    select count(*) from addresses;

    Total MapReduce jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks determined at compile time: 1
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapred.reduce.tasks=<number>
    Starting Job = job_1413311929339_0061, Tracking URL = http://foo:8088/proxy/application_1413311929339_0061/
    Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1413311929339_0061
    Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 1
    2014-10-28 13:50:37,674 Stage-1 map = 0%, reduce = 0%
    2014-10-28 13:50:51,055 Stage-1 map = 0%, reduce = 100%, Cumulative CPU 2.14 sec
    2014-10-28 13:50:52,152 Stage-1 map = 0%, reduce = 100%, Cumulative CPU 2.14 sec
    MapReduce Total cumulative CPU time: 2 seconds 140 msec
    Ended Job = job_1413311929339_0061
    MapReduce Jobs Launched:
    Job 0: Reduce: 1   Cumulative CPU: 2.14 sec   HDFS Read: 0 HDFS Write: 2 SUCCESS
    Total MapReduce CPU Time Spent: 2 seconds 140 msec
    OK
    0
    Time taken: 37.519 seconds, Fetched: 1 row(s)
I see the same thing if I create a non-external table, or if I explicitly import the data into the external table.
Version info for my setup:
    Thrift 0.7
    protobuf: libprotoc 2.5.0
    hadoop: Hadoop 2.5.0-cdh5.2.0
    Subversion http://github.com/cloudera/hadoop -r e1f20a08bde76a33b79df026d00a0c91b2298387
    Compiled by jenkins on 2014-10-11T21:00Z
    Compiled with protoc 2.5.0
    From source with checksum 309bccd135b199bdfdd6df5f3f4153d
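Update:
I see this error in the logs. My data in HDFS is just raw protobuf (no compression). I wonder whether that is the problem, and whether raw binary protobuf can be read at all.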
    Error: java.io.IOException: java.lang.reflect.InvocationTargetException
        at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
        at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
        at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:346)
        at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.<init>(HadoopShimsSecure.java:293)
        at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileInputFormatShim.getRecordReader(HadoopShimsSecure.java:407)
        at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:560)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:168)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:409)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
    Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:332)
        ... 11 more
    Caused by: java.io.IOException: No codec for file hdfs://foo:8020/user/foo/data/elephantbird/addressbooks/1000AddressBooks-1684394246.bin found
        at com.twitter.elephantbird.mapreduce.input.MultiInputFormat.determineFileFormat(MultiInputFormat.java:176)
        at com.twitter.elephantbird.mapreduce.input.MultiInputFormat.createRecordReader(MultiInputFormat.java:88)
        at com.twitter.elephantbird.mapreduce.input.RawMultiInputFormat.createRecordReader(RawMultiInputFormat.java:36)
        at com.twitter.elephantbird.mapred.input.DeprecatedInputFormatWrapper$RecordReaderWrapper.<init>(DeprecatedInputFormatWrapper.java:256)
        at com.twitter.elephantbird.mapred.input.DeprecatedInputFormatWrapper.getRecordReader(DeprecatedInputFormatWrapper.java:121)
        at com.twitter.elephantbird.mapred.input.DeprecatedFileInputFormatWrapper.getRecordReader(DeprecatedFileInputFormatWrapper.java:55)
        at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.<init>(CombineHiveRecordReader.java:65)
        ... 16 more
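The "No codec for file ... .bin found" message suggests that the input format looks up a compression codec for each file before reading it. Hadoop's CompressionCodecFactory resolves codecs from the filename suffix, so a name ending in .bin would match nothing. The Python sketch below only imitates that kind of lookup to show how the failure comes about. The suffix table and the helpers find_codec and open_for_reading are hypothetical names for illustration, not Hadoop or elephant-bird code.

```python
# Illustrative only: imitates a suffix-based codec lookup.
# This table is an assumption for the sketch, not Hadoop's actual registry.
SUFFIX_TO_CODEC = {
    ".gz": "GzipCodec",
    ".bz2": "BZip2Codec",
    ".deflate": "DefaultCodec",
    ".lzo": "LzopCodec",
}


def find_codec(path):
    """Return the codec name whose suffix matches the path, or None."""
    for suffix, codec in SUFFIX_TO_CODEC.items():
        if path.endswith(suffix):
            return codec
    return None


def open_for_reading(path):
    """Fail the way the log above does when no codec matches the file name."""
    codec = find_codec(path)
    if codec is None:
        raise IOError("No codec for file " + path + " found")
    return codec
```

Under these assumptions, a file such as 1000AddressBooks-1684394246.bin is rejected before any record is read, which would be consistent with the job reporting 0 rows.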
Answer 0 (score: 0)
Did you solve this problem?
I have the same problem you describe.
And yes, you are right: I found that raw binary protobuf cannot be read directly.
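One general reason raw protobuf files are hard to read back is that the protobuf wire format is not self-delimiting: messages concatenated into one file carry no record boundaries. A common workaround is to frame each serialized message, for example with a length prefix. The Python sketch below shows that idea on opaque byte strings. The fixed 4-byte prefix and the helper names write_framed and read_framed are assumptions chosen for illustration, and this is not the container format elephant-bird expects.

```python
import io
import struct


def write_framed(stream, records):
    """Write each record as a 4-byte big-endian length followed by its bytes."""
    for record in records:
        stream.write(struct.pack(">I", len(record)))
        stream.write(record)


def read_framed(stream):
    """Yield records back by reading a length prefix, then that many bytes."""
    while True:
        header = stream.read(4)
        if not header:
            return  # clean end of stream
        if len(header) < 4:
            raise ValueError("truncated length prefix")
        (length,) = struct.unpack(">I", header)
        payload = stream.read(length)
        if len(payload) < length:
            raise ValueError("truncated record")
        yield payload


# Usage: the byte strings stand in for serialized protobuf messages.
buf = io.BytesIO()
write_framed(buf, [b"alice", b"bob"])
buf.seek(0)
records = list(read_framed(buf))
```

Without the prefixes, a reader given the same bytes would have no way to tell where one message ends and the next begins.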
Here is the question I asked: Use elephant-bird with hive to read protobuf data
Hope that helps.
Good luck.