Hive on LZO-compressed and indexed Protobuf files

Date: 2016-12-12 19:24:43

Tags: hive lzo elephantbird hadoop-lzo

I have some raw Protobuf files that I compressed with LZO and then indexed. The Mapper used to generate the LZO files is as follows:

public static class ProtoMap extends Mapper<NullWritable, BytesWritable, Text, ProtobufWritable<MyProtoClass>> {
    // Reused for every record to avoid allocating a new writable per message
    private final ProtobufWritable<MyProtoClass> protoWritable = ProtobufWritable.newInstance(MyProtoClass.class);

    @Override
    public void map(NullWritable key, BytesWritable value, Context context) throws IOException, InterruptedException {
        byte[] contentBytes = value.copyBytes();
        CodedInputStream cis = CodedInputStream.newInstance(contentBytes);
        while (cis.getTotalBytesRead() < contentBytes.length) {

            // Read a zigzag-encoded varint length, then read the message of that length
            int len = (int) CodedInputStream.decodeZigZag64(cis.readInt64());
            byte[] data = cis.readRawBytes(len);

            log.debug("Total message length: " + contentBytes.length);
            log.debug("varint value: " + len);

            MyProto.MyProtoClass msg = MyProtoClass.parseFrom(data);
            protoWritable.set(msg);

            // Emit the message keyed by its UUID
            log.debug(msg.getUUID());
            context.write(new Text(msg.getUUID()), protoWritable);
        }
    }
}
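To make the framing that the mapper relies on concrete, here is a minimal sketch in plain Java with no protobuf dependency. It assumes, as the mapper does, that each record is preceded by a zigzag-encoded varint length. `ZigZagFraming` and `split` are hypothetical names used only for illustration; the real job uses `CodedInputStream` for the same steps.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of protobuf or elephant-bird.
final class ZigZagFraming {

    // Same mapping as CodedInputStream.decodeZigZag64: 0 -> 0, 1 -> -1, 2 -> 1, 3 -> -2, ...
    static long decodeZigZag64(long n) {
        return (n >>> 1) ^ -(n & 1);
    }

    // Splits a buffer into payloads, each preceded by a zigzag-encoded varint length.
    static List<byte[]> split(byte[] buf) {
        List<byte[]> records = new ArrayList<>();
        int pos = 0;
        while (pos < buf.length) {
            // Decode one base-128 varint, least significant 7-bit group first.
            long raw = 0;
            int shift = 0;
            while (true) {
                if (pos >= buf.length || shift >= 64) {
                    throw new IllegalArgumentException("malformed varint");
                }
                byte b = buf[pos++];
                raw |= (long) (b & 0x7F) << shift;
                if ((b & 0x80) == 0) {
                    break; // high bit clear: last byte of the varint
                }
                shift += 7;
            }
            long len = decodeZigZag64(raw);
            if (len < 0 || len > buf.length - pos) {
                throw new IllegalArgumentException("bad record length: " + len);
            }
            records.add(Arrays.copyOfRange(buf, pos, pos + (int) len));
            pos += (int) len;
        }
        return records;
    }

    public static void main(String[] args) {
        // Two records: length 3 (zigzag 0x06) + "abc", then length 1 (zigzag 0x02) + "z"
        byte[] input = {0x06, 'a', 'b', 'c', 0x02, 'z'};
        for (byte[] record : split(input)) {
            System.out.println(record.length + " byte(s)");
        }
    }
}
```

In the mapper, each payload returned by a step like `split` is what gets passed to `MyProtoClass.parseFrom`.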

The relevant job configuration in run() is as follows:

LzoProtobufB64LineOutputFormat.setClassConf(MyProtoClass.class, HadoopCompat.getConfiguration(job));
job.setOutputFormatClass(LzoProtobufB64LineOutputFormat.class);
LzoProtobufB64LineOutputFormat.setOutputPath(job, new Path(args[2]));
LzoProtobufB64LineOutputFormat.setCompressOutput(job, true);

This produces LZO files, which I want to query with Hive. I create the table with the following:

create external table MyTable
row format serde
"com.twitter.elephantbird.hive.serde.LzoProtobufHiveSerde"
with serdeproperties
("serialization.class"="com.mycompany.myclass$MyProtoClass")
STORED AS
-- elephant-bird provides an input format for use with hive
INPUTFORMAT
"com.twitter.elephantbird.mapred.input.DeprecatedLzoTextInputFormat"
-- placeholder as we will not be writing to this table
OUTPUTFORMAT
"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
LOCATION '/path/to/my/lzo/indexed/file';

The table is created successfully, and the following also works:

hive> describe formatted MyTable;

However, when I run

hive> select count(*) from MyTable;

I get the following error:

Failed with exception java.io.IOException:java.lang.ClassCastException:
org.apache.hadoop.io.Text cannot be cast to
org.apache.hadoop.io.BytesWritable

Can someone help me figure out what I am doing wrong?
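One possible reading of the error, offered as an assumption rather than a confirmed diagnosis: a text-oriented input format hands each line to the SerDe as a `Text` value, while the SerDe casts its input to `BytesWritable`. Since `LzoProtobufB64LineOutputFormat` stores each message as one Base64-encoded line, the raw message bytes only become available after that line is decoded. The sketch below shows that decoding step using only the JDK. `B64LineDemo` and `lineToMessageBytes` are hypothetical names, and the sample bytes are placeholders, not a real `MyProtoClass` message.

```java
import java.util.Arrays;
import java.util.Base64;

// Hypothetical helper for illustration; not part of elephant-bird or Hive.
final class B64LineDemo {

    // Turns one Base64 text line into the raw bytes of the serialized message.
    static byte[] lineToMessageBytes(String line) {
        return Base64.getDecoder().decode(line.trim());
    }

    public static void main(String[] args) {
        // Placeholder bytes standing in for a serialized protobuf message
        byte[] original = {0x0A, 0x03, 'a', 'b', 'c'};

        // What a B64Line-style writer would store as one line of text
        String line = Base64.getEncoder().encodeToString(original);

        // What a reader must do before handing bytes to a protobuf parser
        byte[] decoded = lineToMessageBytes(line + "\n");
        System.out.println("round trip ok: " + Arrays.equals(original, decoded));
    }
}
```

If this reading is right, the mismatch is between the value type the input format emits and the type the SerDe expects, so the thing to check is whether the input format and SerDe named in the table definition are meant to be used together.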

0 answers:

No answers yet.