I have some raw Protobuf files that I compressed with LZO, and I later created indexes on the resulting files. The Mapper used to generate the LZO files is as follows:
public static class ProtoMap extends Mapper<NullWritable, BytesWritable, Text, ProtobufWritable> {
    private final ProtobufWritable<MyProtoClass> protoWritable =
            ProtobufWritable.newInstance(MyProtoClass.class);

    @Override
    public void map(NullWritable key, BytesWritable value, Context context)
            throws IOException, InterruptedException {
        byte[] contentBytes = value.copyBytes();
        CodedInputStream cis = CodedInputStream.newInstance(contentBytes);
        while (cis.getTotalBytesRead() < contentBytes.length) {
            // Read a varint length, then read the message corresponding to that length
            int len = (int) CodedInputStream.decodeZigZag64(cis.readInt64());
            byte[] data = cis.readRawBytes(len);
            log.debug("Total message length: " + contentBytes.length);
            log.debug("varint value: " + len);
            MyProto.MyProtoClass msg = MyProtoClass.parseFrom(data);
            protoWritable.set(msg);
            log.debug(msg.getUUID());
            context.write(new Text(msg.getUUID()), protoWritable);
        }
    }
}
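For context, the zigzag decoding step above maps the unsigned varint values 0, 1, 2, 3, … back to the signed sequence 0, -1, 1, -2, …. Here is a minimal standalone sketch of that arithmetic (an illustrative reimplementation; the real method is `com.google.protobuf.CodedInputStream.decodeZigZag64`):

```java
public class ZigZagDemo {
    // Decode a zigzag-encoded 64-bit value: 0 -> 0, 1 -> -1, 2 -> 1, 3 -> -2, ...
    static long decodeZigZag64(long n) {
        return (n >>> 1) ^ -(n & 1);
    }

    // The matching encoder, included so the round trip can be checked.
    static long encodeZigZag64(long n) {
        return (n << 1) ^ (n >> 63);
    }

    public static void main(String[] args) {
        System.out.println(decodeZigZag64(0));                    // 0
        System.out.println(decodeZigZag64(1));                    // -1
        System.out.println(decodeZigZag64(2));                    // 1
        System.out.println(decodeZigZag64(encodeZigZag64(300)));  // 300
    }
}
```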
The relevant snippet of the job configuration is:
LzoProtobufB64LineOutputFormat.setClassConf(MyProtoClass.class, HadoopCompat.getConfiguration(job));
job.setOutputFormatClass(LzoProtobufB64LineOutputFormat.class);
LzoProtobufB64LineOutputFormat.setOutputPath(job, new Path(args[2]));
LzoProtobufB64LineOutputFormat.setCompressOutput(job, true);
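As I understand it, `LzoProtobufB64LineOutputFormat` serializes each protobuf message, base64-encodes the bytes, and writes them as one line of text before LZO compression. A rough standalone sketch of that per-record line encoding (using a raw `byte[]` stand-in for a real serialized message; these helper names are illustrative, not elephant-bird API):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class B64LineDemo {
    // Encode serialized message bytes as a single base64 text line,
    // roughly mirroring what a B64Line output format writes per record.
    static String toB64Line(byte[] serializedMessage) {
        return Base64.getEncoder().encodeToString(serializedMessage);
    }

    // Decode one line back to the original serialized bytes.
    static byte[] fromB64Line(String line) {
        return Base64.getDecoder().decode(line);
    }

    public static void main(String[] args) {
        byte[] fakeMessage = "proto-bytes".getBytes(StandardCharsets.UTF_8);
        String line = toB64Line(fakeMessage);
        System.out.println(line);
        System.out.println(new String(fromB64Line(line), StandardCharsets.UTF_8));
    }
}
```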
This produces LZO files that I want to query with Hive. I create the table with the following:
create external table MyTable
row format serde
"com.twitter.elephantbird.hive.serde.LzoProtobufHiveSerde"
with serdeproperties
("serialization.class"="com.mycompany.myclass$MyProtoClass")
STORED AS
-- elephant-bird provides an input format for use with hive
INPUTFORMAT
"com.twitter.elephantbird.mapred.input.DeprecatedLzoTextInputFormat"
-- placeholder as we will not be writing to this table
OUTPUTFORMAT
"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
LOCATION '/path/to/my/lzo/indexed/file';
The table is created successfully, and it also works when I run:
hive> describe formatted MyTable;
However, when I run
hive> select count(*) from MyTable;
I get the following error:
Failed with exception java.io.IOException:java.lang.ClassCastException:
org.apache.hadoop.io.Text cannot be cast to
org.apache.hadoop.io.BytesWritable
Can someone help me figure out what I am doing wrong?