ClassCastException when deserializing (loading into a Hive table) Parquet data backed by an Avro schema

Date: 2016-03-02 12:20:02

Tags: hive avro parquet

I am trying to serialize CSV data into Parquet format using an Avro schema (Avro-backed) and then read it back into a Hive table.

Serialization succeeds with the following sample snippet (sample code that serializes a single record):

import java.io.File;
import java.io.IOException;
import java.math.BigDecimal;
import java.math.BigInteger;
import java.nio.ByteBuffer;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericData.Record;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroSchemaConverter;
import org.apache.parquet.avro.AvroWriteSupport;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.schema.MessageType;

public class AvroParquetConverter {

    public static void main(String[] args) throws IOException {
        Schema avroSchema = new Schema.Parser().parse(new File("schema.avsc"));
        GenericRecord myrecord = new GenericData.Record(avroSchema);
        String outputFilename = "/home/jai/sample1000-snappy.parquet";
        Path outputPath = new Path(outputFilename);
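        // Convert the Avro schema to a Parquet MessageType and build the writer.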
        MessageType parquetSchema = new AvroSchemaConverter()
                .convert(avroSchema);
        AvroWriteSupport writeSupport = new AvroWriteSupport(parquetSchema,
                avroSchema);
        CompressionCodecName compressionCodecSnappy = CompressionCodecName.SNAPPY;
        int blockSize = 256 * 1024 * 1024;
        int pageSize = 64 * 1024;

        ParquetWriter parquetWriterSnappy = new ParquetWriter(outputPath,
                writeSupport, compressionCodecSnappy, blockSize, pageSize);
        BigDecimal bd = new BigDecimal(20);
        GenericRecord myrecordTemp = new GenericData.Record(avroSchema);
        myrecord.put("name", "Abhijeet1");
        myrecord.put("pid", 20);
        myrecord.put("favorite_number", 22);
        String bd1 = "13.5";
        BigDecimal bdecimal = new BigDecimal(bd1);
        bdecimal.setScale(15, 6);
        BigInteger bi = bdecimal.unscaledValue();
        byte[] barray = bi.toByteArray();
        ByteBuffer byteBuffer = ByteBuffer.allocate(barray.length);
        byteBuffer.put(barray);
        byteBuffer.rewind();
        myrecord.put("price", byteBuffer);
        parquetWriterSnappy.write(myrecord);
        parquetWriterSnappy.close();
    }
}
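
As a side note (my own sketch, not part of the original attempt, assuming parquet-avro 1.8.x on the classpath): reading the file back with AvroParquetReader confirms the records were written, independently of Hive.

import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;

public class ParquetReadCheck {
    public static void main(String[] args) throws Exception {
        // Read the file written above and print each record.
        Path path = new Path("/home/jai/sample1000-snappy.parquet");
        try (ParquetReader<GenericRecord> reader =
                     AvroParquetReader.<GenericRecord>builder(path).build()) {
            GenericRecord record;
            while ((record = reader.read()) != null) {
                System.out.println(record);
            }
        }
    }
}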

I also tried to do the decimal-to-ByteBuffer conversion with the following statement:

ByteBuffer.wrap(bdecimal.unscaledValue().toByteArray());
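For reference, a minimal sketch of the same conversion using the DecimalConversion helper that ships with Avro 1.8 (my own illustration, not part of the original attempt; it assumes the price field uses the decimal logical type with precision 15 and scale 6):

Schema priceSchema = avroSchema.getField("price").schema();
BigDecimal price = new BigDecimal("13.5").setScale(6);  // scale must match the schema's scale (6)
ByteBuffer priceBytes = new org.apache.avro.Conversions.DecimalConversion()
        .toBytes(price, priceSchema, org.apache.avro.LogicalTypes.decimal(15, 6));
myrecord.put("price", priceBytes);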

Below is the Avro schema file:

{
    "namespace": "avropoc",
    "type": "record",
    "name": "User",
    "fields": [
             {"name": "name", "type": "string", "default" : "null"},
             {"name": "favorite_number",  "type": "int", "default": 0 },
             {"name": "pid",  "type":"int", "default" : 0 },
             {"name": "price", "type": {"type" : "bytes","logicalType":"decimal","precision":15,"scale":6}, "default" : 0 }
     ]
}

I also tried the following modification to the schema:

{"name": "price", "type": "bytes","logicalType":"decimal","precision":15,"scale":6, "default" : 0 }

I am creating the Hive table as follows:

create external table avroparquet1
( name string, favorite_number int,
pid int, price DECIMAL(15,6))
STORED AS PARQUET;

But when I run a query against the decimal field price, I get the following error message:

Failed with exception java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.io.BytesWritable cannot be cast to org.apache.hadoop.hive.serde2.io.HiveDecimalWritable

This looks like a Parquet/Avro/Hive related issue where it fails to deserialize decimal values, which Avro requires to be written as a ByteBuffer.

I tried this with Avro 1.8.0, Parquet 1.8.1 & Hive 1.1.0.

Any help would be appreciated.

1 Answer:

Answer 0 (score: 0)

The actual schema that Hive generates for a DECIMAL(22,7) column (inspecting an actual Parquet file with some code derived from that LinkedIn utility) looks like...

  • Parquet syntax: optional fixed_len_byte_array(10) my_dec_22_7;
  • AVRO syntax: { "name":"my_dec_22_7","type":["null",{"type":"fixed", "name":"my_dec_22_7","size":10} ], "default":null }

...where 10 appears to be the number of bytes required to dump a BigInteger holding 22 digits into a byte[]. See, for example, the AvroSerdeUtils source code and the way it dumps a HiveDecimal.
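
A quick sanity check of that size (my own sketch, assuming the fixed field is sized for the largest unscaled value a DECIMAL(22,7) can hold, i.e. twenty-two 9s):

import java.math.BigInteger;

public class DecimalFixedSize {
    public static void main(String[] args) {
        // Largest unscaled value for DECIMAL(22,7): 10^22 - 1, i.e. twenty-two 9s.
        BigInteger maxUnscaled = BigInteger.TEN.pow(22).subtract(BigInteger.ONE);
        // Minimal two's-complement encoding, as produced by BigInteger.toByteArray().
        System.out.println(maxUnscaled.toByteArray().length);  // prints 10
    }
}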

That being said, I really don't know how to read/write DECIMAL in Parquet files. DOUBLE and BIGINT are much easier to deal with, since they are backed by IEEE standard types (as well as AVRO standard types and Parquet standard types).