I am trying to serialize CSV data into Parquet format using an Avro schema (Avro-backed) and then read it back into a Hive table.
Serialization succeeds with the following sample code (sample code that serializes a single record):
import java.io.File;
import java.io.IOException;
import java.math.BigDecimal;
import java.math.BigInteger;
import java.nio.ByteBuffer;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericData.Record;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroSchemaConverter;
import org.apache.parquet.avro.AvroWriteSupport;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.schema.MessageType;

public class AvroParquetConverter {

    public static void main(String[] args) throws IOException {
        Schema avroSchema = new Schema.Parser().parse(new File("schema.avsc"));
        GenericRecord myrecord = new GenericData.Record(avroSchema);
        String outputFilename = "/home/jai/sample1000-snappy.parquet";
        Path outputPath = new Path(outputFilename);

        // Convert the Avro schema to a Parquet schema and build a Snappy-compressed writer.
        MessageType parquetSchema = new AvroSchemaConverter()
                .convert(avroSchema);
        AvroWriteSupport writeSupport = new AvroWriteSupport(parquetSchema,
                avroSchema);
        CompressionCodecName compressionCodecSnappy = CompressionCodecName.SNAPPY;
        int blockSize = 256 * 1024 * 1024;
        int pageSize = 64 * 1024;
        ParquetWriter parquetWriterSnappy = new ParquetWriter(outputPath,
                writeSupport, compressionCodecSnappy, blockSize, pageSize);

        BigDecimal bd = new BigDecimal(20);
        GenericRecord myrecordTemp = new GenericData.Record(avroSchema);
        myrecord.put("name", "Abhijeet1");
        myrecord.put("pid", 20);
        myrecord.put("favorite_number", 22);

        // Write the decimal field as its unscaled value wrapped in a ByteBuffer.
        String bd1 = "13.5";
        BigDecimal bdecimal = new BigDecimal(bd1);
        bdecimal.setScale(15, 6);
        BigInteger bi = bdecimal.unscaledValue();
        byte[] barray = bi.toByteArray();
        ByteBuffer byteBuffer = ByteBuffer.allocate(barray.length);
        byteBuffer.put(barray);
        byteBuffer.rewind();
        myrecord.put("price", byteBuffer);

        parquetWriterSnappy.write(myrecord);
        parquetWriterSnappy.close();
    }
}
The decimal-to-ByteBuffer conversion was also attempted with the following statement:
ByteBuffer.wrap(bdecimal.unscaledValue().toByteArray());
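As a side note, BigDecimal is immutable, so the bdecimal.setScale(15, 6) call in the snippet above discards its result. A minimal sketch of the conversion with the schema's scale of 6 actually applied (a hypothetical helper class, not part of the original code):

import java.math.BigDecimal;
import java.math.RoundingMode;
import java.nio.ByteBuffer;

public class DecimalToBytes {
    public static void main(String[] args) {
        // BigDecimal is immutable: setScale() returns a new instance, so the result must be reassigned.
        BigDecimal price = new BigDecimal("13.5").setScale(6, RoundingMode.HALF_UP); // 13.500000
        // Avro's decimal logical type stores the unscaled value as a
        // big-endian two's-complement byte array.
        ByteBuffer buf = ByteBuffer.wrap(price.unscaledValue().toByteArray());
        System.out.println(buf.remaining() + " bytes for unscaled value " + price.unscaledValue());
    }
}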
Here is the Avro schema file:
{
  "namespace": "avropoc",
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string", "default": "null"},
    {"name": "favorite_number", "type": "int", "default": 0},
    {"name": "pid", "type": "int", "default": 0},
    {"name": "price", "type": {"type": "bytes", "logicalType": "decimal", "precision": 15, "scale": 6}, "default": 0}
  ]
}
A modification of the schema was also tried:
{"name": "price", "type": "bytes","logicalType":"decimal","precision":15,"scale":6, "default" : 0 }
I am creating the Hive table as follows:
create external table avroparquet1
( name string, favorite_number int,
pid int, price DECIMAL(15,6))
STORED AS PARQUET;
But when I run a query against the decimal field price, I get the following error message:
Failed with exception java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.io.BytesWritable cannot be cast to org.apache.hadoop.hive.serde2.io.HiveDecimalWritable
This looks like a Parquet/Avro/Hive issue: Hive cannot deserialize the decimal values, which Avro requires to be written as a ByteBuffer.
I tried this with Avro 1.8.0, Parquet 1.8.1 and Hive 1.1.0.
Any help would be appreciated.
Answer 0 (score: 0):
The actual schema that Hive generates for a DECIMAL(22,7) column - checked by inspecting the actual Parquet file with some code derived from that LinkedIn utility - looks like this...
optional fixed_len_byte_array(10) my_dec_22_7;
{ "name":"my_dec_22_7","type":["null",{"type":"fixed", "name":"my_dec_22_7","size":10} ], "default":null }
...where 10 appears to be the number of bytes required to dump a BigInteger holding 22 digits into a byte[]. See, for example, the AvroSerdeUtils source code and the way it dumps a HiveDecimal.
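For illustration only (a rough sketch of the idea, not the actual AvroSerdeUtils code): the unscaled two's-complement bytes of the decimal have to be sign-extended to exactly the declared size before they fit a fixed(10) / fixed_len_byte_array(10) field, roughly like this:

import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.Arrays;

public class FixedDecimalBytes {

    // Sign-extends the unscaled two's-complement value of a decimal to exactly
    // `size` bytes, which is what an Avro "fixed" field of that size (and the
    // corresponding Parquet fixed_len_byte_array column) expects.
    static byte[] toFixedBytes(BigDecimal value, int scale, int size) {
        byte[] unscaled = value.setScale(scale, RoundingMode.HALF_UP)
                               .unscaledValue()
                               .toByteArray();
        if (unscaled.length > size) {
            throw new IllegalArgumentException("value does not fit in " + size + " bytes");
        }
        byte[] fixed = new byte[size];
        // Pad with 0xFF for negative values, 0x00 otherwise (two's-complement sign extension).
        Arrays.fill(fixed, 0, size - unscaled.length, (byte) (value.signum() < 0 ? 0xFF : 0x00));
        System.arraycopy(unscaled, 0, fixed, size - unscaled.length, unscaled.length);
        return fixed;
    }

    public static void main(String[] args) {
        // A DECIMAL(22,7) value stored in a fixed(10) field, as in the schema above.
        byte[] b = toFixedBytes(new BigDecimal("123456789012345.1234567"), 7, 10);
        System.out.println(b.length); // 10
    }
}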
That being said, I honestly don't know how to read or write DECIMAL in Parquet files. DOUBLE and BIGINT are much easier to deal with, since they are backed by IEEE-standard types (as well as by standard Avro and Parquet types).