I am trying to read Parquet data (parquet-mr 1.8.1, https://github.com/apache/parquet-mr) whose schema contains a record nested inside a wrapper record, which is itself nested inside an array. E.g.:
{
  "type": "record",
  "name": "record",
  "fields": [
    {
      "name": "elements",
      "type": {
        "type": "array",
        "items": {
          "type": "record",
          "name": "elementWrapper",
          "fields": [
            {
              "name": "array_element",
              "type": {
                "type": "record",
                "name": "element",
                "namespace": "test",
                "fields": [
                  {
                    "name": "someField",
                    "type": "int"
                  }
                ]
              }
            }
          ]
        }
      }
    }
  ]
}
When reading a Parquet file with the above schema using ParquetFileReader, I can see that the file has the following schema, which looks correct:
message record {
  required group elements (LIST) {
    repeated group array {
      required group array_element {
        required int32 someField;
      }
    }
  }
}
However, when trying to read records from this file through the Avro interface (see below), I get an InvalidRecordException:
final ParquetReader<GenericRecord> parquetReader = AvroParquetReader.<GenericRecord>builder(path).build();
final GenericRecord read = parquetReader.read();
Stepping through the code, it looks like the field "someField" is not in scope when the record is converted to Avro; only the fields at the top level of the schema are in scope.
Is this schema expected to be unsupported by Avro Parquet, or is this a bug in AvroRecordConverter?
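In case it helps, below is an untested sketch of a possible workaround, assuming re-writing the data is an option. The legacy 2-level list layout that parquet-avro writes by default (the `repeated group array` in the file schema above) is ambiguous for single-field element records like elementWrapper, so writing the standard 3-level LIST structure instead, via the `parquet.avro.write-old-list-structure` property (exposed as AvroWriteSupport.WRITE_OLD_LIST_STRUCTURE), may let AvroParquetReader resolve the element record correctly. This is a guess at a mitigation, not a confirmed fix for the exception:

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.avro.AvroWriteSupport;
import org.apache.parquet.hadoop.ParquetWriter;

public class ListStructureWorkaround {

    // Sketch: build a writer that emits the standard 3-level Parquet LIST
    // structure instead of the legacy 2-level one, which may avoid the
    // element-unwrapping ambiguity when the file is read back through Avro.
    public static ParquetWriter<GenericRecord> newWriter(Path path, Schema schema)
            throws Exception {
        Configuration conf = new Configuration();
        // Default is true (legacy 2-level lists); false writes the 3-level
        // LIST structure from the Parquet format spec.
        conf.setBoolean(AvroWriteSupport.WRITE_OLD_LIST_STRUCTURE, false);
        return AvroParquetWriter.<GenericRecord>builder(path)
                .withSchema(schema)
                .withConf(conf)
                .build();
    }
}
```

This only changes how new files are laid out; it would not help with files that have already been written with the 2-level structure.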
Thanks!
Stack trace:
org.apache.parquet.io.InvalidRecordException: Parquet/Avro schema mismatch: Avro field 'someField' not found
    at org.apache.parquet.avro.AvroRecordConverter.getAvroField(AvroRecordConverter.java:220)
    at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:125)
    at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:274)
    at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:227)
    at org.apache.parquet.avro.AvroRecordConverter.access$100(AvroRecordConverter.java:73)
    at org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter$ElementConverter.<init>(AvroRecordConverter.java:531)
    at org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter.<init>(AvroRecordConverter.java:481)
    at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:284)
    at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:136)
    at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:90)
    at org.apache.parquet.avro.AvroRecordMaterializer.<init>(AvroRecordMaterializer.java:33)
    at org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:132)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:175)
    at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:149)
    at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:125)
Below is the complete code that creates a Parquet file with this schema and then fails to read it:
@Test
@SneakyThrows
public void canReadWithNestedArray() {
    final Path path = new Path("test-resources/" + UUID.randomUUID());

    // Construct a record that defines the final nested value we can't read
    final Schema element = Schema.createRecord("element", null, "test", false);
    element.setFields(Arrays.asList(new Schema.Field("someField", Schema.create(Schema.Type.INT), null, null)));

    // Create a wrapper for the above nested record
    final Schema elementWrapper = Schema.createRecord("elementWrapper", null, null, false);
    elementWrapper.setFields(Arrays.asList(new Schema.Field("array_element", element, null, null)));

    // Create the top-level field that contains an array of wrapped records
    final Schema.Field topLevelArrayOfWrappers = new Schema.Field("elements", Schema.createArray(elementWrapper), null, null);
    final Schema topLevelElement = Schema.createRecord("record", null, null, false);
    topLevelElement.setFields(Arrays.asList(topLevelArrayOfWrappers));

    final GenericRecord genericRecord = new GenericData.Record(topLevelElement);

    // Create the element
    final GenericData.Record recordValue = new GenericData.Record(element);
    recordValue.put("someField", 5);

    // Create the array element: a wrapper containing the above element
    final GenericData.Record wrapperValue = new GenericData.Record(elementWrapper);
    wrapperValue.put("array_element", recordValue);
    genericRecord.put(topLevelArrayOfWrappers.name(), Arrays.asList(wrapperValue));

    final AvroParquetWriter.Builder<GenericRecord> fileWriterBuilder =
            AvroParquetWriter.<GenericRecord>builder(path).withSchema(topLevelElement);
    final ParquetWriter<GenericRecord> fileWriter = fileWriterBuilder.build();
    fileWriter.write(genericRecord);
    fileWriter.close();

    final ParquetFileReader parquetFileReader = ParquetFileReader.open(new Configuration(), path);
    final FileMetaData fileMetaData = parquetFileReader.getFileMetaData();
    System.out.println(fileMetaData.getSchema().toString());

    final ParquetReader<GenericRecord> parquetReader = AvroParquetReader.<GenericRecord>builder(path).build();
    final GenericRecord read = parquetReader.read();
}
I have also opened an issue in the Apache Parquet JIRA: https://issues.apache.org/jira/browse/PARQUET-1254