I'm having trouble using Parquet's UnboundRecordFilter on a nullable column.
My Avro record is as follows:
[{"namespace": "com.test.avro",
"type": "record",
"name": "User",
"fields": [
{"name": "userId", "type": "long"},
{"name": "userType", "type": ["null", "string"], "default":null}
]
}]
I filter with a column predicate to read the records whose userType is null:
public class NullableUserTypeFilter implements UnboundRecordFilter {
    NullUserTypePredicateFunction nullPredicate;

    public NullableUserTypeFilter() {
        this.nullPredicate = new NullUserTypePredicateFunction();
    }

    @Override
    public RecordFilter bind(Iterable<ColumnReader> readers) {
        return ColumnRecordFilter.column("userType", nullPredicate).bind(readers);
    }

    class NullUserTypePredicateFunction implements Predicate {
        public NullUserTypePredicateFunction() {}

        @Override
        public boolean apply(ColumnReader input) {
            return input.getBinary() == null || input.getBinary().toStringUsingUTF8() == null;
        }
    }
}
When I run my job, I get the following exception:
java.lang.Exception: org.apache.parquet.io.ParquetDecodingException: Can not read value at 1 in block 0 in file file:/Users/kinga/repo/test/test-parquet/target/input/UserSnapshot/0/users-2000.01.01-test.parquet
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 1 in block 0 in file file:/Users/kinga/repo/test/test-parquet/target/input/UserSnapshot/0/users-2000.01.01-test.parquet
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:243)
at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:227)
at org.apache.crunch.impl.mr.run.CrunchRecordReader.nextKeyValue(CrunchRecordReader.java:146)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:553)
at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:784)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.parquet.io.ParquetDecodingException: Can't read value in column [userType] BINARY at value 97 out of 100, 97 out of 100 in currentPage. repetition level: 0, definition level: 1
at org.apache.parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:483)
at org.apache.parquet.column.impl.ColumnReaderImpl.getBinary(ColumnReaderImpl.java:416)
at com.test.parquet.filter.NullableUserTypeFilter$NullUserTypePredicateFunction.apply(NullableUserTypeFilter.java:31)
at org.apache.parquet.filter.ColumnRecordFilter.isMatch(ColumnRecordFilter.java:72)
at org.apache.parquet.io.FilteredRecordReader.skipToMatch(FilteredRecordReader.java:80)
at org.apache.parquet.io.FilteredRecordReader.read(FilteredRecordReader.java:60)
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:218)
... 14 more
Caused by: java.lang.IllegalArgumentException: Reading past RLE/BitPacking stream.
at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
at org.apache.parquet.column.values.rle.RunLengthBitPackingHybridDecoder.readNext(RunLengthBitPackingHybridDecoder.java:82)
at org.apache.parquet.column.values.rle.RunLengthBitPackingHybridDecoder.readInt(RunLengthBitPackingHybridDecoder.java:64)
at org.apache.parquet.column.values.dictionary.DictionaryValuesReader.readValueDictionaryId(DictionaryValuesReader.java:76)
at org.apache.parquet.column.impl.ColumnReaderImpl$1.read(ColumnReaderImpl.java:166)
at org.apache.parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:464)
... 20 more
The problem occurs when reading records that have null values. What is the correct way to handle nullable fields?
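
A note on one possible cause, offered as an assumption rather than a confirmed diagnosis: Parquet does not write null values into a column's value stream. It records them only through definition levels. For an optional top-level column such as userType the maximum definition level is 1, so a level of 0 marks a null entry. Calling getBinary() on such an entry would then try to decode a value that was never written, which could eventually run past the end of the encoded stream. The sketch below models a null check based on levels only, using plain arrays. The class NullByDefinitionLevel and its methods are hypothetical and not part of Parquet.

```java
public class NullByDefinitionLevel {

    /**
     * An entry is null when its definition level is below the column's
     * maximum definition level. No value is decoded to decide this.
     */
    static boolean isNull(int currentDefinitionLevel, int maxDefinitionLevel) {
        return currentDefinitionLevel < maxDefinitionLevel;
    }

    /** Counts null entries from the definition levels alone. */
    static int countNulls(int[] definitionLevels, int maxDefinitionLevel) {
        int nulls = 0;
        for (int level : definitionLevels) {
            if (isNull(level, maxDefinitionLevel)) {
                nulls++;
            }
        }
        return nulls;
    }

    public static void main(String[] args) {
        // Hypothetical levels for four records of an optional column (max level 1):
        // records 2 and 4 are null, so only two values exist in the value stream.
        int[] levels = {1, 0, 1, 0};
        System.out.println("null entries: " + countNulls(levels, 1));
    }
}
```

In a real predicate the equivalent test would presumably compare input.getCurrentDefinitionLevel() with input.getDescriptor().getMaxDefinitionLevel() and return without calling getBinary() at all. I have not verified this against the job above.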