I am using the following Parquet dependencies to read Parquet files:
<dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-common</artifactId>
    <version>1.8.1</version>
</dependency>
<dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-encoding</artifactId>
    <version>1.8.1</version>
</dependency>
<dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-column</artifactId>
    <version>1.8.1</version>
</dependency>
<dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-hadoop</artifactId>
    <version>1.8.1</version>
</dependency>
When a file is written with large row groups, for example a 350 MB file consisting of a single row group, fileReader.readNextRowGroup() throws an OutOfMemoryError (Java heap space). See the code at the bottom.
Caused by: java.lang.OutOfMemoryError: Java heap space
at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:778)
at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:511)
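The trace shows readNextRowGroup() pulling every column chunk of the row group into memory in a single readAll call. As a stopgap, the footer metadata already fetched in the code below can at least flag row groups that cannot fit; this is a minimal sketch reusing readFooter and LOGGER from that snippet, and getTotalByteSize() reports the uncompressed size, so the comparison is only approximate:

    // Sketch: flag row groups that readNextRowGroup() cannot possibly materialize.
    long heapBytes = Runtime.getRuntime().maxMemory();
    for (BlockMetaData block : readFooter.getBlocks()) {
        if (block.getTotalByteSize() > heapBytes) {
            LOGGER.warn("row group of {} bytes exceeds {} bytes of heap; it cannot be read in one readNextRowGroup() call",
                    block.getTotalByteSize(), heapBytes);
        }
    }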
What is the correct way to read a file with row groups this large? I also tried reading one column at a time (a sketch of that attempt follows), but in the edge case of a single-column file, a single column chunk larger than the JVM heap hits the same problem.
Is there no way to stream the data instead of loading whole row groups?
We want to avoid using Hadoop or Spark.
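For reference, the column-at-a-time attempt looks roughly like this. It is a sketch against the same 1.8.1 constructor used in the full snippet below (hfsConfig and path as there); it helps for wide files, but readNextRowGroup() still materializes the entire requested column chunk, so one huge chunk in a single-column file still fails:

    import java.io.IOException;
    import java.util.Collections;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.column.ColumnDescriptor;
    import org.apache.parquet.column.page.PageReadStore;
    import org.apache.parquet.format.converter.ParquetMetadataConverter;
    import org.apache.parquet.hadoop.ParquetFileReader;
    import org.apache.parquet.hadoop.metadata.BlockMetaData;
    import org.apache.parquet.hadoop.metadata.ParquetMetadata;
    import org.apache.parquet.schema.MessageType;

    public static void readColumnByColumn(Configuration conf, Path path) throws IOException {
        ParquetMetadata footer = ParquetFileReader.readFooter(conf, path, ParquetMetadataConverter.NO_FILTER);
        MessageType schema = footer.getFileMetaData().getSchema();
        for (BlockMetaData block : footer.getBlocks()) {
            // One reader per (row group, column) pair: only that column chunk is requested.
            for (ColumnDescriptor column : schema.getColumns()) {
                ParquetFileReader reader = new ParquetFileReader(conf, footer.getFileMetaData(), path,
                        Collections.singletonList(block), Collections.singletonList(column));
                try {
                    // Still loads the entire column chunk into memory, so a
                    // single-column file with one oversized chunk OOMs here too.
                    PageReadStore pages = reader.readNextRowGroup();
                    // ... decode pages for this column ...
                } finally {
                    reader.close();
                }
            }
        }
    }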
try {
    ParquetMetadata readFooter = ParquetFileReader.readFooter(hfsConfig, path, ParquetMetadataConverter.NO_FILTER);
    MessageType schema = readFooter.getFileMetaData().getSchema();
    // Largest row group in the file, taken from the footer metadata.
    long max = readFooter.getBlocks().stream()
            .mapToLong(BlockMetaData::getTotalByteSize)
            .max()
            .orElse(0L);
    LOGGER.info("blocks: {} largest block {}", readFooter.getBlocks().size(), max);
    for (BlockMetaData block : readFooter.getBlocks()) {
        ParquetFileReader fileReader = null;
        try {
            // One reader per row group, reading all columns of that group.
            fileReader = new ParquetFileReader(hfsConfig, readFooter.getFileMetaData(), path,
                    Collections.singletonList(block), schema.getColumns());
            PageReadStore pages;
            // Exception gets thrown here on blocks larger than JVM memory.
            while (null != (pages = fileReader.readNextRowGroup())) {
                final long rows = pages.getRowCount();
                final MessageColumnIO columnIO = new ColumnIOFactory().getColumnIO(schema);
                final RecordReader<Group> recordReader = columnIO.getRecordReader(pages, new GroupRecordConverter(schema));
                for (int i = 0; i < rows; i++) {
                    final Group group = recordReader.read();
                    int fieldCount = group.getType().getFieldCount();
                    for (int field = 0; field < fieldCount; field++) {
                        int valueCount = group.getFieldRepetitionCount(field);
                        Type fieldType = group.getType().getType(field);
                        String fieldName = fieldType.getName();
                        for (int index = 0; index < valueCount; index++) {
                            // per-value handling omitted
                        }
                    }
                }
            }
        } catch (IOException e) {
            return Try.failure(e);
        } finally {
            try {
                if (fileReader != null) {
                    fileReader.close();
                }
            } catch (IOException ex) {
                // ignore failures while closing the reader
            }
        }
    }
} catch (IOException e) {
    return Try.failure(e);
}
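For completeness: since the problem starts with how the files are written, if we also control the writer, the OOM can be avoided at write time by shrinking the row-group (block) size so every row group fits in the reader's heap. A minimal sketch against the same 1.8.1 API using the example Group model; the 64 MB / 1 MB sizes are assumptions, not recommendations:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.example.data.Group;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.apache.parquet.hadoop.example.GroupWriteSupport;
    import org.apache.parquet.hadoop.metadata.CompressionCodecName;
    import org.apache.parquet.schema.MessageType;

    public static ParquetWriter<Group> smallRowGroupWriter(Configuration conf, Path out, MessageType schema)
            throws IOException {
        GroupWriteSupport.setSchema(schema, conf);
        int rowGroupSize = 64 * 1024 * 1024; // 64 MB row groups instead of one 350 MB group
        int pageSize = 1024 * 1024;          // 1 MB pages
        return new ParquetWriter<>(out, new GroupWriteSupport(),
                CompressionCodecName.SNAPPY, rowGroupSize, pageSize,
                pageSize,                            // dictionary page size
                true,                                // enable dictionary encoding
                false,                               // disable validation
                ParquetWriter.DEFAULT_WRITER_VERSION,
                conf);
    }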