Reading a Parquet file whose blocks are larger than memory

Date: 2018-07-26 15:02:35

Tags: parquet

I am using the following Parquet dependencies to read Parquet files:

    <dependency>
        <groupId>org.apache.parquet</groupId>
        <artifactId>parquet-common</artifactId>
        <version>1.8.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.parquet</groupId>
        <artifactId>parquet-encoding</artifactId>
        <version>1.8.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.parquet</groupId>
        <artifactId>parquet-column</artifactId>
        <version>1.8.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.parquet</groupId>
        <artifactId>parquet-hadoop</artifactId>
        <version>1.8.1</version>
    </dependency>

When the file is written with a large block size, a 350 MB file ends up as a single block, and fileReader.readNextRowGroup() throws a heap error and runs out of memory. See the code at the bottom.

Caused by: java.lang.OutOfMemoryError: Java heap space
 at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:778)
 at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:511)

What is the correct way to read a file with blocks this large? I also tried reading one column at a time, but in the edge case of a single-column file where one column chunk is larger than the JVM heap, the same problem occurs.
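For reference, here is a minimal sketch of the column-at-a-time approach, built on the same 1.8.1 low-level reader constructor the code at the bottom uses; the class name, the args[0] path and the empty processing loop are illustrative, not the actual code. Because only one column's descriptors are passed to the reader, each readNextRowGroup() call should only buffer that column's chunks, but as noted above this still cannot help when a single column chunk alone exceeds the heap.

    import java.util.Collections;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.column.page.PageReadStore;
    import org.apache.parquet.example.data.Group;
    import org.apache.parquet.example.data.simple.convert.GroupRecordConverter;
    import org.apache.parquet.format.converter.ParquetMetadataConverter;
    import org.apache.parquet.hadoop.ParquetFileReader;
    import org.apache.parquet.hadoop.metadata.BlockMetaData;
    import org.apache.parquet.hadoop.metadata.ParquetMetadata;
    import org.apache.parquet.io.ColumnIOFactory;
    import org.apache.parquet.io.MessageColumnIO;
    import org.apache.parquet.io.RecordReader;
    import org.apache.parquet.schema.MessageType;
    import org.apache.parquet.schema.Type;

    public class SingleColumnRead {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path path = new Path(args[0]); // illustrative: path to the parquet file

            ParquetMetadata footer = ParquetFileReader.readFooter(conf, path, ParquetMetadataConverter.NO_FILTER);
            MessageType fullSchema = footer.getFileMetaData().getSchema();

            // Iterate the file one top-level field at a time so that each
            // readNextRowGroup() call only buffers that field's column chunks.
            for (Type field : fullSchema.getFields()) {
                MessageType projection = new MessageType(fullSchema.getName(), field);

                for (BlockMetaData block : footer.getBlocks()) {
                    try (ParquetFileReader reader = new ParquetFileReader(
                            conf, footer.getFileMetaData(), path,
                            Collections.singletonList(block), projection.getColumns())) {

                        PageReadStore pages;
                        while ((pages = reader.readNextRowGroup()) != null) {
                            MessageColumnIO columnIO = new ColumnIOFactory().getColumnIO(projection);
                            RecordReader<Group> records =
                                    columnIO.getRecordReader(pages, new GroupRecordConverter(projection));
                            for (long row = 0; row < pages.getRowCount(); row++) {
                                Group group = records.read();
                                // process the values of this single column here
                            }
                        }
                    }
                }
            }
        }
    }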

Is there no way to read it as a stream?

We want to avoid using Hadoop or Spark.
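For reference, record-at-a-time iteration with the higher-level ParquetReader API from parquet-hadoop would look roughly like the sketch below. It needs no Hadoop cluster or Spark, only the same Hadoop Configuration/Path classes the code at the bottom already uses. The class name and args[0] path are illustrative, and note the caveat in the comments: this API still buffers one full row group's column chunks internally, so on its own it does not avoid the OutOfMemoryError above.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.example.data.Group;
    import org.apache.parquet.hadoop.ParquetReader;
    import org.apache.parquet.hadoop.example.GroupReadSupport;

    public class RecordStreamRead {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path path = new Path(args[0]); // illustrative: path to the parquet file

            // Record-at-a-time iteration; the calling code never sees row groups.
            // Internally, however, ParquetReader still loads one row group's
            // column chunks at a time, so a single row group larger than the
            // heap will still fail with OutOfMemoryError.
            try (ParquetReader<Group> reader =
                         ParquetReader.builder(new GroupReadSupport(), path).withConf(conf).build()) {
                Group record;
                while ((record = reader.read()) != null) {
                    // process one record at a time here
                }
            }
        }
    }

The original code that reproduces the OutOfMemoryError follows.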

    try {
        ParquetMetadata readFooter = ParquetFileReader.readFooter(hfsConfig, path, ParquetMetadataConverter.NO_FILTER);
        MessageType schema = readFooter.getFileMetaData().getSchema();
        long max = readFooter.getBlocks().stream()
                .reduce(0L,
                        (left, right) -> left > right.getTotalByteSize() ? left : right.getTotalByteSize(),
                        (leftl, rightl) -> leftl > rightl ? leftl : rightl);

        LOGGER.info("blocks: {} largest block {}", readFooter.getBlocks().size(), max);
        for (BlockMetaData block : readFooter.getBlocks()) {
            try {
                fileReader = new ParquetFileReader(hfsConfig, readFooter.getFileMetaData(), path, Collections
                        .singletonList(block), schema.getColumns());
                PageReadStore pages;

                while (null != (pages = fileReader.readNextRowGroup())) {//exception gets thrown here on blocks larger than jvm memory
                    final long rows = pages.getRowCount();

                    final MessageColumnIO columnIO = new ColumnIOFactory().getColumnIO(schema);
                    final RecordReader<Group> recordReader = columnIO.getRecordReader(pages, new GroupRecordConverter(schema));

                    for (int i = 0; i < rows; i++) {
                        final Group group = recordReader.read();
                        int fieldCount = group.getType().getFieldCount();

                        for (int field = 0; field < fieldCount; field++) {
                            int valueCount = group.getFieldRepetitionCount(field);
                            Type fieldType = group.getType().getType(field);
                            String fieldName = fieldType.getName();

                            for (int index = 0; index < valueCount; index++) {

                            }
                        }
                    }
                }
            } catch (IOException e) {

                return Try.failure(e);
            } finally {
                try {
                    if (fileReader != null) {
                        fileReader.close();
                    }
                } catch (IOException ex) {

                }
            }
        }
    } catch (IOException e) {

        return Try.failure(e);
    }

0 Answers:

No answers yet.