Read in big csv file, validate and write out using uniVocity parser

Question

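I need to parse a big csv file (2 GB). The values have to be validated, rows containing "bad" fields must be dropped, and a new file containing only the valid rows should be written out.

I've chosen the uniVocity parsers library for this. Please help me understand whether this library is well suited to the task and what approach I should take.

What is the best way to organise read -> validate -> write in uniVocity, given the file size? Read in all rows at once, or use an iterator style? Where should parsed and validated rows be stored before they are written to the file?
Is there a way in uniVocity to access a row's values by index? Something like row.getValue(3)?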

Answer 1

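I'm the author of this library, let me try to help you out:

First, do not try to read all rows at once, as you will fill your memory with a large amount of data.
You can get row values by index: each row is delivered to you as a String[], so row[3] gives you the fourth value.

The faster approach to read/validate/write is to use a RowProcessor that holds a CsvWriter and decides when to write or skip a row. I think the following code will help you a bit: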

Define the output:

private CsvWriter createCsvWriter(File output, String encoding){
    CsvWriterSettings settings = new CsvWriterSettings();
    //configure the writer ...

    try {
        return new CsvWriter(new OutputStreamWriter(new FileOutputStream(output), encoding), settings);
    } catch (IOException e) {
        throw new IllegalArgumentException("Error writing to " + output.getAbsolutePath(), e);
    }
}

Redirect the input:

//this creates a row processor for our parser. It validates each row and sends them to the csv writer.
private RowProcessor createRowProcessor(File output, String encoding){
    final CsvWriter writer = createCsvWriter(output, encoding);
    return new AbstractRowProcessor() {

        @Override
        public void rowProcessed(String[] row, ParsingContext context) {
            if (shouldWriteRow(row)) {
                writer.writeRow(row);
            } else {
                //skip row
            }
        }

        private boolean shouldWriteRow(String[] row) {
            //your validation here
            return true;
        }

        @Override
        public void processEnded(ParsingContext context) {
            writer.close();
        }
    };
}
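
The shouldWriteRow placeholder above is where the validation goes, and because the row is a plain String[] the values are read by index. A minimal sketch of what such a check might look like, assuming a hypothetical layout of at least four columns where no field may be blank and column 3 must hold a non-negative integer. The class name RowValidator and the rules themselves are illustrative only and are not part of uniVocity:

```java
public class RowValidator {
    // Hypothetical rule: a valid row has at least this many columns.
    static final int EXPECTED_COLUMNS = 4;

    // Returns true when the row should be written to the output file.
    static boolean shouldWriteRow(String[] row) {
        if (row == null || row.length < EXPECTED_COLUMNS) {
            return false;
        }
        for (String value : row) {
            // Treat missing (null) and blank fields as "bad".
            if (value == null || value.trim().isEmpty()) {
                return false;
            }
        }
        try {
            // Access by index: column 3 must parse as a non-negative integer.
            return Integer.parseInt(row[3].trim()) >= 0;
        } catch (NumberFormatException e) {
            return false;
        }
    }
}
```

The body of this method could be dropped into the shouldWriteRow method of the row processor, with the rules replaced by whatever your data actually requires.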

Configure the parser:

public void readAndWrite(File input, File output, String encoding) {
    CsvParserSettings settings = new CsvParserSettings();
    //configure the parser here

    //tells the parser to send each row to the custom processor, which will validate and redirect all rows to the CsvWriter
    settings.setRowProcessor(createRowProcessor(output, encoding));

    CsvParser parser = new CsvParser(settings);
    try {
        parser.parse(new InputStreamReader(new FileInputStream(input), encoding));
    } catch (IOException e) {
        throw new IllegalStateException("Unable to open input file " + input.getAbsolutePath(), e);
    }
}

For better performance you can also wrap the row processor in a ConcurrentRowProcessor.
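
A minimal sketch of that wrapping, assuming the same settings object and the createRowProcessor method used elsewhere in this answer. It is a fragment meant to replace the setRowProcessor call inside readAndWrite, not a standalone program:

```java
// Assumes `settings`, `output` and `encoding` are the variables from readAndWrite,
// and that com.univocity.parsers.common.processor.ConcurrentRowProcessor is imported.
settings.setRowProcessor(new ConcurrentRowProcessor(createRowProcessor(output, encoding)));
```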

With this, the writing of rows is performed in a separate thread.
