适用于Java的优秀且有效的CSV / TSV Reader

时间:2012-12-14 13:52:33

标签: java csv large-files opencsv tsv

我正在尝试阅读大约CSVTSV(Tab sepperated)大约1000000行或更多行的文件。现在,我尝试使用opencsv阅读包含TSV行的~2500000,但它会向我发送java.lang.NullPointerException。它适用于带有TSV行的较小~250000个文件。所以我想知道是否还有其他Libraries支持阅读巨大的CSVTSV文件。你有什么想法吗?

每个对我的代码感兴趣的人(我缩短它,所以Try-Catch显然无效):

InputStreamReader in = null;
CSVReader reader = null;
try {
    in = this.replaceBackSlashes();
    reader = new CSVReader(in, this.seperator, '\"', this.offset);
    ret = reader.readAll();
} finally {
    try {
        reader.close();
    } 
}

编辑:这是我构建InputStreamReader

的方法
private InputStreamReader replaceBackSlashes() throws Exception {
        FileInputStream fis = null;
        Scanner in = null;
        try {
            fis = new FileInputStream(this.csvFile);
            in = new Scanner(fis, this.encoding);
            ByteArrayOutputStream out = new ByteArrayOutputStream();

            while (in.hasNext()) {
                String nextLine = in.nextLine().replace("\\", "/");
                // nextLine = nextLine.replaceAll(" ", "");
                nextLine = nextLine.replaceAll("'", "");
                out.write(nextLine.getBytes());
                out.write("\n".getBytes());
            }

            return new InputStreamReader(new ByteArrayInputStream(out.toByteArray()));
        } catch (Exception e) {
            in.close();
            fis.close();
            this.logger.error("Problem at replaceBackSlashes", e);
        }
        throw new Exception();
    }

4 个答案:

答案 0 :(得分:12)

不要使用CSV解析器来解析TSV输入。例如,如果TSV具有带引号字符的字段,它将会中断。

uniVocity-parsers附带一个TSV解析器。您可以毫无问题地解析十亿行。

解析TSV输入的示例:

TsvParserSettings settings = new TsvParserSettings();
TsvParser parser = new TsvParser(settings);

// parses all rows in one go.
List<String[]> allRows = parser.parseAll(new FileReader(yourFile));

如果您的输入太大,则无法保存在内存中,请执行以下操作:

TsvParserSettings settings = new TsvParserSettings();

// all rows parsed from your input will be sent to this processor
ObjectRowProcessor rowProcessor = new ObjectRowProcessor() {
    @Override
    public void rowProcessed(Object[] row, ParsingContext context) {
        //here is the row. Let's just print it.
        System.out.println(Arrays.toString(row));
    }
};
// the ObjectRowProcessor supports conversions from String to whatever you need:
// converts values in columns 2 and 5 to BigDecimal
rowProcessor.convertIndexes(Conversions.toBigDecimal()).set(2, 5);

// converts the values in columns "Description" and "Model". Applies trim and to lowercase to the values in these columns.
rowProcessor.convertFields(Conversions.trim(), Conversions.toLowerCase()).set("Description", "Model");

//configures to use the RowProcessor
settings.setRowProcessor(rowProcessor);

TsvParser parser = new TsvParser(settings);
//parses everything. All rows will be pumped into your RowProcessor.
parser.parse(new FileReader(yourFile));

披露:我是这个图书馆的作者。它是开源和免费的(Apache V2.0许可证)。

答案 1 :(得分:6)

我还没有尝试过,但我之前曾调查过superCSV。

http://sourceforge.net/projects/supercsv/

http://supercsv.sourceforge.net/

检查这是否适用于你,250万行。

答案 2 :(得分:1)

尝试按照Satish的建议切换库。如果这没有帮助,您必须将整个文件拆分为令牌并处理它们。

认为您的CSV没有逗号的转义字符

// r is the BufferedReader pointed at your file
String line;
StringBuilder file = new StringBuilder();
// load each line and append it to file.
while ((line=r.readLine())!=null){
    file.append(line);
}
// Make them to an array
String[] tokens = file.toString().split(",");

然后你可以处理它。在使用之前不要忘记修剪令牌。

答案 3 :(得分:1)

我不知道这个问题是否仍然有效,但这是我成功使用的问题。仍然可能必须实现更多接口,例如Stream或Iterable,但是:

import java.io.Closeable;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.util.Scanner;

/** Reader for the tab separated values format (a basic table format without escapings or anything where the rows are separated by tabulators).**/
public class TSVReader implements Closeable 
{
    final Scanner in;
    String peekLine = null;

    public TSVReader(InputStream stream) throws FileNotFoundException
    {
        in = new Scanner(stream);
    }

    /**Constructs a new TSVReader which produces values scanned from the specified input stream.*/
    public TSVReader(File f) throws FileNotFoundException {in = new Scanner(f);}

    public boolean hasNextTokens()
    {
        if(peekLine!=null) return true;
        if(!in.hasNextLine()) {return false;}
        String line = in.nextLine().trim();
        if(line.isEmpty())  {return hasNextTokens();}
        this.peekLine = line;       
        return true;        
    }

    public String[] nextTokens()
    {
        if(!hasNextTokens()) return null;       
        String[] tokens = peekLine.split("[\\s\t]+");
//      System.out.println(Arrays.toString(tokens));
        peekLine=null;      
        return tokens;
    }

    @Override public void close() throws IOException {in.close();}
}