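I am trying to read large CSV and TSV (tab-separated) files with about 1,000,000 rows or more. I tried to read a TSV file containing ~2,500,000 lines with opencsv, but it throws a java.lang.NullPointerException. It works with smaller TSV files of ~250,000 lines. So I was wondering whether there are other libraries that support reading huge CSV and TSV files. Do you have any ideas?
For everyone interested in my code (I shortened it, so the try-catch is obviously invalid):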
InputStreamReader in = null;
CSVReader reader = null;
try {
in = this.replaceBackSlashes();
reader = new CSVReader(in, this.seperator, '\"', this.offset);
ret = reader.readAll();
} finally {
try {
reader.close();
}
}
Edit: this is how I build the InputStreamReader:
private InputStreamReader replaceBackSlashes() throws Exception {
FileInputStream fis = null;
Scanner in = null;
try {
fis = new FileInputStream(this.csvFile);
in = new Scanner(fis, this.encoding);
ByteArrayOutputStream out = new ByteArrayOutputStream();
while (in.hasNext()) {
String nextLine = in.nextLine().replace("\\", "/");
// nextLine = nextLine.replaceAll(" ", "");
nextLine = nextLine.replaceAll("'", "");
out.write(nextLine.getBytes());
out.write("\n".getBytes());
}
return new InputStreamReader(new ByteArrayInputStream(out.toByteArray()));
} catch (Exception e) {
in.close();
fis.close();
this.logger.error("Problem at replaceBackSlashes", e);
}
throw new Exception();
}
Answer 0 (score: 12)
Do not use a CSV parser to parse TSV input. It will break if the TSV has fields containing a quote character, for example.
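To make the quote problem concrete, here is a minimal sketch in plain Java. The class `QuoteInTsvDemo` and both methods are hypothetical names invented for this illustration; `splitCsvStyle` is a deliberately simplified quote-aware splitter and not the actual logic of opencsv or any other library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class QuoteInTsvDemo {

    /** Plain TSV split: every tab is a delimiter, quotes are ordinary characters. */
    static List<String> splitAsTsv(String line) {
        return Arrays.asList(line.split("\t", -1));
    }

    /**
     * Simplified CSV-style split (illustration only): a double quote toggles
     * "quoted mode", and delimiters seen while in quoted mode are kept as data.
     */
    static List<String> splitCsvStyle(String line, char delimiter) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;            // the quote itself is dropped
            } else if (c == delimiter && !inQuotes) {
                fields.add(current.toString());  // end of field
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        // Second column contains a literal quote (an inch mark).
        String line = "42\t3.5\" floppy\tin stock";
        System.out.println(splitAsTsv(line).size());          // 3 fields
        System.out.println(splitCsvStyle(line, '\t').size()); // 2 fields: columns 2 and 3 merged
    }
}
```

In this sketch the unbalanced quote swallows the following tab, so the second and third columns end up in one field. In a real file, a quote-aware parser could keep reading across line breaks until it finds the next quote.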
uniVocity-parsers comes with a TSV parser. You can parse a billion rows without problems.
Example of parsing a TSV input:
TsvParserSettings settings = new TsvParserSettings();
TsvParser parser = new TsvParser(settings);
// parses all rows in one go.
List<String[]> allRows = parser.parseAll(new FileReader(yourFile));
If your input is too big to be kept in memory, do this:
TsvParserSettings settings = new TsvParserSettings();
// all rows parsed from your input will be sent to this processor
ObjectRowProcessor rowProcessor = new ObjectRowProcessor() {
@Override
public void rowProcessed(Object[] row, ParsingContext context) {
//here is the row. Let's just print it.
System.out.println(Arrays.toString(row));
}
};
// the ObjectRowProcessor supports conversions from String to whatever you need:
// converts values in columns 2 and 5 to BigDecimal
rowProcessor.convertIndexes(Conversions.toBigDecimal()).set(2, 5);
// converts the values in columns "Description" and "Model". Applies trim and to lowercase to the values in these columns.
rowProcessor.convertFields(Conversions.trim(), Conversions.toLowerCase()).set("Description", "Model");
//configures to use the RowProcessor
settings.setRowProcessor(rowProcessor);
TsvParser parser = new TsvParser(settings);
//parses everything. All rows will be pumped into your RowProcessor.
parser.parse(new FileReader(yourFile));
Disclosure: I am the author of this library. It is open source and free (Apache V2.0 license).
Answer 1 (score: 6)
I have not tried it, but I investigated superCSV earlier.
http://sourceforge.net/projects/supercsv/
http://supercsv.sourceforge.net/
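Check whether it works for you with 2.5 million lines.
Answer 2 (score: 1)
Try switching libraries as suggested by Satish. If that does not help, you have to split the whole file into tokens and process them.
Assuming your CSV does not have any escape characters for commas: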
// r is the BufferedReader pointed at your file
String line;
StringBuilder file = new StringBuilder();
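// load each line and append it to file, followed by a comma so that the
// last token of one line is not merged with the first token of the next.
while ((line=r.readLine())!=null){
file.append(line).append(',');
}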
// Make them to an array
String[] tokens = file.toString().split(",");
Then you can process them. Do not forget to trim the tokens before using them.
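For files with millions of rows, the same split-and-trim idea can also be applied one line at a time, so the whole file never has to sit in a single StringBuilder. The following is only a sketch under that assumption; `LineByLineSplitter` and `forEachRow` are hypothetical names, and like the snippet above it assumes that fields contain no escaped or quoted delimiters.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.function.Consumer;

final class LineByLineSplitter {

    /**
     * Reads the input line by line, splits each non-empty line on the given
     * delimiter regex, trims every token and passes the row to the handler.
     * Returns the number of rows handled.
     */
    static long forEachRow(BufferedReader reader, String delimiterRegex,
                           Consumer<String[]> handler) {
        long rows = 0;
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {
                    continue;                               // skip blank lines
                }
                // limit -1 keeps trailing empty fields such as "1,2,"
                String[] tokens = line.split(delimiterRegex, -1);
                for (int i = 0; i < tokens.length; i++) {
                    tokens[i] = tokens[i].trim();
                }
                handler.accept(tokens);
                rows++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }

    public static void main(String[] args) {
        // In-memory sample input; a real caller would wrap a FileReader instead.
        BufferedReader reader = new BufferedReader(new StringReader("a, b ,c\n\n1,2,\n"));
        long count = forEachRow(reader, ",", row -> System.out.println(Arrays.toString(row)));
        System.out.println(count + " rows");
    }
}
```

Passing "\t" as the delimiter handles TSV input the same way. Only one row is held in memory at a time, unless the handler stores the rows itself.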
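Answer 3 (score: 1)
I don't know whether this question is still active, but here is the class I use successfully. It may still need to implement more interfaces, such as Stream or Iterable, however: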
import java.io.Closeable;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.util.Scanner;
/** Reader for the tab separated values format (a basic table format without escaping or anything, where the columns are separated by tabulators).**/
public class TSVReader implements Closeable
{
final Scanner in;
String peekLine = null;
/**Constructs a new TSVReader which produces values scanned from the specified input stream.*/
public TSVReader(InputStream stream) throws FileNotFoundException
{
in = new Scanner(stream);
}
/**Constructs a new TSVReader which produces values scanned from the specified file.*/
public TSVReader(File f) throws FileNotFoundException {in = new Scanner(f);}
public boolean hasNextTokens()
{
if(peekLine!=null) return true;
if(!in.hasNextLine()) {return false;}
String line = in.nextLine().trim();
if(line.isEmpty()) {return hasNextTokens();}
this.peekLine = line;
return true;
}
public String[] nextTokens()
{
if(!hasNextTokens()) return null;
String[] tokens = peekLine.split("[\\s\t]+");
// System.out.println(Arrays.toString(tokens));
peekLine=null;
return tokens;
}
@Override public void close() throws IOException {in.close();}
}
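As for the Stream interface mentioned above, one possible shape is a helper that exposes the rows as a lazy java.util.stream.Stream. This is a sketch, not part of the class above: `TsvStreams` and `rows` are hypothetical names, and unlike `TSVReader` it splits on tab characters only, so spaces inside a field are preserved.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.stream.Stream;

final class TsvStreams {

    /**
     * Lazily maps every non-blank line of the reader to its tab-separated
     * fields. Lines are read only as the stream is consumed.
     */
    static Stream<String[]> rows(BufferedReader reader) {
        return reader.lines()
                .filter(line -> !line.trim().isEmpty())   // skip blank lines
                .map(line -> line.split("\t", -1));       // -1 keeps trailing empty fields
    }

    public static void main(String[] args) {
        // In-memory sample input; a real caller would wrap a FileReader instead.
        BufferedReader reader =
                new BufferedReader(new StringReader("id\tname\n1\tfirst item\n\n2\t\n"));
        rows(reader).forEach(row ->
                System.out.println(row.length + " fields: " + String.join("|", row)));
    }
}
```

Because the stream is lazy, operations such as `limit` or `filter` can stop or thin out the reading without loading the whole file. The caller remains responsible for closing the underlying reader.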