I'm working on a program that reads data from 2 large CSV files (line by line), compares the array elements from both files, and, when a match is found, writes the necessary data to a 3rd file. My only problem is that it is very slow: it reads 1-2 lines per second, which is extremely slow considering I have millions of records. Any ideas on how to make it faster? Here is my code:
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Scanner;

public class ReadWriteCsv {
    public static void main(String[] args) throws IOException {
        FileInputStream inputStream = null;
        FileInputStream inputStream2 = null;
        Scanner sc = null;
        Scanner sc2 = null;
        String csvSeparator = ",";
        String line;
        String line2;
        String path = "D:/test1.csv";
        String path2 = "D:/test2.csv";
        String path3 = "D:/newResults.csv";
        String[] columns;
        String[] columns2;
        Boolean matchFound = false;
        int count = 0;
        StringBuilder builder = new StringBuilder();
        FileWriter writer = new FileWriter(path3);
        try {
            // specifies where to take the files from
            inputStream = new FileInputStream(path);
            inputStream2 = new FileInputStream(path2);
            // creating scanners for files
            sc = new Scanner(inputStream, "UTF-8");
            // while there is another line available do:
            while (sc.hasNextLine()) {
                count++;
                // storing the current line in the temporary variable "line"
                line = sc.nextLine();
                System.out.println("Number of lines read so far: " + count);
                // defines the columns[] as the line being split by ","
                columns = line.split(",");
                inputStream2 = new FileInputStream(path2);
                sc2 = new Scanner(inputStream2, "UTF-8");
                // checks if there is a line available in File2 and goes in the
                // while loop, reading file2
                while (!matchFound && sc2.hasNextLine()) {
                    line2 = sc2.nextLine();
                    columns2 = line2.split(",");
                    if (columns[3].equals(columns2[1])) {
                        matchFound = true;
                        builder.append(columns[3]).append(csvSeparator);
                        builder.append(columns[1]).append(csvSeparator);
                        builder.append(columns2[2]).append(csvSeparator);
                        builder.append(columns2[3]).append("\n");
                        String result = builder.toString();
                        writer.write(result);
                    }
                }
                builder.setLength(0);
                sc2.close();
                matchFound = false;
            }
            if (sc.ioException() != null) {
                throw sc.ioException();
            }
        } finally {
            // then I close my inputStreams, scanners and writer
            if (inputStream != null) {
                inputStream.close();
            }
            if (inputStream2 != null) {
                inputStream2.close();
            }
            if (sc != null) {
                sc.close();
            }
            writer.close();
        }
    }
}
Answer 0 (score: 1)
Use an existing CSV library rather than rolling your own. It will be far more robust than what you have now.

However, your problem is not CSV parsing speed. Your algorithm is O(n^2): for every line in the first file, you scan the entire second file. This kind of algorithm blows up very quickly as the data grows, and with millions of rows you will run into trouble. You need a better algorithm.

The other problem is that you re-parse the second file on every scan. You should at least read it into memory as an ArrayList or something at the start of the program, so you only load and parse it once. A sketch combining both ideas follows below.
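For illustration, here is a minimal sketch of that approach (mine, not part of the original answer, with a hypothetical class name): the second file is read once into a HashMap keyed on the matched column, so each lookup from the first file is O(1). It deliberately keeps the question's naive split(",") and hard-coded paths, so it shares the same limitations with quoted fields:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class FastCsvJoin {
    public static void main(String[] args) throws IOException {
        // Pass 1: index file2 by the column used for matching (column 1).
        // putIfAbsent keeps the first occurrence, mirroring the original
        // inner loop, which stopped at the first match.
        Map<String, String[]> index = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader("D:/test2.csv"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] cols = line.split(",");
                index.putIfAbsent(cols[1], cols);
            }
        }
        // Pass 2: stream file1 and do an O(1) lookup per line instead of
        // rescanning file2, turning the O(n^2) join into O(n).
        try (BufferedReader in = new BufferedReader(new FileReader("D:/test1.csv"));
             FileWriter writer = new FileWriter("D:/newResults.csv")) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] cols = line.split(",");
                String[] match = index.get(cols[3]);
                if (match != null) {
                    writer.write(cols[3] + "," + cols[1] + ","
                            + match[2] + "," + match[3] + "\n");
                }
            }
        }
    }
}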
Answer 1 (score: 0)
Use univocity-parsers' CsvParser and this won't take more than a couple of seconds to process two files with 1 million rows each:
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

import java.io.File;
import java.util.Arrays;

public void diff(File leftInput, File rightInput) {
    CsvParserSettings settings = new CsvParserSettings(); // many config options here, check the tutorial

    CsvParser leftParser = new CsvParser(settings);
    CsvParser rightParser = new CsvParser(settings);

    leftParser.beginParsing(leftInput);
    rightParser.beginParsing(rightInput);

    String[] left;
    String[] right;

    int row = 0;
    // read both files in lockstep, one row at a time
    while ((left = leftParser.parseNext()) != null && (right = rightParser.parseNext()) != null) {
        row++;
        if (!Arrays.equals(left, right)) {
            System.out.println(row + ":\t" + Arrays.toString(left) + " != " + Arrays.toString(right));
        }
    }

    // closes both input files and releases parser resources
    leftParser.stopParsing();
    rightParser.stopParsing();
}
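A hypothetical call, reusing the paths from the question (assuming diff lives in some enclosing class):

diff(new File("D:/test1.csv"), new File("D:/test2.csv"));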
Disclosure: I am the author of this library. It is open source and free (Apache 2.0 license).