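I want to compare two CSV files. Each file can be up to 20 MB. Each line has a key followed by the data, so the format is key,data, but the data itself is also comma-separated.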
csv1.csv
KEY , DATA
AB45,12,15,65,NN
AB46,12,15,64,YY
AB47,45,85,95,YN
csv2.csv
AB45,12,15,65,NN
AB46,15,15,65,YY
AB48,65,45,60,YY
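What I want to do is read both files and compare the data for each key.
I'm considering adding each file, line by line, to a TreeMap. I could then compare each set of data for a given key and, if there is a difference, write it to another file.
Any suggestions? I'm not sure how to read the files so that the key and the data are extracted efficiently.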
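Answer 0 (score: 3)
Use a dedicated CSV parsing library to speed things up. With uniVocity-parsers you can parse these 20 MB files in 100 ms or less. The solution below takes care not to load too much data into memory: only the first file is held in a map, and the second one is streamed row by row. Take a look at the library's tutorial; there are many ways to do what you need with this library.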
import com.univocity.parsers.common.ParsingContext;
import com.univocity.parsers.common.processor.RowProcessor;
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;
import com.univocity.parsers.csv.CsvWriter;
import com.univocity.parsers.csv.CsvWriterSettings;

import java.io.File;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CsvDiff {

    public static void main(String... args) {
        // First we parse one file (ideally the smaller one)
        CsvParserSettings settings = new CsvParserSettings();
        // Here we tell the parser to read the CSV headers
        settings.setHeaderExtractionEnabled(true);
        CsvParser parser = new CsvParser(settings);

        // Parse all data into a list.
        List<String[]> records = parser.parseAll(new File("/path/to/csv1.csv"));

        // Convert that list into a map. The first column of this input will produce the keys.
        Map<String, String[]> mapOfRecords = toMap(records);

        // This is where the magic happens.
        processFile(new File("/path/to/csv2.csv"), new File("/path/to/diff.csv"), mapOfRecords);
    }

    /* Converts a list of records to a map. Uses the element at index 0 as the key. */
    private static Map<String, String[]> toMap(List<String[]> records) {
        Map<String, String[]> map = new HashMap<String, String[]>();
        for (String[] row : records) {
            // Column 0 will always have an ID.
            map.put(row[0], row);
        }
        return map;
    }

    private static void processFile(final File input, final File output, final Map<String, String[]> mapOfExistingRecords) {
        // Configures a new parser again
        CsvParserSettings settings = new CsvParserSettings();
        settings.setHeaderExtractionEnabled(true);

        // All parsed rows will be submitted to the following processor.
        // This way you won't have to store all rows in memory.
        settings.setProcessor(new RowProcessor() {
            // Will write the changed rows to another file
            CsvWriter writer;

            @Override
            public void processStarted(ParsingContext context) {
                CsvWriterSettings settings = new CsvWriterSettings(); // configure at will
                writer = new CsvWriter(output, settings);
            }

            @Override
            public void rowProcessed(String[] row, ParsingContext context) {
                // Incoming rows will have the ID at index 0.
                // If the map contains the ID, we'll get the row from the first file.
                String[] existingRow = mapOfExistingRecords.get(row[0]);
                // A row whose key is absent from the first file yields null here,
                // so it is written to the output as well.
                if (!Arrays.equals(row, existingRow)) {
                    writer.writeRow(row);
                }
            }

            @Override
            public void processEnded(ParsingContext context) {
                writer.close();
            }
        });

        CsvParser parser = new CsvParser(settings);
        // The parse() method will submit all rows to the RowProcessor defined above.
        // All differences will be written to the output file.
        parser.parse(input);
    }
}
This should work fine. I hope it helps.
Disclosure: I am the author of this library. It is open source and free (Apache V2.0 license).
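Answer 1 (score: 0)
I do a lot of CSV file comparison for my job. I didn't know Python before I started, but I picked it up quickly. If you want to compare CSV files quickly, Python is a great way to go, and it's easy to pick up if you already know Java.
I adapted a script I use to fit your basic use case (you'll need to modify it a bit more to do exactly what you want). It runs in a few seconds when I use it to compare CSV files with millions of rows. If you need to do this in Java, you can pretty much port it to a few Java methods, using a similar CSV library to replace all of the csv functions below.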
import csv
import itertools
import sys


def getKeyPosition(header_row, key_value):
    # Return the index of the column named key_value.
    # Header cells are stripped because the sample header is "KEY , DATA".
    for position, header in enumerate(header_row):
        if header.strip() == key_value:
            return position
    raise ValueError("No column named %r in header %r" % (key_value, header_row))


# This will create a dictionary of your rows by their key (key_position is the column index).
def getKeyDict(csv_reader, key_position):
    key_dict = {}
    for row in csv_reader:
        if not row:
            continue  # skip blank lines
        key = row[key_position]
        # My use case requires a lot of checking for duplicates.
        # Only the first row seen for a key is kept.
        if key in key_dict:
            print("Duplicate key in file: " + key)
        else:
            key_dict[key] = row
    return key_dict


def main():
    # Written for Python 3. Both input files are expected to start with a header row
    # and to use the same column layout.
    with open(sys.argv[1], newline="") as f1, \
            open(sys.argv[2], newline="") as f2, \
            open("KeyDifferenceFile.csv", "w", newline="") as outputFile:
        f1_csv = csv.reader(f1)
        f2_csv = csv.reader(f2)
        f1_header = next(f1_csv)
        f2_header = next(f2_csv)
        f1_header_key_position = getKeyPosition(f1_header, "KEY")
        f2_header_key_position = getKeyPosition(f2_header, "KEY")
        f1_row_dict = getKeyDict(f1_csv, f1_header_key_position)
        f2_row_dict = getKeyDict(f2_csv, f2_header_key_position)
        writer = csv.writer(outputFile)
        writer.writerow(f1_header)

        # Here's the logic for comparing rows
        for key, row_1 in f1_row_dict.items():
            # Do whatever comparisons you need here.
            if key not in f2_row_dict:
                print("Oh no, key " + key + " doesn't exist in file 2")
                continue
            row_2 = f2_row_dict[key]
            if row_1 != row_2:
                print("Oh no, the two rows for key " + key + " don't match!")
                # Here's how you'd write the rows: the key first, then one cell per
                # data column, left empty where both files agree. The sample data
                # contains text such as "NN", so the values are compared as strings
                # instead of being subtracted.
                row_to_write = [key]
                columns = itertools.zip_longest(row_1, row_2, fillvalue="")
                for position, (row_1_column, row_2_column) in enumerate(columns):
                    if position == f1_header_key_position:
                        continue
                    if row_1_column == row_2_column:
                        row_to_write.append("")
                    else:
                        row_to_write.append(row_1_column + " -> " + row_2_column)
                writer.writerow(row_to_write)
    # The with statement closes all three files, even if an error occurs.


if __name__ == "__main__":
    main()
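To make the result of a key-based comparison concrete, here is a minimal Python 3 sketch run on the sample data from the question. It is an illustration under stated assumptions, not part of the script above: `load_rows` and `compare_by_key` are hypothetical helper names, the first column is assumed to always be the key, both inputs are assumed to start with a header row (the question's csv2 listing shows none), and the data is read from in-memory strings so the example is self-contained.

```python
import csv
import io

# Sample data from the question. The header on CSV2 is an assumption.
CSV1 = """KEY , DATA
AB45,12,15,65,NN
AB46,12,15,64,YY
AB47,45,85,95,YN
"""

CSV2 = """KEY , DATA
AB45,12,15,65,NN
AB46,15,15,65,YY
AB48,65,45,60,YY
"""


def load_rows(text):
    """Parse CSV text into {key: list of data columns}, skipping the header row."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # assumption: the first line is a header
    return {row[0]: row[1:] for row in reader if row}


def compare_by_key(rows_1, rows_2):
    """Return sorted lists of keys: changed, only in file 1, only in file 2."""
    changed = sorted(k for k in rows_1 if k in rows_2 and rows_1[k] != rows_2[k])
    only_in_1 = sorted(k for k in rows_1 if k not in rows_2)
    only_in_2 = sorted(k for k in rows_2 if k not in rows_1)
    return changed, only_in_1, only_in_2


changed, only_in_1, only_in_2 = compare_by_key(load_rows(CSV1), load_rows(CSV2))
print("changed:", changed)
print("only in file 1:", only_in_1)
print("only in file 2:", only_in_2)
```

On this sample, AB46 is reported as changed, AB47 exists only in the first file, and AB48 exists only in the second. Keeping the three cases separate makes it easy to decide later which of them should end up in the difference file.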