Comparing two CSV files in Java

Date: 2016-12-05 22:22:30

Tags: java csv dictionary treemap

So I want to compare two CSV files, each of which can be up to 20 MB. Each line holds a key followed by its data (key,data), but the data itself is also comma-separated.

csv1.csv

KEY ,   DATA    

AB45,12,15,65,NN
AB46,12,15,64,YY
AB47,45,85,95,YN

csv2.csv

AB45,12,15,65,NN
AB46,15,15,65,YY
AB48,65,45,60,YY

What I want to do is read in both files and compare the data for each key.

I was thinking of adding each file, line by line, to a TreeMap. Then I could compare the data for a given key and, if there is a difference, write it out to another file.

Any suggestions? I'm not sure how to read the files so that I can extract the key and data efficiently.
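(For reference, the TreeMap idea described above could look roughly like this in plain JDK code. It's a minimal sketch: it assumes keys never contain a comma, so splitting each line at the first comma is enough, and it treats a header line like any other row.)

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Map;
import java.util.TreeMap;

public class CsvDiff {

    //Reads "key,data" lines into a sorted map of key -> full line.
    static TreeMap<String, String> load(File file) throws IOException {
        TreeMap<String, String> map = new TreeMap<String, String>();
        BufferedReader in = new BufferedReader(new FileReader(file));
        String line;
        while ((line = in.readLine()) != null) {
            //everything before the first comma is the key
            map.put(line.split(",", 2)[0], line);
        }
        in.close();
        return map;
    }

    public static void main(String[] args) throws IOException {
        TreeMap<String, String> first = load(new File("csv1.csv"));
        TreeMap<String, String> second = load(new File("csv2.csv"));

        PrintWriter diff = new PrintWriter(new File("diff.csv"));
        for (Map.Entry<String, String> entry : second.entrySet()) {
            //write rows of csv2 that are new or changed relative to csv1
            if (!entry.getValue().equals(first.get(entry.getKey()))) {
                diff.println(entry.getValue());
            }
        }
        diff.close();
    }
}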

2 Answers:

Answer 0 (score: 3)

Use a dedicated CSV parsing library to speed things up. With uniVocity-parsers you can parse these 20 MB files in 100 milliseconds or less. The solution below is written so you don't load too much data into memory. Check out the tutorial I linked above; there are many ways to accomplish what you need with this library.

First, we read one of the CSV files and generate a Map:

//imports used by the snippets in this answer
import java.io.File;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import com.univocity.parsers.common.ParsingContext;
import com.univocity.parsers.common.processor.RowProcessor;
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;
import com.univocity.parsers.csv.CsvWriter;
import com.univocity.parsers.csv.CsvWriterSettings;

public static void main(String... args) {
    //First we parse one file (ideally the smaller one)
    CsvParserSettings settings = new CsvParserSettings();
    //here we tell the parser to read the CSV headers
    settings.setHeaderExtractionEnabled(true);

    CsvParser parser = new CsvParser(settings);

    //Parse all data into a list.
    List<String[]> records = parser.parseAll(new File("/path/to/csv1.csv"));
    //Convert that list into a map. The first column of this input will produce the keys.
    Map<String, String[]> mapOfRecords = toMap(records);

    //this is where the magic happens.
    processFile(new File("/path/to/csv2.csv"), new File("/path/to/diff.csv"), mapOfRecords);
}

Here is the code that generates a Map from the list of records:

/* Converts a list of records to a map. Uses element at index 0 as the key */
private static Map<String, String[]> toMap(List<String[]> records) {
    HashMap<String, String[]> map = new HashMap<String, String[]>();
    for (String[] row : records) {
        //column 0 will always have an ID.
        map.put(row[0], row);
    }
    return map;
}

With the map of records, we can process your second file and generate another file with any updates found:

private static void processFile(final File input, final File output, final Map<String, String[]> mapOfExistingRecords) {
    //configures a new parser again
    CsvParserSettings settings = new CsvParserSettings();
    settings.setHeaderExtractionEnabled(true);

    //All parsed rows will be submitted to the following Processor. This way you won't have to store all rows in memory.
    settings.setProcessor(new RowProcessor() {
        //will write the changed rows to another file
        CsvWriter writer;

        @Override
        public void processStarted(ParsingContext context) {
            CsvWriterSettings settings = new CsvWriterSettings(); //configure at will
            writer = new CsvWriter(output, settings);
        }

        @Override
        public void rowProcessed(String[] row, ParsingContext context) {
            // Incoming rows will have the ID at index 0.
            // If the map contains the ID, we'll get the matching row.
            String[] existingRow = mapOfExistingRecords.get(row[0]);

            if (!Arrays.equals(row, existingRow)) {
                writer.writeRow(row);
            }
        }

        @Override
        public void processEnded(ParsingContext context) {
            writer.close();
        }
    });

    CsvParser parser = new CsvParser(settings);
    //the parse() method will submit all rows to the RowProcessor defined above. All differences will be
    //written to the output file.
    parser.parse(input);
}
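One thing to note: as written, the comparison only sees rows that appear in csv2, so a key present only in csv1 is never written out. If you need those too, one possible variation (a sketch) is to consume matched keys from the map and flush whatever is left when parsing ends:

@Override
public void rowProcessed(String[] row, ParsingContext context) {
    //remove() both looks up and consumes the entry, so anything still
    //in the map after parsing existed only in the first file.
    String[] existingRow = mapOfExistingRecords.remove(row[0]);

    if (!Arrays.equals(row, existingRow)) {
        writer.writeRow(row);
    }
}

@Override
public void processEnded(ParsingContext context) {
    //whatever is left was present in csv1 but missing from csv2
    for (String[] leftover : mapOfExistingRecords.values()) {
        writer.writeRow(leftover);
    }
    writer.close();
}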

This should work just fine. I hope it helps you.

Disclosure: I am the author of this library. It's open source and free (Apache 2.0 license).

Answer 1 (score: 0)

I do a lot of CSV file comparisons for my job. I didn't know Python before I started, but I picked it up quickly. If you want to compare CSV files fast, Python is a great way to do it, and it's easy to pick up if you know Java.

I modified a script I use so it fits your basic use case (you'll need to modify it a bit more to do exactly what you want). It runs in a few seconds when I use it to compare CSV files with millions of rows. If you need to do this in Java, you can pretty much transfer it over to a few Java methods, using a comparable CSV library to replace all of the csv functions below.

import csv, sys, itertools

def getKeyPosition(header_row, key_value):
    counter = 0
    for header in header_row:
        # strip() so headers with stray spaces (like "KEY ") still match
        if header.strip() == key_value:
            return counter
        counter += 1

# This will create a dictionary of your rows by their key. (key_position is the column location)
def getKeyDict(csv_reader, key_position):
    key_dict = {}

    row_counter = 0
    unique_records = 0
    for row in csv_reader:
        row_counter += 1
        if row[key_position] not in key_dict:
            key_dict.update({row[key_position]: row})
            unique_records += 1

    # My use case requires a lot of checking for duplicates 
    if unique_records != row_counter:
        print "Duplicate Keys in File"

    return key_dict

def main():
    f1 = open(sys.argv[1]) 
    f2 = open(sys.argv[2])
    f1_csv = csv.reader(f1)
    f2_csv = csv.reader(f2)

    f1_header = next(f1_csv)
    f2_header = next(f2_csv)
    f1_header_key_position = getKeyPosition(f1_header, "KEY")
    f2_header_key_position = getKeyPosition(f2_header, "KEY")

    f1_row_dict = getKeyDict(f1_csv, f1_header_key_position)
    f2_row_dict = getKeyDict(f2_csv, f2_header_key_position)

    outputFile = open("KeyDifferenceFile.csv", "wb")  # binary mode keeps Python 2's csv module from adding blank lines on Windows
    writer = csv.writer(outputFile)
    writer.writerow(f1_header)


    # Here's the logic for comparing rows
    for key, row_1 in f1_row_dict.iteritems():
        # Do whatever comparisons you need here.
        if key not in f2_row_dict:
            print "Oh no, this key doesn't exist in file 2"

        if key in f2_row_dict:
            row_2 = f2_row_dict.get(key)

            if row_1 != row_2:
                print "oh no, the two rows don't match!"

            # You can get more header keys to compare with if you want.
            data_position = getKeyPosition(f2_header, "DATA")
            row_1_data = row_1[data_position:]
            row_2_data = row_2[data_position:]
            if row_1_data != row_2_data:
                print "oh no, the data doesn't match!"

                # Here's how you'd write the rows.
                # Start with the key so the output lines up with the header.
                row_to_write = [key]

                # Differences between the data columns
                for row_1_column, row_2_column in itertools.izip(row_1_data, row_2_data):
                    try:
                        # numeric columns: write the difference
                        row_to_write.append(int(row_1_column) - int(row_2_column))
                    except ValueError:
                        # non-numeric columns (like the YY/NN flags): write both values
                        row_to_write.append(row_1_column + "|" + row_2_column)

                writer.writerow(row_to_write)


    # Make sure to close those files! 
    f1.close()
    f2.close()
    outputFile.close()

if __name__ == "__main__":
    main()
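A quick usage note: the script reads the two file paths from sys.argv, so you'd run it as python compare.py csv1.csv csv2.csv (the script name is whatever you saved it as), and the differences land in KeyDifferenceFile.csv in the working directory. Also note this is Python 2 code (print statements, iteritems, izip); under Python 3 you'd use print(), items() and zip() instead.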