Question

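I have two very large .csv files, and I want to compare two columns row by row using csv.DictReader or even pandas.

I need to check whether every row of a specific column is identical in both files. I have seen some suggestions here, but none of them worked in my case. The problem is that the second file I open is not iterated in the right order, even though the files are identical.

I have handled search-and-modify tasks quickly with openpyxl before, but since the csv files are several hundred MB in size, converting csv to Excel, even at runtime, does not seem like a good decision.

This is what I have so far in terms of code: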

import csv

class CsvCompareTester:

    work_csv_path = None
    test_csv_path = None

    @staticmethod
    def insert_file_paths():
        print()
        print('Enter the full absolute path of the WORK .csv file:')
        CsvCompareTester.work_csv_path = input()

        print('Enter the full absolute path of the TEST .csv file:')
        CsvCompareTester.test_csv_path = input()

    @staticmethod
    def compare_files(work_csv_file, test_csv_file):

        work_csv_obj = csv.DictReader(work_csv_file, delimiter=";")
        test_csv_obj = csv.DictReader(test_csv_file, delimiter=";")

        for work_row in work_csv_obj:
            for test_row in test_csv_obj:
                if work_row == test_row:
                    print('ALL CLEAR')
                    print(str(work_row))
                    print(str(test_row))
                    print()
                else:
                    print("STRINGS DON'T MATCH")
                    print(str(work_row))
                    print(str(test_row))
                    print()


if __name__ == "__main__":
    csv_tester = CsvCompareTester()
    csv_tester.insert_file_paths()

    with open(CsvCompareTester.work_csv_path) as work_file:
        with open(CsvCompareTester.test_csv_path) as test_file:
            csv_tester.compare_files(work_file, test_file)

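How can I iterate over the rows of the .csv files while still being able to address specific rows and columns by key or value (which would surely cut down on useless iterations)? For some reason, in the code above every row from the first file fails to match the rows in the second file. The files are identical and have the same order of entries; I have double-checked that. Why isn't the second file iterated from start to finish in the same way as the first?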

Answer 1

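The problem is how you iterate over the files. The way you are doing it, the code tries to compare every row of the first file with every row of the second file. On top of that, a csv.DictReader is a one-pass iterator, so the inner loop consumes the whole second file while the outer loop is still on its first row, and nothing is left to compare against afterwards. Instead, you need to take rows from both files in lockstep, and the built-in zip() function is a good way to do that.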
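To make the effect visible, here is a minimal sketch that counts how many rows the inner loop actually sees for each row of the outer loop. The `SAMPLE` text and the function name `count_inner_rows` are made up for illustration, and `io.StringIO` stands in for the two opened files.

```python
import csv
import io

# Hypothetical in-memory sample standing in for two identical .csv files.
SAMPLE = "id;name\n1;alpha\n2;beta\n3;gamma\n"


def count_inner_rows(work_text, test_text):
    """Return, for each outer row, how many inner rows the nested loop visits."""
    work_reader = csv.DictReader(io.StringIO(work_text), delimiter=";")
    test_reader = csv.DictReader(io.StringIO(test_text), delimiter=";")
    visited = []
    for work_row in work_reader:
        inner_count = 0
        for test_row in test_reader:
            inner_count += 1
        visited.append(inner_count)
    return visited


nested_counts = count_inner_rows(SAMPLE, SAMPLE)
print(nested_counts)  # [3, 0, 0]
```

Only the first row of the first file is ever compared, and it is compared against all rows of the second file, which is why nearly every comparison reports a mismatch even for identical files.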

Do this instead:

    @staticmethod
    def compare_files(work_csv_file, test_csv_file):

        work_csv_obj = csv.DictReader(work_csv_file, delimiter=";")
        test_csv_obj = csv.DictReader(test_csv_file, delimiter=";")

#        for work_row in work_csv_obj:
#            for test_row in test_csv_obj:

        for work_row, test_row in zip(work_csv_obj, test_csv_obj):
            if work_row == test_row:
                print('ALL CLEAR')
                print(str(work_row))
                print(str(test_row))
                print()
            else:
                print("STRINGS DON'T MATCH")
                print(str(work_row))
                print(str(test_row))
                print()
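Since the question is about specific columns rather than whole rows, the same lockstep idea can be narrowed down. The following is only a sketch under assumptions: the function name `find_column_mismatches`, the column names and the sample data are hypothetical, and `itertools.zip_longest` is swapped in for `zip()` so that a file with fewer rows is reported instead of being silently truncated.

```python
import csv
import io
from itertools import zip_longest


def find_column_mismatches(work_file, test_file, columns, delimiter=";"):
    """Yield (data_row_number, column, work_value, test_value) for each difference."""
    work_reader = csv.DictReader(work_file, delimiter=delimiter)
    test_reader = csv.DictReader(test_file, delimiter=delimiter)
    # zip_longest pads the shorter file with None, so missing rows are reported too.
    paired_rows = zip_longest(work_reader, test_reader)
    for row_number, (work_row, test_row) in enumerate(paired_rows, start=1):
        for column in columns:
            work_value = work_row.get(column) if work_row is not None else None
            test_value = test_row.get(column) if test_row is not None else None
            if work_value != test_value:
                yield row_number, column, work_value, test_value


# Hypothetical sample data: row 2 differs in "price", and the test file lacks row 3.
work_text = "id;name;price\n1;alpha;10\n2;beta;20\n3;gamma;30\n"
test_text = "id;name;price\n1;alpha;10\n2;beta;25\n"

mismatches = list(
    find_column_mismatches(io.StringIO(work_text), io.StringIO(test_text), ["id", "price"])
)
for mismatch in mismatches:
    print(mismatch)
```

Because both readers are consumed one row at a time, this keeps memory use flat even for files of several hundred MB. Row numbers count data rows, not the header line.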

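By the way, even though it may not have caused any problems yet, I noticed that you are not opening the two files properly as shown in the csv.DictReader documentation: you left out the newline='' argument.

This is the right way: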

if __name__ == "__main__":
    csv_tester = CsvCompareTester()
    csv_tester.insert_file_paths()

#    with open(CsvCompareTester.work_csv_path) as work_file:
#        with open(CsvCompareTester.test_csv_path) as test_file:

    with open(CsvCompareTester.work_csv_path, newline='') as work_file:
        with open(CsvCompareTester.test_csv_path, newline='') as test_file:
            csv_tester.compare_files(work_file, test_file)
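As a small illustration of what newline='' changes, the sketch below writes a temporary file with a quoted field that contains an embedded `\r\n` and reads it back both ways. The helper `read_rows` and the sample content are assumptions made up for this example; the explicit `encoding="utf-8"` is only there to keep the demo independent of the platform default.

```python
import csv
import os
import tempfile


def read_rows(path, **open_kwargs):
    """Read every row of a ';'-delimited file, passing extra arguments to open()."""
    with open(path, encoding="utf-8", **open_kwargs) as handle:
        return list(csv.DictReader(handle, delimiter=";"))


# Write raw bytes so the embedded "\r\n" inside the quoted field is stored as-is.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b'id;note\r\n1;"first line\r\nsecond line"\r\n')
    sample_path = tmp.name

try:
    with_newline_arg = read_rows(sample_path, newline="")
    without_newline_arg = read_rows(sample_path)
finally:
    os.remove(sample_path)

print(repr(with_newline_arg[0]["note"]))     # 'first line\r\nsecond line'
print(repr(without_newline_arg[0]["note"]))  # 'first line\nsecond line'
```

Without newline='' the field is silently altered while reading. If only one of the two files is opened that way, otherwise identical rows can compare as different.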
