Question

I want to compare two CSV files. Their rows and columns may not be in the same order. For example:

CSV1：

CSV2：

We need to display output like the following:

I tried using sets, as follows:

def get_set(file):
    # initialize the set
    set_of_rows = set()

    # open the file; the with block closes it even if an error occurs
    with open(file) as fh:
        # get the headers
        headers = fh.readline().strip().split(',')

        # iterate over the lines and form the set
        for line in fh:
            # cells in a list
            cells = line.strip().split(',')

            # placeholder for the cells associated with the headers
            elements = []
            for header in sorted(headers):
                index = headers.index(header) # cannot use enumerate above
                elements.append("::".join([header, cells[index]]))

            # 'header1::a1', 'header2::b1', 'header3::c1', 'header4::d1'
            tuple_elements = tuple(elements)
            set_of_rows.add(tuple_elements)

    return set_of_rows

import pprint
pp = pprint.PrettyPrinter(4)

# create the sets
s1 = get_set('csv1.csv')
s2 = get_set('csv2.csv')

pp.pprint(s1)
print('#######')
pp.pprint(s2)
print('#######')

s = s1.union(s2)
print('Union')
pp.pprint(s)

print('#######')
print('Diff')
pp.pprint(s.difference(s1))

This lets me find the rows that differ, but I cannot pinpoint where they differ, and a single CSV file has close to 10^6 rows.
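One possible way to pinpoint the differing cells, sketched with only the standard library. It rests on an assumption the question does not state: that one column uniquely identifies each row. The function name `cell_diffs`, the key column `header1`, and the sample data are all hypothetical and only serve as illustration.

```python
import csv
import io


def cell_diffs(file_a, file_b, key):
    """Yield (key value, column, value in A, value in B) for differing cells.

    Assumes `key` names a column whose values are unique within each file.
    """
    # DictReader maps header -> cell, so column order does not matter;
    # indexing by the key column makes row order irrelevant as well.
    rows_a = {row[key]: row for row in csv.DictReader(file_a)}
    rows_b = {row[key]: row for row in csv.DictReader(file_b)}

    # Only keys present in both files can be compared cell by cell.
    for k in sorted(rows_a.keys() & rows_b.keys()):
        for column in sorted(rows_a[k]):
            if rows_a[k][column] != rows_b[k].get(column):
                yield k, column, rows_a[k][column], rows_b[k].get(column)


# Hypothetical sample data: same rows, shuffled columns, one changed cell.
CSV1 = "header1,header2,header3\na1,b1,c1\na2,b2,c2\n"
CSV2 = "header3,header1,header2\nc2,a2,bX\nc1,a1,b1\n"

diffs = list(cell_diffs(io.StringIO(CSV1), io.StringIO(CSV2), key="header1"))
print(diffs)
```

With real files, `open(path, newline="")` can be passed in place of the `io.StringIO` objects. Both files are held in memory as dictionaries, which should be manageable for about 10^6 rows but has not been measured here.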

Can we load the CSV files into a Pandas DataFrame and process them there?
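A minimal sketch of what the pandas route might look like, assuming both files have the same set of column names. The function name `diff_rows` and the sample data are hypothetical. It uses an outer `merge` on all columns with `indicator=True`, which marks each row as found in the left file only, the right file only, or both.

```python
import io

import pandas as pd


def diff_rows(csv_a, csv_b):
    """Return the rows that appear in only one of the two CSV inputs."""
    # Read every cell as a string so values are compared exactly as written.
    df_a = pd.read_csv(csv_a, dtype=str, keep_default_na=False)
    df_b = pd.read_csv(csv_b, dtype=str, keep_default_na=False)

    # Sorting the column names removes any dependence on column order.
    columns = sorted(df_a.columns)
    if columns != sorted(df_b.columns):
        raise ValueError("the two files do not have the same columns")

    # An outer merge on all columns ignores row order; the "_merge" column
    # records whether a row is "left_only", "right_only" or "both".
    merged = df_a[columns].merge(
        df_b[columns], how="outer", on=columns, indicator=True
    )
    return merged[merged["_merge"] != "both"].reset_index(drop=True)


# Hypothetical sample data: shuffled rows and columns, one changed cell.
CSV1 = "header1,header2,header3\na1,b1,c1\na2,b2,c2\na3,b3,c3\n"
CSV2 = "header3,header1,header2\nc3,a3,b3\nc1,a1,b1\nc2,a2,bX\n"

result = diff_rows(io.StringIO(CSV1), io.StringIO(CSV2))
print(result)
```

File paths such as 'csv1.csv' and 'csv2.csv' can be passed instead of the `io.StringIO` objects. This reports whole rows rather than individual cells, so narrowing down to the changed cell would still need a column that identifies each row.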

Thanks in advance.

Compare CSV files and show the differences side by side
