Question

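I have a Python script that scrapes a website and saves some data to a CSV file.

I run this script every week. Now I want to compare the CSVs from two different weeks and find out which rows have changed between the two files.

The data in the two CSVs is about 98% identical; only one or two rows are added or removed.

I haven't been able to find a proper solution. I tried using DictReader and comparing the contents, but without success.

Any pointers on how to solve this would help. I have also read that I could convert the files to sets and then do setA - setB.

In case it helps, here is the format of the CSVs.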

file1.csv

name,userId,location
aaa,abc,NYC
bbb,cdf,UCL

file2.csv

name,userId,location
bbb,cdf,UCL

As you can see, one row has been removed in file2.csv, so when I compare file1.csv and file2.csv I should be able to get the value aaa,abc,NYC.

Answer 1

Try the diff command:

diff file1.csv file2.csv

Depending on which operating system you are using, you may need to find a copy of diff that works on your system.
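
If no diff binary is at hand, the standard library's difflib module can produce similar output from inside the same Python script. Below is a minimal sketch, assuming both files fit comfortably in memory; the helper name changed_lines is made up for illustration, and the sample rows are the ones from the question, held in lists instead of being read from disk.

```python
import difflib

def changed_lines(old_lines, new_lines):
    """Return only the removed ('- ') and added ('+ ') lines of an ndiff."""
    return [d for d in difflib.ndiff(old_lines, new_lines)
            if d.startswith(('- ', '+ '))]

# Sample data from the question, one string per CSV line.
old = ['name,userId,location', 'aaa,abc,NYC', 'bbb,cdf,UCL']
new = ['name,userId,location', 'bbb,cdf,UCL']

for line in changed_lines(old, new):
    print(line)
```

For real files, something like open('file1.csv').read().splitlines() would supply the two lists. Unlike a set-based comparison, this keeps the original line order in the report.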

Answer 2

Yes, set difference works.

with open('file1.csv') as f, open('file2.csv') as g:
    old, new = set(f), set(g)
for added in new - old:
    print('added', added)
for deleted in old - new:
    print('deleted', deleted)
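
One caveat with comparing raw lines: the result is sensitive to line endings, so a final row without a trailing newline would not match the same row followed by a newline. A possible variation, sketched here under the assumption that the first line is always a header, parses the rows with the csv module and compares tuples instead. The helper name row_set is hypothetical, and the sample data is passed as inline strings only to keep the example self-contained.

```python
import csv
import io

def row_set(csv_text):
    """Parse CSV text and return its data rows (header skipped) as a set of tuples."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row
    return {tuple(row) for row in reader if row}  # ignore blank lines

old = row_set("name,userId,location\naaa,abc,NYC\nbbb,cdf,UCL\n")
new = row_set("name,userId,location\nbbb,cdf,UCL")  # note: no trailing newline

deleted = old - new
added = new - old
print('deleted', sorted(deleted))
print('added', sorted(added))
```

With files on disk, passing open('file1.csv', newline='') directly to csv.reader should work the same way.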

Answer 3

You can use the pandas library.

a.csv

name,userId,location
aaa,abc,NYC
bbb,cdf,UCL
ccc,dfg,LAC
ddd,fgh,SAC

b.csv

name,userId,location
bbb,cdf,UCL
ccc,dfg,LAC

Code:

import pandas as pd

a = pd.read_csv('a.csv')
b = pd.read_csv('b.csv')

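# For each column, mark the values of a that also occur in that column of b.
mask = a.isin(b.to_dict(orient='list'))
# Reverse the mask and remove null rows.
# Upside is that index of original rows that
# are now gone are preserved (see result).
# Note: the check is per column, so a row is reported only if none of
# its values occur in the matching column of b.
c = a[~mask].dropna()
print(c)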

Result:

  name userId location
0  aaa    abc      NYC
3  ddd    fgh      SAC
[Finished in 0.7s]

pandas has a performance advantage here because it relies on a combination of numpy, Cython and some raw C implementations.

Answer 4

If the files are very similar, you can compare them line by line.

I'm using the two sample files you posted.

# Read both files into lists of lines.
with open('file1.csv', 'r') as f, open('file2.csv', 'r') as g:
    file1, file2 = list(f), list(g)

# Report every line of file1 that does not appear in file2.
for line in file1:
    if line not in file2:
        print("Difference:", line.rstrip())

Output:

Difference: aaa,abc,NYC

Update

I like @Stefan Pochmann's solution; to me this would be the most elegant way:

with open('file1.csv','r') as f, open('file2.csv','r') as g:
    file1, file2 = set(f), set(g)
    print("Difference:", list(file1 - file2))
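
The line-based comparisons above report a modified row as one deletion plus one addition. If it matters to tell those apart, one option is to index the rows by a key column with csv.DictReader. This is only a sketch: it assumes userId is unique within each file, the helper names index_by and compare are hypothetical, and the week2 sample (with its LAX and ccc rows) is made up for illustration.

```python
import csv
import io

def index_by(csv_text, key):
    """Map each row's value in the `key` column to the full row dict."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(csv_text))}

def compare(old_text, new_text, key='userId'):
    """Return (added, removed, changed) rows, matched on the key column."""
    old, new = index_by(old_text, key), index_by(new_text, key)
    added = [new[k] for k in sorted(new.keys() - old.keys())]
    removed = [old[k] for k in sorted(old.keys() - new.keys())]
    changed = [(old[k], new[k])
               for k in sorted(old.keys() & new.keys()) if old[k] != new[k]]
    return added, removed, changed

week1 = "name,userId,location\naaa,abc,NYC\nbbb,cdf,UCL\n"
week2 = "name,userId,location\nbbb,cdf,LAX\nccc,dfg,LAC\n"  # made-up data

added, removed, changed = compare(week1, week2)
print('added', added)
print('removed', removed)
print('changed', changed)
```

Dictionary lookups keep this roughly linear in the number of rows, so a few thousand lines should not be a problem.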

How do I compare 2 CSV files with > 1000 rows and find the differences?

4 answers:
