I am trying to compare two CSV files to find the differences between them. The two files look like this:
['John'], ['Johnson'], ['1337@john-johnson.pro']
['Steve'], ['Stevens'], ['s.stevens@company.com']
['Sarah'], ['Stevens'], ['sarah.stevens@company.com']
and
['John'], ['Johnson'], ['1337@john-johnson.pro']
...
['Richard'], ['McBait'], ['ilovecats123@mail.mcbait.com']
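What I want is to compare the two files without having to create temporary files. The script should strip the characters [, ' and ], read the values, and then compare the two files against each other; the entries that differ represent the "new users".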
This is the (possibly wrong) logic I am using to solve the problem:
read the file -> execute subprocess (tr -d \[\]\') -> save output to file1_temp -> read the file1_temp -> convert to set -> compare (.difference) with file2_tmp
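So the question is: is there a faster way to solve this? For example, the way it could be done in Perl, using an if line regular-expression test to decide which data gets read.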
Answer 0 (score: 0)
Assuming that
['John'], ['Johnson'], ['1337@john-johnson.pro']
and
['John'], ['Johnson'], ['1337@john-johnson.pro']
count as the same record in your case, load each CSV file into a list (in memory) and take the delta of the two lists (using set).
file1.csv:
['John'], ['Johnson'], ['1337@john-johnson.pro']
['Steve'], ['Stevens'], ['s.stevens@company.com']
['Sarah'], ['Stevens'], ['sarah.stevens@company.com']
file2.csv:
['John'], ['Johnson'], ['1337@john-johnson.pro']
['Steve'], ['Stevens'], ['s.stevens@company.com']
['Richard'], ['McBait'], ['ilovecats123@mail.mcbait.com']
Here is the code:
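>>> import csv
>>> with open('file1.csv', newline='') as f:
...     reader = csv.reader(f)
...     # Build the list while the file is still open (map is lazy in Python 3)
...     list1 = list(map(tuple, reader))
...
>>> with open('file2.csv', newline='') as f:
...     reader = csv.reader(f)
...     list2 = list(map(tuple, reader))
...
>>> # Rows that are in file2.csv but not in file1.csv
>>> delta = list(set(list2) - set(list1))
>>> print(delta)
[("['Richard']", " ['McBait']", " ['ilovecats123@mail.mcbait.com']")]
>>> # Strip the surrounding whitespace, brackets and quotes from every field
>>> clean_delta = [tuple(x.strip().strip("[]'") for x in y) for y in delta]
>>> print(clean_delta)
[('Richard', 'McBait', 'ilovecats123@mail.mcbait.com')]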
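The question also asks whether the values could be picked out with a regular expression while reading, as one might do in Perl. A minimal sketch of that idea follows. It assumes every field has the form ['value'] as in the sample data and that the second file is the newer one. The names FIELD_RE, read_records and new_users are hypothetical helpers introduced here for illustration, not part of any library.

```python
import re

# Assumption: each field looks like ['value']; capture the text between [' and '].
FIELD_RE = re.compile(r"\['(.*?)'\]")


def read_records(path):
    """Return the set of records in path, one tuple of cleaned fields per line."""
    records = set()
    with open(path) as f:
        for line in f:
            fields = tuple(FIELD_RE.findall(line))
            if fields:  # skip blank lines and lines with no ['value'] field
                records.add(fields)
    return records


def new_users(old_path, new_path):
    """Return the records present in new_path but absent from old_path."""
    return read_records(new_path) - read_records(old_path)
```

With the two sample files from this answer, new_users('file1.csv', 'file2.csv') should return a set holding the single tuple ('Richard', 'McBait', 'ilovecats123@mail.mcbait.com'). Because the fields are cleaned as each line is read, no temporary file and no tr subprocess is needed.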