Question

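I'm new to Python. I want to write a script that compares two files and outputs the text that matches.

I want to compare file 1 with file 2. Both files contain one email address per line.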

Answer 1

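The key to a problem like this is to think about data rather than "files". What is a file? It's just an iterable of lines. So all you're asking is how to find every value from one iterable that is also in another iterable. That's easy.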

with open('file1') as f1, open('file2') as f2:
    matches = set(f1).intersection(f2)

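The set constructor takes any iterable, including a file, and builds a set from it.

The intersection method takes any iterable, including a file, and gives you every element that is in both the self set and that iterable. So in this case, it's every element that is in both the set of all lines of file1 and the iterable of all lines of file2.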

Here's an example showing that it works:

file1:

johndoe@example.com
marysmith@example.com
rowdyroddypiper@example.com

file2:

janesmith@example.com
rowdyroddypiper@example.com
jackjohnson@example.com
marysmith@example.com

Code:

>>> with open('file1') as f1, open('file2') as f2:
...     matches = set(f1).intersection(f2)
>>> matches
{'marysmith@example.com\n', 'rowdyroddypiper@example.com\n'}
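
Note that the matched lines above still carry their trailing newlines, so a final line without a newline would not match its counterpart. The following is a minimal sketch of one way to normalize lines before intersecting; the helper name matching_lines and the in-memory lists standing in for the two files are assumptions made for illustration.

```python
def matching_lines(lines1, lines2):
    """Return the stripped, non-empty lines that occur in both iterables."""
    first = {line.strip() for line in lines1}
    first.discard('')  # ignore blank lines
    return first.intersection(line.strip() for line in lines2)


# In-memory stand-ins for the two files; the last line of the first
# list deliberately lacks a trailing newline.
file1_lines = ['johndoe@example.com\n', 'marysmith@example.com\n',
               'rowdyroddypiper@example.com']
file2_lines = ['janesmith@example.com\n', 'rowdyroddypiper@example.com\n',
               'jackjohnson@example.com\n', 'marysmith@example.com\n']

for email in sorted(matching_lines(file1_lines, file2_lines)):
    print(email)
```

Because the function accepts any iterable of strings, two open file objects can be passed in place of the lists.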

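This does, of course, require reading all of file1 into memory. If that's not possible, the most efficient alternative is probably to sort both files offline and then iterate over them in lockstep.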
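
The lockstep iteration over two sorted inputs might look like the sketch below. It assumes both inputs are already sorted in ascending order using the same ordering as Python's string comparison (for example, by an external sort run in the C locale) and that lines are terminated consistently; the name sorted_intersection is hypothetical.

```python
def sorted_intersection(sorted1, sorted2):
    """Yield values common to two ascending iterables, reading each once."""
    it1, it2 = iter(sorted1), iter(sorted2)
    try:
        a, b = next(it1), next(it2)
        while True:
            if a < b:
                a = next(it1)    # a is too small to match; advance input 1
            elif a > b:
                b = next(it2)    # b is too small to match; advance input 2
            else:
                yield a          # present in both inputs
                a, b = next(it1), next(it2)
    except StopIteration:
        return                   # either input is exhausted: no more matches
```

Only one line from each input is held in memory at a time. A value that is duplicated in both inputs is yielded once per matching pair, so deduplicate while sorting if that matters.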

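But a simpler solution is to use a dbm (with meaningless values) as an on-disk set. Here's an implementation I slapped together. It requires Python 3.3+, probably has problems on Windows, handles only str elements, and supports only the minimal API of collections.abc.MutableSet plus intersection; if you need older versions, portability, different key types, better error handling, etc., it might take more than 3 minutes. Anyway: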

>>> import dset
>>> with open('file1') as f1, open('file2') as f2:
...     matches = dset.DiskSet(f1).intersection(f2)
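
For readers without the dset module, the same idea can be sketched directly on top of the standard library's dbm module. This is not the author's implementation: the function name disk_intersection, the temporary database location, and the placeholder value b'1' are all assumptions, and which dbm backend gets used depends on the platform.

```python
import dbm
import os
import tempfile


def disk_intersection(lines1, lines2):
    """Yield each distinct stripped line of lines2 that also occurs in lines1."""
    with tempfile.TemporaryDirectory() as tmpdir:
        # 'n' always creates a new, empty database.
        with dbm.open(os.path.join(tmpdir, 'seen'), 'n') as db:
            for line in lines1:
                # Only the keys matter; the value is a meaningless placeholder.
                db[line.strip().encode('utf-8')] = b'1'
            for line in lines2:
                key = line.strip().encode('utf-8')
                if key in db:
                    del db[key]  # report each match only once
                    yield key.decode('utf-8')
```

Keys are encoded to bytes explicitly because dbm stores keys and values as bytes. Matches are yielded in the order they appear in the second input.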

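For mid-sized files, any on-disk solution will obviously be noticeably slower, but once you get to gigantic files that don't fit in memory, or, worse, that fit only by throwing your whole computer into swap hell, it clearly wins.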

Answer 2

To find the differences:

>>> from difflib import ndiff
>>> with open('1') as f1, open('2') as f2:
...     diff = ndiff(f1.readlines(), f2.readlines())
>>> print(''.join(diff), end='')

To show the inverse of the diff, that is, the lines the two files share, just add an if:

$ cat /tmp/1 
hello@world.net
goodbye@cruelworld.com
$ cat /tmp/2
hello@world.net
goodbye@cruelworld.com
hello-once-again@example.com

$ python
>>> with open('/tmp/1') as f1, open('/tmp/2') as f2:
...     diff = ndiff(f1.readlines(), f2.readlines())
>>> print(''.join([x for x in diff if x[0] not in '-+']), end='')
  hello@world.net
  goodbye@cruelworld.com
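
One caveat: ndiff also emits hint lines that start with '? ' when two lines are similar but not identical, and filtering on x[0] not in '-+' lets those through. A hedged sketch that keeps only lines ndiff marks as unchanged (prefix of two spaces) is shown here; the helper name common_lines is hypothetical. Note also that ndiff compares sequences, so a line counts as common only when it appears in a compatible order in both files, unlike the set-based approach in Answer 1.

```python
from difflib import ndiff


def common_lines(lines1, lines2):
    """Return the lines that ndiff reports as unchanged between two sequences."""
    # Unchanged lines carry a two-space prefix; '- ', '+ ' and '? ' are skipped.
    return [x[2:] for x in ndiff(lines1, lines2) if x.startswith('  ')]


# The same data as /tmp/1 and /tmp/2 above, as in-memory lists.
first = ['hello@world.net\n', 'goodbye@cruelworld.com\n']
second = ['hello@world.net\n', 'goodbye@cruelworld.com\n',
          'hello-once-again@example.com\n']
print(''.join(common_lines(first, second)), end='')
```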

python - Comparing file A with file B
