Question

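I tried searching online for an answer, but unfortunately without success. So I am asking here:

I want to find out whether all lines in file1 are present in file2. Fortunately, I can compare whole lines rather than individual words etc. Unfortunately, I am dealing with GB-sized files, so some of the basic solutions I tried gave me memory errors.

Currently I have the following code, which does not work. Some guidance would be much appreciated.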

import mmap

# Checks if all lines in file1 are present in file2
def isFile1SubsetOfFile2(file1, file2):
    with open(file1, "r") as f1:
        for line1 in f1:
            # file2 is reopened and remapped for every line of file1
            with open(file2, "rb") as f:
                mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
                # find() is a substring search over the whole of file2
                result = mm.find(line1.strip().encode())
                mm.close()
                if result == -1:
                    return False
    return True

Example file2:

This is line1.
This is line2.
This is line3.
This is line4.
This is line5.
This is line6.
This is line7.
This is line8.
This is line9.

It should pass if, for example, file1 is:

This is line4.
This is line5.

It should fail if, for example, file1 is:

This is line4.
This is line10.

Edit: I have just added a working version of my code for the benefit of others. No memory errors, but it is slow.

Answer 1

I'm not sure why it doesn't work, but I think I know how to solve it:

def is_subset_of(file1, file2):
    with open(file1, 'r') as f1, open(file2, 'r') as f2:
        for line in f1:
            line = line.strip()
            f2.seek(0)   # go to the start of f2
            if line not in (line2.strip() for line2 in f2):
                return False
    return True

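This avoids opening the second file multiple times; it simply seeks back to the start for each line of file1, and at any moment you only hold 2 lines in memory. That should be very memory-friendly.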

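Another way (possibly faster) would be to sort file1 and file2. You can then compare them line by line, moving to the next line in file2 whenever its current line is lexically smaller than the current line from file1. That runs in O(n*log(n)) instead of O(n**2). However, it is more complicated, and I don't know whether sorting GB files makes sense (it may use too much memory!).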
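The comparison step on sorted input could look roughly like the sketch below. It assumes both inputs are already sorted in the order Python's `<` uses for strings; the name `is_sorted_subset` is made up for illustration, and the sorting itself is not shown.

```python
def is_sorted_subset(sorted1, sorted2):
    """Return True if every line of sorted1 also occurs in sorted2.

    Both arguments are iterables of lines (for example open files)
    that are already sorted in ascending order.
    """
    it2 = iter(sorted2)
    current = next(it2, None)
    for line in sorted1:
        line = line.rstrip("\n")
        # Skip lines of the second input that sort before the wanted line.
        while current is not None and current.rstrip("\n") < line:
            current = next(it2, None)
        # Either the second input is exhausted or it has moved past the line.
        if current is None or current.rstrip("\n") != line:
            return False
    return True
```

Each input is read only once, so the comparison itself is linear; the O(n*log(n)) cost comes from the sort that has to happen beforehand.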

Answer 2

Handling files that do not fit in memory is always hard.

If file1 fits in memory but file2 is too large, here is a solution:

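# file1 and file2 are open file-like objects
unseen = {line.rstrip("\n") for line in file1}
for line in file2:
    unseen.discard(line.rstrip("\n"))  # discard() never raises, unlike set.remove
    if not unseen:
        break  # every line of file1 has been found
# if unseen is empty, all lines were found in file2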

Otherwise, you should first sort at least one of the files (or perhaps CFBS sort).
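One way to sort a file that does not fit in memory is an external merge sort: sort fixed-size chunks, write each to a temporary file, then merge the sorted runs. The sketch below is only an illustration under that assumption; `sorted_lines` and `chunk_lines` are made-up names, and a very large file with a small chunk size could hit the operating system's limit on open files.

```python
import heapq
import itertools
import tempfile

def sorted_lines(path, chunk_lines=1_000_000):
    """Yield the lines of the file at path in sorted order, using bounded memory."""
    runs = []
    with open(path, "r") as src:
        while True:
            # Read at most chunk_lines lines into memory.
            chunk = list(itertools.islice(src, chunk_lines))
            if not chunk:
                break
            # Make sure the last line ends with a newline so runs stay line-aligned.
            if not chunk[-1].endswith("\n"):
                chunk[-1] += "\n"
            chunk.sort()
            # Store the sorted chunk as one run in a temporary file.
            run = tempfile.TemporaryFile("w+")
            run.writelines(chunk)
            run.seek(0)
            runs.append(run)
    try:
        # heapq.merge lazily merges the sorted runs, one line per run in memory.
        yield from heapq.merge(*runs)
    finally:
        for run in runs:
            run.close()
```

With both files available as sorted streams, the subset check can be done in a single pass over each of them.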

Test whether the lines in file1 are in file2
