Question

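Say I have a large file, "done.txt".

Then I have another large file, "post.txt".

I want to remove from post.txt all the lines that are already in done.txt.

I don't want to load the entire contents of done.txt into memory. How can I do this?

100% accuracy is not important.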

Answer 1

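Since 100% accuracy isn't required, you can hash every line in done.txt and keep just those hashes in memory in a collection (array, list, and so on).

Then process each line in post.txt. If the line's hash matches one of the hashes you already have, discard the line.

There will be false positives (lines thrown away even though they weren't in done.txt, because of hash collisions), but no false negatives.

Something like: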

    hash = []
    for each line in done.txt:
        hashVal = makeHash (line)
        hash[hashVal] = true
    for each line in post.txt:
        hashVal = makeHash (line)
        if not defined hash[hashVal]:
            print line
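
The pseudocode above could be sketched in Python roughly as follows. This is only an illustration under some assumptions: `line_hash` and `filter_new_lines` are hypothetical names, the 8-byte BLAKE2b digest is an arbitrary choice of hash, and lines are compared without their trailing line terminator.

```python
import hashlib


def line_hash(line: bytes) -> bytes:
    """Hypothetical helper: 8-byte digest of a line, ignoring its line terminator."""
    return hashlib.blake2b(line.rstrip(b"\r\n"), digest_size=8).digest()


def filter_new_lines(done_path: str, post_path: str, out_path: str) -> None:
    """Write to out_path the lines of post_path whose hash does not occur in done_path."""
    # Stream done.txt once; only the digests are kept in memory, not the lines.
    seen = set()
    with open(done_path, "rb") as done:
        for line in done:
            seen.add(line_hash(line))

    # Stream post.txt and keep a line only if its digest was never seen.
    # A hash collision would wrongly drop a line (a false positive).
    with open(post_path, "rb") as post, open(out_path, "wb") as out:
        for line in post:
            if line_hash(line) not in seen:
                out.write(line)
```

Calling `filter_new_lines("done.txt", "post.txt", "post.filtered.txt")` would leave the input files untouched and write the surviving lines to a new file (the output name here is made up). A shorter digest saves memory but makes collisions, and therefore wrongly discarded lines, more likely.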

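Alternatively, if you want 100% accuracy with minimal memory use, keep the hashes along with a collection of file offsets for each hash.

If a line in post.txt doesn't match any hash, it cannot be a duplicate, so you keep it.

If it does match a hash, it is only a possible duplicate. You can then use the file offset or offsets stored for that hash entry to compare the line being tested, byte for byte, against the corresponding lines in done.txt (by reading the actual lines). If a match is found there, it's a dupe and you throw the line away; otherwise you keep it.

This reduces the memory needed to the hashes and the line-offset collections plus, at most, one line from done.txt at a time (apart from the lines from post.txt, of course, but those are needed anyway, one at a time), at the cost of some extra I/O.

However, since I'm not a fan of "less than 100% accuracy", this is the way I'd probably go.

It would be something like: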

    hash = []
    fileOffset = 0
    for each line in done.txt:
        hashVal = makeHash (line)
        if not defined hash[hashVal]:
            hash[hashVal] = new list ()
        hash[hashVal].append (fileOffset)
        # length must include the line terminator so offsets stay aligned
        fileOffset = fileOffset + line.length ()
    for each line in post.txt:
        hashVal = makeHash (line)
        printIt = true
        if defined hash[hashVal]:
            for each offset in hash[hashVal]:
                read chkLine from done.txt starting at offset
                if line == chkLine:
                    printIt = false
        if printIt:
            print line
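
A possible Python rendering of the offset-based variant is sketched below. The names `line_hash` and `filter_new_lines_exact` are hypothetical, the 8-byte BLAKE2b digest is an arbitrary choice, and the files are opened in binary mode so that `len(line)` counts the line terminator and the stored offsets stay exact.

```python
import hashlib
from collections import defaultdict


def line_hash(line: bytes) -> bytes:
    """Hypothetical helper: 8-byte digest of a line, ignoring its line terminator."""
    return hashlib.blake2b(line.rstrip(b"\r\n"), digest_size=8).digest()


def filter_new_lines_exact(done_path: str, post_path: str, out_path: str) -> None:
    """Write to out_path the lines of post_path that do not occur in done_path."""
    # Pass 1: map each digest to the byte offsets of the done.txt lines producing it.
    offsets = defaultdict(list)
    with open(done_path, "rb") as done:
        offset = 0
        for line in done:
            offsets[line_hash(line)].append(offset)
            offset += len(line)  # binary mode: length includes the terminator

    # Pass 2: on a hash match, re-read the candidate line(s) from done.txt
    # and compare the actual bytes, so a collision cannot drop a new line.
    with open(done_path, "rb") as done, \
            open(post_path, "rb") as post, \
            open(out_path, "wb") as out:
        for line in post:
            content = line.rstrip(b"\r\n")
            duplicate = False
            for candidate in offsets.get(line_hash(line), ()):
                done.seek(candidate)
                if done.readline().rstrip(b"\r\n") == content:
                    duplicate = True
                    break
            if not duplicate:
                out.write(line)
```

Compared with the pseudocode, this sketch stops checking offsets as soon as one real match is found. The extra cost is one seek and one read in done.txt per candidate offset, which only happens for lines whose hash matches.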
