Get a list of words from a string that are not in another list

Question

I have a very long string called tekst (600 MB, read from a file) and a list of 11,000 words called nlwoorden. I want everything that is in tekst but not in nlwoorden.

belangrijk=[woord for woord in tekst.split() if woord not in nlwoorden]

produces exactly what I want. Obviously, it takes a very long time to compute. Is there a more efficient way?

Thanks!

Answer 1

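With a set-based solution, the difference itself takes time roughly linear in the sizes of the two sets, because each membership test is O(1) on average. Building the two sets takes another O(len(nlwoorden)) + O(len(tekst)).

So the snippet you are looking for is basically the one given in the comments: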

belangrijk=list(set(tekst.split()) - set(nlwoorden))

(assuming you want the result as a list in the end)
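Note that a set difference drops duplicates and the original word order, both of which the list comprehension in the question keeps. If that matters, one option is to convert only nlwoorden to a set and keep the comprehension. The following is a minimal sketch; the sample values and the name nlwoorden_set are made up for illustration.

```python
# Illustrative sample data (assumption): in the question, tekst is about 600 MB
# and nlwoorden has about 11,000 entries.
tekst = "de kat zit op de mat en de hond niet"
nlwoorden = ["de", "op", "en", "niet"]

# Build the set once so that each membership test is O(1) on average.
nlwoorden_set = set(nlwoorden)

# Same comprehension as in the question: word order and duplicates are preserved.
belangrijk = [woord for woord in tekst.split() if woord not in nlwoorden_set]
print(belangrijk)
```

The total cost is then about O(len(nlwoorden)) to build the set plus one lookup per word in tekst.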

Answer 2

I think the simplest way is to use sets. For example,

s = "This is a test"
s2 = ["This", "is", "another", "test"]
set(s.split()) - set(s2)

# returns {'a'}

However, given the size of your text, it may be worth using a generator to avoid holding everything in memory at once, for example,

import re

def itersplit(s, sep=None):
    exp = re.compile(r'\s+' if sep is None else re.escape(sep))
    pos = 0
    while True:
        m = exp.search(s, pos)
        if not m:
            if pos < len(s) or sep is not None:
                yield s[pos:]
            break
        if pos < m.start() or sep is not None:
            yield s[pos:m.start()]
        pos = m.end()

[word for word in itersplit(s) if word not in s2]

# returns ['a']
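For the common case of splitting on whitespace only, a shorter lazy splitter can probably be built on re.finditer, which yields matches one at a time. This is a sketch rather than a drop-in replacement for itersplit: the names iterwords and exclude are hypothetical, and it does not support a custom separator.

```python
import re

def iterwords(s):
    # re.finditer scans the string lazily and yields one match object at a time,
    # so no full list of words is ever built.
    for m in re.finditer(r'\S+', s):
        yield m.group()

s = "This is a test"
exclude = {"This", "is", "another", "test"}   # a set, so lookups are O(1) on average

result = [word for word in iterwords(s) if word not in exclude]
print(result)
```

Using a set for the excluded words matters more than the lazy splitting when the exclusion list is large, since a list would be scanned once per word.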

Answer 3

This snippet:

woord not in nlwoorden

takes O(N) each time it is evaluated, where N = len(nlwoorden).

So your list comprehension,

belangrijk=[woord for woord in tekst.split() if woord not in nlwoorden]

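takes O(N * M) time in total, where M = len(tekst.split()).

This is because nlwoorden is a list, not a set. To test for membership in an unordered list with the naive approach, you have to traverse the whole list in the worst case.

That is why your statement takes so long on a large input.

If you had a hash set instead, testing for membership would take constant time once the set has been constructed.

So, in prototype-ish code form, something like this: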

def words(fileobj):
    for line in fileobj:             # reads line by line, so a large file is buffered in chunks
        for word in line.split():
            yield word

# first, build the set of whitelisted words
wpath = 'whitelist.txt'
wset = set()
with open(wpath, mode='r') as w:     # text mode, so words are str rather than bytes
    for word in words(w):
        wset.add(word)

def emit(word):
    # output 'word' - to a list, to another file, to a pipe, etc
    print(word)

fpath = 'input.txt'
with open(fpath, mode='r') as f:
    for word in words(f):               # total run time - O(M) where M = number of words in f
        if word not in wset:            # testing for membership in a hash set - O(1)
            emit(word)
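To make the O(N) versus O(1) point above concrete, here is a small timing sketch. The word list is synthetic (a stand-in for nlwoorden), the probe word is chosen so that the list has to be scanned completely, and the actual numbers will vary by machine.

```python
import timeit

# Synthetic stand-in for nlwoorden (assumption): 11,000 distinct words.
words_list = ["w%d" % i for i in range(11000)]
words_set = set(words_list)

# A word that is not present forces the list lookup to scan every element.
probe = "not-in-there"

t_list = timeit.timeit(lambda: probe in words_list, number=200)
t_set = timeit.timeit(lambda: probe in words_set, number=200)
print("list: %.6f s, set: %.6f s for 200 lookups" % (t_list, t_set))
```

Both containers give the same answer; only the cost per lookup differs.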

Answer 4

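Read and process "woorden.txt" line by line

Load all of nlwoorden into a set (this is more efficient than a list).
Read the big file part by part, split each part, and write only the words that are not in nlwoorden to the result file.

Assuming your big 600 MB file has reasonably long lines (not one 600 MB line), I would do it like this: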

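# Load the words to exclude into a set for fast membership tests.
nlwoorden = set()
with open("nlwoorden.txt") as f:
    for line in f:
        nlwoorden.update(line.split())

# Stream the big file line by line; only one line is in memory at a time.
with open("woorden.txt") as f, open("out.txt", "w") as fo:
    for line in f:
        newwords = set(line.split())           # note: a set drops duplicates and order within the line
        newwords.difference_update(nlwoorden)
        fo.write(" ".join(newwords) + "\n")    # newline keeps words of adjacent lines apart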

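Conclusion

This solution does not use much memory, because you never read all of the data from "woorden.txt" at once.

If your file is not split into lines, you would have to change the way you read portions of the file. But I assume your file does have newlines.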
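If the file really has no newlines, one way to read it in portions is to use fixed-size chunks and carry a possibly incomplete trailing word over to the next chunk. The following is only a sketch under that assumption; the helper name words_in_chunks, the chunk size, and the sample data are all hypothetical.

```python
import io

def words_in_chunks(fileobj, chunk_size=1 << 20):
    """Yield whitespace-separated words, reading chunk_size characters at a time."""
    tail = ""                          # possibly incomplete word from the previous chunk
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:                  # end of file
            break
        chunk = tail + chunk
        parts = chunk.split()
        if parts and not chunk[-1].isspace():
            tail = parts.pop()         # last word may continue in the next chunk
        else:
            tail = ""
        for word in parts:
            yield word
    if tail:                           # whatever is left at end of file is a full word
        yield tail

# Tiny in-memory example; a real run would pass an open text file instead.
nlwoorden = {"de", "op"}
src = io.StringIO("de kat zit op de mat")
kept = [w for w in words_in_chunks(src, chunk_size=4) if w not in nlwoorden]
print(kept)
```

The small chunk_size in the example is deliberate, so that words are cut at chunk boundaries and the carry-over logic is exercised.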
