Question

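I have a file with a list of words, one word per line: filterlist.txt. A second file, text.txt, holds one giant string of text.

I want to find every instance of the words from filterlist.txt in text.txt and delete them.

This is what I have so far: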

text = open('ttext.txt').read().split()
filter_words = open('filterlist.txt').readline()

for line in text:
    for word in filter_words:
        if word == filter_words:
            text.remove(word)

Answer 1

Store the filter words in a set, iterate over the words in the line from ttext.txt, and keep only the words that are not in the set of filter words.

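with open('ttext.txt') as text, open('filterlist.txt') as filter_words:
    # One filter word per line; strip the trailing newline and keep the words in a set.
    st = set(map(str.rstrip, filter_words))
    # The text is a single giant line, so reading the first line is enough.
    txt = next(text).split()
    out = [word for word in txt if word not in st]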

If you want to ignore case and punctuation, you need to call lower on each line and strip the punctuation:

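from string import punctuation

with open('ttext.txt') as text, open('filterlist.txt') as filter_words:
    # Lowercase each filter word and strip trailing punctuation and the newline.
    st = set(word.lower().rstrip(punctuation + "\n") for word in filter_words)
    txt = next(text).lower().split()
    # Compare each word without its trailing punctuation so "apple," matches "apple".
    out = [word for word in txt if word.rstrip(punctuation) not in st]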

If ttext.txt has multiple lines, iterating with (word for line in text for word in line.split()) would be more memory-efficient.
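
For a multi-line ttext.txt, the generator expression above could be wrapped roughly like this. It is only a sketch: the function name iter_kept_words and its two path parameters are assumptions made for illustration, not part of the original answer.

```python
def iter_kept_words(text_path, filter_path):
    """Yield the words of text_path that are not listed in filter_path."""
    with open(filter_path) as filter_words:
        # One filter word per line, stored in a set for fast membership tests.
        st = set(map(str.rstrip, filter_words))
    with open(text_path) as text:
        # The file is read lazily, one line at a time, instead of all at once.
        words = (word for line in text for word in line.split())
        for word in words:
            if word not in st:
                yield word
```

Calling ' '.join(iter_kept_words('ttext.txt', 'filterlist.txt')) would rebuild the filtered text as a single string, at the cost of holding the result in memory.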

Answer 2

Using Padraic Cunningham's approach, I wrote it as a function:

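# Reconstructed sketch: the original code block is not available here, so the
# function name and parameter names are assumptions based on the description.
def remove_words(text_path, filter_set):
    with open(text_path) as text:
        return [word for line in text for word in line.split()
                if word not in filter_set]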

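It is important to pass a set, not a list, as the second argument. A lookup in a list is O(n), while a lookup in a set, like a dictionary lookup, is O(1) on average. For large files this makes a real difference.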
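
The difference can be checked with a rough timing sketch like the one below. The word lists are made up for the measurement, and keep is a hypothetical helper, not code from this answer. Actual timings depend on the machine, so none are quoted here.

```python
import timeit

# Made-up data: 5000 filter words, and 200 text words of which half are filtered.
filter_list = [str(i) for i in range(5000)]
filter_set = set(filter_list)
words = ["4999", "not-there"] * 100


def keep(words, filter_words):
    """Return the words that are not in filter_words (a list or a set)."""
    return [w for w in words if w not in filter_words]


# Same filtering work, once with list lookups and once with set lookups.
list_time = timeit.timeit(lambda: keep(words, filter_list), number=3)
set_time = timeit.timeit(lambda: keep(words, filter_set), number=3)
print("list: %.4fs  set: %.4fs" % (list_time, set_time))
```

Both calls return the same words; only the cost of each membership test changes.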

Answer 3

Suppose this is the content of your text.txt file: 'hello foo apple car water cat', and this is the content of your filterlist.txt file: apple car

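# Drop surrounding whitespace and the quotes, then split the text into words.
text = open('text.txt').read().strip().strip("'").split(' ')
filter_words = open('filterlist.txt').readline().split()
for i in filter_words:
    # list.remove() deletes only the first match, so repeat until none are left.
    while i in text:
        text.remove(i)
new_text = ' '.join(text)
print(new_text)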

The output will be:

hello foo water cat

Filter words in one text file using another text file?

3 answers: