Question

I am trying to use Python to remove lines from a text file that contain certain words and their variations (I hope that is the right word).

By variations I mean:

"Yay","yay",'“Yay','Yay”',"Yay;","Yay?","Yay’s","Yay's",'Yay!','Yay.',"Yay”;"

So I tried to do it manually with the following code:

infile1 = open("file1.txt",'r')
outfile1 = open("file2.txt",'w')

word_list = ["Yay","yay",'“Yay','Yay”',"Yay;","Yay?","Yay’s","Yay's",'Yay!','Yay.',"Yay”;"]

for line in infile1:
    tempList = line.split()
    if any((el in tempList for el in word_list)):
        continue
    else:
        outfile1.write(line)

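It doesn't work well: some of the words listed in word_list still appear in the output file. There are also many more word variations to consider (for example God, God!, book and its various forms, and so on).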

I would like to know whether there is a more efficient way to do this (regular expressions, perhaps!).

Edit 1:

Input: Sample.txt:

I want my book.

I need my books.

Why you need a book?

Let's go read.

Coming to library

I need to remove every line containing "book.", "books." or "book?" from the Sample.txt file.

Output: Fixed.txt:

Let's go read.

Coming to library

Note: the original corpus has about 60,000 lines.

Answer 1

You can set a flag for each line and keep or drop the line depending on the flag value, like this:

input_sample = [
    "I want my book.",
    "I need my books.",
    "Why you need a book?",
    "Let's go read.",
    "Coming to library"
]
words = ['book']
result = []
for line in input_sample:
    flag = 0    # stays 0 unless one of the words is found in this line
    for word in words:
        if word.lower() in line.lower():    # lowercase both sides so matching is case-insensitive
            flag = 1    # set on the first match
            break    # no need to check the remaining words for this line
    if flag == 0:
        result.append(line)    # keep the line only when no word matched

print(result)

This results in:

["Let's go read.", 'Coming to library']
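Note that a plain substring test such as `'book' in line` also matches longer words like "notebook" or "booking". Since the question mentions regular expressions, here is a minimal sketch of that route, not a definitive solution. The helper names `build_pattern`, `keep_lines` and `filter_file` are made up for illustration, and the assumption is that a "variation" means the base word, optionally followed by a plural or possessive ending, with any surrounding punctuation or quotes:

```python
import re

def build_pattern(stems):
    """Compile one case-insensitive regex matching any stem as a whole word.

    Assumption: a variation is the stem plus an optional "s", "'s" or "’s".
    Punctuation and quotes around the word are covered by the \\b boundaries.
    """
    alternation = "|".join(re.escape(stem) for stem in stems)
    return re.compile(rf"\b(?:{alternation})(?:s|['’]s)?\b", re.IGNORECASE)

def keep_lines(lines, stems):
    """Return only the lines that contain none of the stems."""
    pattern = build_pattern(stems)
    return [line for line in lines if not pattern.search(line)]

def filter_file(src_path, dst_path, stems):
    """Stream src_path line by line, writing non-matching lines to dst_path."""
    pattern = build_pattern(stems)
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not pattern.search(line):
                dst.write(line)

input_sample = [
    "I want my book.",
    "I need my books.",
    "Why you need a book?",
    "Let's go read.",
    "Coming to library",
]
print(keep_lines(input_sample, ["book"]))
# ["Let's go read.", 'Coming to library']
```

For the real files, something like `filter_file("Sample.txt", "Fixed.txt", ["book"])` should do. The pattern is compiled once and the file is read one line at a time, so 60,000 lines should not be a problem. If other endings matter (for example "booked"), the suffix group would need to be extended accordingly.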

How do I remove a set of words and their variations (or inflections) from a text file?
