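How do I remove a set of words and their variants (or inflections) from a text file?

Time: 2017-04-04 16:04:49

Tags: python python-3.x

I am trying to use Python to remove the lines of a text file that contain certain words or their variants (I hope "variants" is the right term).

By variants I mean: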

"Yay","yay",'“Yay','Yay”',"Yay;","Yay?","Yay’s","Yay's",'Yay!','Yay.',"Yay”;"

So I tried to do it manually with the following code:

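word_list = ["Yay","yay",'“Yay','Yay”',"Yay;","Yay?","Yay’s","Yay's",'Yay!','Yay.',"Yay”;"]

# open both files in one with-block so they are closed (and the output flushed) automatically
with open("file1.txt", 'r') as infile1, open("file2.txt", 'w') as outfile1:
    for line in infile1:
        tempList = line.split()
        # skip the line if any listed word appears as a whitespace-separated token
        if any(el in tempList for el in word_list):
            continue
        outfile1.write(line)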

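It does not work well: some of the words listed in word_list still appear in the output file. There are also many more word variants to consider (for example God, God!, book, books, and so on).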
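A minimal sketch of one likely reason lines slip through: `line.split()` keeps punctuation attached to each token, so a token matches only if that exact spelling is in the list. The helper name `has_exact_token` and the sample sentences are made up for illustration.

```python
# Hypothetical word list for illustration; it covers only some punctuation variants.
word_list = ["book", "books", "book.", "books.", "book?"]

def has_exact_token(line, words):
    """Return True if any whitespace-separated token equals a listed word exactly."""
    tokens = line.split()
    return any(word in tokens for word in words)

# "books," (trailing comma) is not in word_list, so this line is not detected.
missed = has_exact_token("I need my books, please.", word_list)
# "book." is in word_list, so this line is detected.
caught = has_exact_token("I want my book.", word_list)
print(missed, caught)
```

Every new punctuation mark or suffix would need its own entry in the list, which is why listing variants by hand does not scale.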

I was wondering whether there is a more efficient way to do this (regular expressions, perhaps!).

Edit 1:

Input: Sample.txt:

I want my book.

I need my books.

Why you need a book?

Let's go read.

Coming to library

I need to remove every line of Sample.txt that contains "book.", "books." or "book?".

Output: Fixed.txt:

Let's go read.

Coming to library

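Note: the original corpus has about 60,000 lines.

1 answer:

Answer 0: (score: 2)

You can set a flag for each line and keep or drop the line depending on the flag's value, like this: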

input_sample = [
    "I want my book.",
    "I need my books.",
    "Why you need a book?",
    "Let's go read.",
    "Coming to library"
]
words = ['book']
result = []
for line in input_sample:
    flag = 0    # becomes 1 as soon as any word matches this line
    for word in words:
        # lowercase both sides so the match is case-insensitive
        if word.lower() in line.lower():
            flag = 1    # first match found
            break    # no need to check the remaining words for this line
    if flag == 0:
        result.append(line)    # keep the line only when no word matched

print(result)

This results in:

["Let's go read.", 'Coming to library']
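Note that the substring test above also matches longer words that merely contain "book" (for instance "bookkeeper"). Since the question asks about regular expressions, here is a rough sketch of that alternative, not part of the original answer. It assumes the variants of interest are the base word plus an optional "s" or possessive "'s", surrounded by any punctuation; the helper names `build_pattern` and `filter_lines` and the extra "bookkeeper" sample line are made up for illustration.

```python
import re

def build_pattern(stems):
    """Compile one case-insensitive pattern for all stems (stems must be non-empty).

    Matches a stem, optionally followed by "s", "'s" or "’s", on word boundaries,
    so surrounding punctuation such as . ? ! ; or quotes does not matter.
    """
    alternatives = "|".join(re.escape(stem) for stem in stems)
    return re.compile(r"\b(?:" + alternatives + r")(?:s|['’]s)?\b", re.IGNORECASE)

def filter_lines(lines, stems):
    """Return only the lines that contain none of the stems or their variants."""
    pattern = build_pattern(stems)
    return [line for line in lines if not pattern.search(line)]

sample = [
    "I want my book.",
    "I need my books.",
    "Why you need a book?",
    "Let's go read.",
    "Coming to library",
    "He is a bookkeeper.",    # contains "book" only as part of a longer word
]
kept = filter_lines(sample, ["book"])
print(kept)
```

For a file of about 60,000 lines, the same `pattern.search(line)` check could be applied inside a `for line in infile` loop, writing only the non-matching lines to the output file. Irregular inflections would still need to be listed as separate stems.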