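I am trying to use Python to remove lines from a text file that contain certain words and their variants (I hope that is the right word).
By variants I mean: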
"Yay","yay",'“Yay','Yay”',"Yay;","Yay?","Yay’s","Yay's",'Yay!','Yay.',"Yay”;"
So I tried doing it manually with the following code:
infile1 = open("file1.txt",'r')
outfile1 = open("file2.txt",'w')
word_list = ["Yay","yay",'“Yay','Yay”',"Yay;","Yay?","Yay’s","Yay's",'Yay!','Yay.',"Yay”;"]
for line in infile1:
    tempList = line.split()
    if any((el in tempList for el in word_list)):
        continue
    else:
        outfile1.write(line)
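It does not work well: some of the words listed in word_list still appear in the output file. There are also many more word variants to consider (for example God, God!, and several forms of book).
I would like to know whether there is a more efficient way to do this (perhaps with regular expressions!).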
Edit 1:
Input: Sample.txt:
I want my book.
I need my books.
Why you need a book?
Let's go read.
Coming to library
I need to remove every line containing "book.", "books.", or "book?" from the sample.txt file.
Output: Fixed.txt:
Let's go read.
Coming to library
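Note: the original corpus has about 60,000 lines.
Answer 0 (score: 2)
You can set a flag for each line and keep or drop the line based on the flag's value, as follows: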
input_sample = [
    "I want my book.",
    "I need my books.",
    "Why you need a book?",
    "Let's go read.",
    "Coming to library"
]
words = ['book']
result = []
for line in input_sample:
    flag = 0  # will be used to check if match is found or not
    for word in words:
        if word.lower() in line.lower():  # converting both words and lines to lowercase so case is not a factor in matching
            flag = 1  # flag values set to 1 on the first match
            break  # exits the inner for-loop for no more words need to be checked and so next line can be checked
    if flag == 0:
        result.append(line)  # using lines when there is no match as if-matched, the value of flag would have been 1
print(result)
This results in:
["Let's go read.", 'Coming to library']
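Since the question mentions regular expressions, here is a minimal sketch of that route. It is not part of the code above, just one possible variation. The helper names build_pattern and drop_matching_lines are made up for illustration, and the assumption that a "variant" means the word itself, optionally followed by a plural "s" or a possessive "'s" / "’s", is mine; adjust the pattern if your corpus needs other forms. Unlike the plain substring check, the word boundaries should keep words such as "notebook" from counting as a match for "book".

```python
import re


def build_pattern(words):
    # Hypothetical helper: match any of the given words as a whole word,
    # optionally followed by "s", "'s" or "’s" (an assumed notion of "variant").
    alternatives = "|".join(re.escape(w) for w in words)
    return re.compile(rf"\b(?:{alternatives})(?:['’]?s)?\b", re.IGNORECASE)


def drop_matching_lines(lines, words):
    # Keep only the lines in which none of the words (or their variants) occur.
    pattern = build_pattern(words)
    return [line for line in lines if not pattern.search(line)]


input_sample = [
    "I want my book.",
    "I need my books.",
    "Why you need a book?",
    "Let's go read.",
    "Coming to library",
]

result = drop_matching_lines(input_sample, ["book"])
print(result)
```

Surrounding punctuation such as quotes, "?", "!" or ";" does not need to be listed, because the word boundary already separates it from the word. An open file object yields its lines when iterated, so passing the file in place of input_sample should work the same way for a corpus of 60,000 lines.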