I want to delete specific lines that contain a stop word or a matching string:
import nltk
from nltk import *
from nltk.tokenize import word_tokenize
import time
Mywords = 'hello', 'there', 'been'
#stopwords for matching in the sentences.
f = open('hello.txt','rU')
raw = f.read()
sent = word_tokenize(raw)
#tokenize the words.
from nltk.tokenize import wordpunct_tokenize
punct = wordpunct_tokenize(raw)
sent = sent_tokenize(raw)
length = len(sent)
print(length)
i = 0
while(i < length):
    i = i + 1
    time.sleep(2)
    #print(sent[i])
    if i < length:
        #print(sent[i])
        thisWord = (word_tokenize(sent[i]))
        for word in thisWord:
            if word in Mywords:
                #print(thisWord, word)
                print("yes: ", sent[i])
            else:
                print("No:", sent[i])
    else:
        print("End of Line")
Answer (score: 0):
You can't really delete lines from a file, but you can write every line that does not contain any of the stop words to another file.
The following script first takes the list of stop words and turns it into a set(). It then reads your input file one line at a time. For each line it builds a list of words with nltk.word_tokenize(), converts that list into a set, and intersects it with the set of stop words. If the intersection is not empty, at least one stop word is present, and the stop words that were found are displayed. If none are found, the line is written to the output.txt file:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Full list of stop words from nltk
#stop_words = set(stopwords.words('english'))
stop_words = set(['hello', 'there', 'been'])

with open('input.txt', 'r') as f_input, open('output.txt', 'w') as f_output:
    for line in f_input:
        line_words = set(word_tokenize(line))
        stop_words_present = line_words & stop_words
        if stop_words_present:
            print("Yes: '{}' contains {}".format(line.strip(), stop_words_present))   # Contains at least one stop word
        else:
            print("No:", line.strip())   # Contains no stop words
            f_output.write(line)
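As a quick check of the set-intersection idea (the sample sentences below are made up purely for illustration), you can try something like:

from nltk.tokenize import word_tokenize

stop_words = set(['hello', 'there', 'been'])

# Hypothetical sample lines, just to show the matching behaviour.
for line in ["Well hello, world.", "Nothing to see here."]:
    found = set(word_tokenize(line)) & stop_words
    print(line, "->", found)

The first line matches because word_tokenize() splits the comma off, leaving 'hello' as its own token; the second line gives an empty set. Note that the comparison is case-sensitive as written, so 'Hello' at the start of a sentence would not match 'hello'.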
Note that nltk ships with a full list of English stop words you can use instead; just switch to the commented-out line above. If the corpus is not found, you may need to install it first by running the following mini script:
import nltk
nltk.download()
This will display a download utility that lets you fetch stopwords as follows:
Select Corpora, scroll down to stopwords, and click the Download button.
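If you prefer to skip the interactive downloader, you can also pass the corpus name directly (a minimal sketch):

import nltk
nltk.download('stopwords')   # fetches only the stopwords corpus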