I have a .txt file containing four strings, each separated by a newline.
When I tokenize the file, it processes every line of data, which is perfect.
However, when I try to remove the stop words from the file, it only removes them from the last string.
I want to process everything in the file, not just the last sentence.
My code:
with open('example.txt') as fin:
    for tkn in fin:
        print(word_tokenize(tkn))

# STOP WORDS
stop_words = set(stopwords.words("english"))
words = word_tokenize(tkn)
stpWordsRemoved = []
for stp in words:
    if stp not in stop_words:
        stpWordsRemoved.append(stp)
print("STOP WORDS REMOVED: ", stpWordsRemoved)
Output:
['this', 'is', 'an', 'example', 'of', 'how', 'stop', 'words', 'are', 'utilized', 'in', 'natural', 'language', 'processing', '.']
[]
['drive', 'driver', 'driving', 'driven']
[]
['smile', 'smiling', 'smiled']
[]
['there', 'are', 'multiple', 'words', 'here', 'that', 'you', 'should', 'be', 'able', 'to', 'use', 'for', 'lemmas/synonyms', '.']
STOP WORDS REMOVED: ['multiple', 'words', 'able', 'use', 'lemmas/synonyms', '.']
As shown above, it only processes the last line.
Edit: the contents of my txt file:
this is an example of how stop words are utilized in natural language processing.
A driver goes on a drive while being driven mad. He is sick of driving.
smile smiling smiled
there are multiple words here that you should be able to use for lemmas/synonyms.
Answer 0 (score: 0)
Consider merging the stop-word removal into the read-line loop, like this:
import nltk
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))

with open("d:/example.txt") as the_file:
    for each_line in the_file:
        print(nltk.word_tokenize(each_line))
        words = nltk.word_tokenize(each_line)
        stp_words_removed = []
        for word in words:
            if word not in stop_words:
                stp_words_removed.append(word)
        print("STOP WORDS REMOVED: ", stp_words_removed)
From your description, it sounds like you only fed the last line to the stop-word remover. What I don't understand is that, if that were the case, you shouldn't be getting all those empty lists.
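For what it's worth, those empty lists most likely come from blank lines between the strings in the file: tokenizing a line that is only whitespace yields an empty token list, e.g.:

from nltk.tokenize import word_tokenize

print(word_tokenize("\n"))  # prints []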
Answer 1 (score: 0)
You need to collect the results of word_tokenize into a list and then process that list. In your example, you are just using the last line of the file after iterating over it.
Try:
words = []
with open('example.txt') as fin:
    for tkn in fin:
        if tkn.strip():  # skip blank lines
            # extend (not append) keeps words a flat list of tokens
            words.extend(word_tokenize(tkn))

# STOP WORDS
stop_words = set(stopwords.words("english"))
stpWordsRemoved = []
for stp in words:
    if stp not in stop_words:
        stpWordsRemoved.append(stp)
print("STOP WORDS REMOVED: ", stpWordsRemoved)