I have a .txt file containing four strings, each separated by a newline.
When I tokenize the file, it processes every line of data, which is perfect.
However, when I try to remove the stop words from the file, it only removes them from the last string.
I want to process everything in the file, not just the last sentence.
My code:
with open('example.txt') as fin:
    for tkn in fin:
        print(word_tokenize(tkn))

# STOP WORDS
stop_words = set(stopwords.words("english"))
words = word_tokenize(tkn)
stpWordsRemoved = []
for stp in words:
    if stp not in stop_words:
        stpWordsRemoved.append(stp)
print("STOP WORDS REMOVED: ", stpWordsRemoved)
Output:
['this', 'is', 'an', 'example', 'of', 'how', 'stop', 'words', 'are', 'utilized', 'in', 'natural', 'language', 'processing', '.']
[]
['drive', 'driver', 'driving', 'driven']
[]
['smile', 'smiling', 'smiled']
[]
['there', 'are', 'multiple', 'words', 'here', 'that', 'you', 'should', 'be', 'able', 'to', 'use', 'for', 'lemmas/synonyms', '.']
STOP WORDS REMOVED: ['multiple', 'words', 'able', 'use', 'lemmas/synonyms', '.']
As shown above, it only processes the last line.
Edit: the contents of my txt file:
this is an example of how stop words are utilized in natural language processing.
A driver goes on a drive while being driven mad. He is sick of driving.
smile smiling smiled
there are multiple words here that you should be able to use for lemmas/synonyms.
Answer 0 (score: 0)
Consider merging the stop-word removal into the read-line loop, like this:
import nltk
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))

with open("d:/example.txt") as the_file:
    for each_line in the_file:
        print(nltk.word_tokenize(each_line))
        words = nltk.word_tokenize(each_line)
        stp_words_removed = []
        for word in words:
            if word not in stop_words:
                stp_words_removed.append(word)
        print("STOP WORDS REMOVED: ", stp_words_removed)
From your description, it sounds like you only fed the last line to the stop-word remover. What I don't understand is that, if that were the case, you shouldn't be getting all those empty lists.
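For what it's worth, those empty lists most likely come from blank lines between the strings in the file: tokenizing a line that is only whitespace yields an empty token list, e.g.:

from nltk.tokenize import word_tokenize

print(word_tokenize("\n"))  # prints []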
Answer 1 (score: 0)
You need to collect the results of word_tokenize into a list and then process that list. In your example, you are just using the last line of the file after iterating over it.
Try:
words = []
with open('example.txt') as fin:
    for tkn in fin:
        if tkn.strip():  # skip blank lines
            # extend (not append) keeps words a flat list of tokens
            words.extend(word_tokenize(tkn))

# STOP WORDS
stop_words = set(stopwords.words("english"))
stpWordsRemoved = []
for stp in words:
    if stp not in stop_words:
        stpWordsRemoved.append(stp)
print("STOP WORDS REMOVED: ", stpWordsRemoved)