Question

Suppose I have a string text = "A compiler translates code from a source language". I want to do two things:

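I need to iterate over each word and stem it using the NLTK library. The stemming function is PorterStemmer().stem_word(word), which takes the word as its argument. How do I stem each word and get the stemmed sentence back?
I need to remove certain stop words from the text string. The list of stop words is stored in a text file (space-separated):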
```
stopwordsfile = open('c:/stopwordlist.txt','r+')
stopwordslist=stopwordsfile.read()
```
How do I remove those stop words from text and get a new, cleaned string?

Answer 1

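I posted this as a comment, but thought I might as well flesh it out into a full answer with some explanation:

You want to use str.split() to split the string into words, and then stem each word: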

from nltk.stem import PorterStemmer

for word in text.split(" "):
    PorterStemmer().stem_word(word)  # returns the stem; store or print it as needed

Since you want a single string of all the stemmed words, it's trivial to join the stems back together. To do this easily and efficiently, we use str.join() and a generator expression:

" ".join(PorterStemmer().stem_word(word) for word in text.split(" "))

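A minimal, self-contained sketch of the split/stem/join pattern above. The names stem_sentence and toy_stem are made up for illustration, and toy_stem is only a crude stand-in so the snippet runs without NLTK installed. With NLTK available you would pass a real stemmer's method instead (in recent NLTK releases that is, as far as I know, PorterStemmer().stem rather than stem_word).

```python
def toy_stem(word):
    """Hypothetical stand-in stemmer: lowercases and strips a trailing 's'.

    This is NOT the Porter algorithm; it only keeps the example runnable
    without NLTK.
    """
    word = word.lower()
    if len(word) > 3 and word.endswith("s"):
        return word[:-1]
    return word


def stem_sentence(text, stem=toy_stem):
    """Split text on whitespace, stem every word, and join the stems."""
    return " ".join(stem(word) for word in text.split())


text = "A compiler translates code from a source language"
print(stem_sentence(text))
```

Passing the stemmer in as a callable also means it is created once rather than once per word, which the one-liner above does not do.
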
Edit:

For your other problem:

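with open("/path/to/file.txt") as f:
    words = set(line.strip() for line in f)  # strip the trailing newline from each line

Here we use the with statement to open the file (this is the best way to open files, since it closes them correctly even when an exception is raised, and it is more readable) and read the contents into a set, stripping the trailing newline from each line so that later membership checks match. We use a set because we don't care about the order of the words or about duplicates, and it will be more efficient later. I'm assuming one word per line; if that isn't the case and the words are comma-separated or whitespace-separated, then using str.split() as we did earlier (with the appropriate argument) is probably a good plan.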
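Since the question says the stop-word file is space-separated, here is a rough sketch of the str.split() variant. The helper name load_stopwords and the sample file contents are invented for illustration; the real file lives wherever the asker keeps it.

```python
import os
import tempfile


def load_stopwords(path):
    """Read a file of stop words separated by any whitespace into a set."""
    with open(path, encoding="utf-8") as f:
        # str.split() with no argument splits on spaces, tabs and newlines,
        # so this handles both one-word-per-line and space-separated files.
        return set(f.read().split())


# Demo with a throwaway file; the contents are invented for illustration.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("a from the\nof")
    demo_path = tmp.name

stopwords = load_stopwords(demo_path)
os.remove(demo_path)
print(sorted(stopwords))
```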

stems = (PorterStemmer().stem_word(word) for word in text.split(" "))
" ".join(stem for stem in stems if stem not in words)

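Here we use the if clause of the generator expression to skip any stem that is in the set of words we loaded from the file. Membership checks on a set are O(1) on average, so this should be relatively efficient.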

Edit 2:

To remove the words before they are stemmed, it's even simpler:

" ".join(PorterStemmer().stem_word(word) for word in text.split(" ") if word not in words)

Removing the given words on their own is just:

filtered_words = [word for word in unfiltered_words if word not in set_of_words_to_filter]
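As a rough illustration of the filter applied to the question's sentence, the sketch below returns the cleaned string. The function name remove_stopwords and the two-word stop list are invented for the example, and the case-insensitive comparison is an assumption of this sketch (without it, the capitalised "A" would survive a list that only contains "a").

```python
def remove_stopwords(text, stopwords):
    """Return text with every word found in stopwords dropped.

    Comparison is case-insensitive (an assumption of this sketch), so "A"
    is removed when the stop-word set contains "a".
    """
    lowered = {w.lower() for w in stopwords}
    return " ".join(w for w in text.split() if w.lower() not in lowered)


text = "A compiler translates code from a source language"
stopwords = {"a", "from"}  # invented sample list
cleaned = remove_stopwords(text, stopwords)
print(cleaned)
```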

Answer 2

Go through each word in the string:

for word in text.split():
    PorterStemmer().stem_word(word)

Use the string's join method (as recommended by Lattyware) to join the pieces into one big string.

" ".join(PorterStemmer().stem_word(word) for word in text.split(" "))
