Question

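I am trying to use the NLTK WordNet Lemmatizer on tweets.

I want to remove every word that is not found in WordNet (Twitter handles and the like), but WordNetLemmatizer.lemmatize() gives no feedback. If it cannot find a word, it simply returns the word unchanged.

Is there a way to check whether a word is in WordNet?

Or is there a better way to remove anything other than "proper English words" from a string?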

Answer 1

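You can check wordnet.synsets(token), which returns an empty list for a word WordNet does not know. Be sure to handle punctuation first, then test whether the returned list is non-empty. Here is an example: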

from nltk.tokenize import WordPunctTokenizer
from nltk.corpus import wordnet

my_list_of_strings = []  # populate list before using

wpt = WordPunctTokenizer()
only_recognized_words = []

for s in my_list_of_strings:
    tokens = wpt.tokenize(s)
    if tokens:  # check if empty string
        for t in tokens:
            if wordnet.synsets(t):
                only_recognized_words.append(t)  # only keep recognized words
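
To tie the check back to the lemmatizer in the question, the two steps can be combined. The following is only a sketch: the helper names lemmatize_known and nltk_lemmatize_known are hypothetical, lowercasing each token before the lookup is an assumption, and the NLTK wiring assumes the WordNet corpus has already been downloaded.

```python
from typing import Callable, Iterable, List


def lemmatize_known(
    tokens: Iterable[str],
    is_known: Callable[[str], bool],
    lemmatize: Callable[[str], str],
) -> List[str]:
    """Lemmatize tokens, dropping any token the lookup does not recognize.

    Tokens are lowercased first (an assumption made for this sketch).
    """
    result = []
    for token in tokens:
        lowered = token.lower()
        if is_known(lowered):
            result.append(lemmatize(lowered))
    return result


def nltk_lemmatize_known(tokens: Iterable[str]) -> List[str]:
    """Wire the helper to NLTK; requires the WordNet corpus to be installed."""
    # Imported lazily so the generic helper above works without NLTK.
    from nltk.corpus import wordnet
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    return lemmatize_known(
        tokens,
        is_known=lambda t: bool(wordnet.synsets(t)),  # empty list means unknown
        lemmatize=lemmatizer.lemmatize,
    )
```

Passing the lookup and the lemmatizer in as callables keeps the filtering logic separate from NLTK, so it can be tried out with simple stand-ins before running it against the real corpus.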

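That said, you should really write some custom logic for Twitter data, in particular for hashtags, @ replies, usernames, links, retweets, and so on. Many published papers describe strategies you can draw on.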
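
As a rough illustration of such custom logic, the sketch below strips some common Twitter artifacts with regular expressions before any WordNet lookup. The patterns and the function name strip_twitter_artifacts are assumptions made for this example, not an established recipe, and real tweets will likely need broader rules (for instance, the mention pattern would also mangle email addresses).

```python
import re

# Illustrative patterns only; they are assumptions, not a complete treatment.
URL_RE = re.compile(r"https?://\S+|www\.\S+")  # links
RETWEET_RE = re.compile(r"^\s*RT\b:?\s*")      # leading "RT" retweet marker
MENTION_RE = re.compile(r"@\w+:?")             # @usernames, with optional colon
HASHTAG_RE = re.compile(r"#(\w+)")             # hashtags; the word itself is kept


def strip_twitter_artifacts(text: str) -> str:
    """Remove links, retweet markers and mentions; unwrap hashtags."""
    text = URL_RE.sub(" ", text)
    text = RETWEET_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(r"\1", text)  # "#NLP" becomes "NLP"
    return " ".join(text.split())       # collapse leftover whitespace


print(strip_twitter_artifacts("RT @nltk_org: Loving #NLP today! https://example.com/x"))
```

Keeping the hashtag text while dropping the "#" is a design choice: hashtags are often ordinary words that WordNet may recognize, whereas usernames and links rarely are.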

NLTK WordNet Lemmatizer - 如何删除未知单词？
