Question

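If you feed the word "US" (United States) into the WordNetLemmatizer from the nltk.stem package after preprocessing (which lowercases it to "us"), it is lemmatized to "u". For example: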

from nltk.stem import WordNetLemmatizer
lmtzr = WordNetLemmatizer()
word = "US".lower()  #  "US" becomes "us"
lemma = lmtzr.lemmatize(word)
print(lemma)  # prints "u"

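I even tried lemmatizing the word together with its POS tag, which is 'NNP' according to the pos_tag() function from the nltk package. But 'NNP' corresponds to wordnet.NOUN, which is already the lemmatizer's default when it processes a word. So lmtzr.lemmatize(word) and lmtzr.lemmatize(word, wordnet.NOUN) are equivalent (where wordnet is imported from the nltk.stem.wordnet package).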

Apart from the clumsy workaround of using an if statement to keep the word "us" out of the lemmatizer, are there any other ideas on how to solve this problem?

Answer 1

If you look at the source code of WordNetLemmatizer:

def lemmatize(self, word, pos=NOUN):
    lemmas = wordnet._morphy(word, pos)
    return min(lemmas, key=len) if lemmas else word

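For the word "us", wordnet._morphy returns ['us', 'u'].

min(lemmas, key=len) then returns the shortest candidate, "u".

For nouns, wordnet._morphy applies a rule that replaces a trailing "s" with "" (the empty string).

Here is the full list of substitutions: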

[('s', ''), ('ses', 's'), ('ves', 'f'), ('xes', 'x'), ('zes', 'z'), ('ches', 'ch'), ('shes', 'sh'), ('men', 'man'), ('ies', 'y')]
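To make the effect of the first rule, ('s', ''), concrete, here is a simplified sketch of how candidate lemmas could be generated and why min(..., key=len) ends up picking "u". It is not NLTK's actual implementation: the function name candidate_lemmas and the tiny KNOWN_NOUNS set (a stand-in for the WordNet noun vocabulary) are assumptions made for illustration only.

```python
# Noun suffix substitutions quoted in the answer above.
NOUN_SUBSTITUTIONS = [
    ("s", ""), ("ses", "s"), ("ves", "f"), ("xes", "x"), ("zes", "z"),
    ("ches", "ch"), ("shes", "sh"), ("men", "man"), ("ies", "y"),
]

# Hypothetical stand-in for the WordNet noun vocabulary (assumption).
KNOWN_NOUNS = {"us", "u", "dog", "church", "wolf"}


def candidate_lemmas(word):
    """Return the word plus its suffix-substituted forms, keeping known nouns only."""
    forms = [word]
    for old, new in NOUN_SUBSTITUTIONS:
        if word.endswith(old):
            forms.append(word[: len(word) - len(old)] + new)
    # Keep only known forms, preserving order and dropping duplicates.
    known = []
    for form in forms:
        if form in KNOWN_NOUNS and form not in known:
            known.append(form)
    return known


lemmas = candidate_lemmas("us")
print(lemmas)                # ['us', 'u']
print(min(lemmas, key=len))  # u
```

Because both "us" and "u" are valid nouns, both survive the vocabulary check, and picking the shortest candidate discards the form that was already correct.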

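I don't see a very clean way out.

1) You could write a special rule that excludes all-uppercase words.

2) Or you could add the line

us us

to the file nltk_data/corpora/wordnet/noun.exc

3) You could write your own function that selects the longest candidate instead (which may give wrong results for other words):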

from nltk.corpus.reader.wordnet import NOUN
from nltk.corpus import wordnet
def lemmatize(word, pos=NOUN):
    lemmas = wordnet._morphy(word, pos)
    return max(lemmas, key=len) if lemmas else word
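Option 1 could be sketched roughly as follows. The helper name lemmatize_token and the "two or more uppercase letters means acronym" heuristic are assumptions, not part of NLTK. The check has to run on the original token, before it is lowercased, and the lemmatizer is passed in as a callable so that lmtzr.lemmatize can be supplied; a toy stand-in is used here only to keep the example self-contained.

```python
def lemmatize_token(token, lemmatize):
    """Lemmatize a raw token, leaving all-uppercase tokens such as "US" untouched.

    `lemmatize` is any callable that takes a lowercase word, for example the
    `lemmatize` method of a WordNetLemmatizer instance.
    """
    # Heuristic (assumption): all-caps alphabetic tokens of length >= 2 are acronyms.
    if len(token) > 1 and token.isalpha() and token.isupper():
        return token
    return lemmatize(token.lower())


def toy_lemmatize(word):
    """Toy stand-in for lmtzr.lemmatize: strips a single trailing "s"."""
    return word[:-1] if word.endswith("s") else word


print(lemmatize_token("US", toy_lemmatize))    # US
print(lemmatize_token("Dogs", toy_lemmatize))  # dog
```

With a real lemmatizer the call would look like lemmatize_token("US", lmtzr.lemmatize). Note that a lowercase "us" in the input is still passed to the lemmatizer, so this guard only helps when the original casing is available.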

NLTK WordNetLemmatizer processes "US" as "u"
