from stemming.porter2 import stem
documents = ['got', 'get']
documents = [[stem(word) for word in sentence.split(" ")] for sentence in documents]
print(documents)
The result is:
[['got'], ['get']]
Can someone help explain this? Thanks!
Answer 0 (score: 2)
What you want is a lemmatizer, not a stemmer. The difference is subtle.
Generally, a stemmer removes suffixes as aggressively as it can and, in some cases, consults a list of exception words whose normalized form cannot be found by simply stripping a suffix.
A lemmatizer tries to find the "base"/root/infinitive form of a word, and it usually requires specialized rules for each language.
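To make the distinction concrete, here is a minimal sketch contrasting the two approaches. The tiny exception table and the `toy_lemmatize` helper are illustrative only, not part of any library; real lemmatizers such as WordNet's Morphy use far larger tables plus morphological rules.

```python
from nltk.stem import PorterStemmer  # pure-Python, no corpus download needed

# A toy "exception list" of the kind a lemmatizer consults for irregular forms.
# (Illustrative only -- real lemmatizers have much larger tables.)
IRREGULAR_VERBS = {'got': 'get', 'went': 'go', 'was': 'be'}

def toy_lemmatize(word):
    """Look up irregular forms first; otherwise leave the word alone."""
    return IRREGULAR_VERBS.get(word, word)

stemmer = PorterStemmer()
for word in ['got', 'running', 'cats']:
    print(word, '->', stemmer.stem(word), '|', toy_lemmatize(word))
```

A stemmer only strips suffixes, so `got` survives unchanged (`got -> got`), while `running -> run` and `cats -> cat`; it is the exception lookup that maps `got` to `get`.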
See the related discussion of the difference between stemming and lemmatization.
Lemmatization with the NLTK implementation of the Morphy lemmatizer requires the correct part-of-speech (POS) tag to be fairly accurate.
Avoid (or in fact never) trying to lemmatize individual words in isolation. Try lemmatizing a fully POS-tagged sentence, e.g.
from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet as wn

def penn2morphy(penntag, returnNone=False, default_to_noun=False):
    """Convert a Penn Treebank POS tag to a WordNet (Morphy) POS tag."""
    morphy_tag = {'NN': wn.NOUN, 'JJ': wn.ADJ,
                  'VB': wn.VERB, 'RB': wn.ADV}
    try:
        return morphy_tag[penntag[:2]]
    except KeyError:
        if returnNone:
            return None
        elif default_to_noun:
            return 'n'
        else:
            return ''
Using the penn2morphy helper function, convert the POS tags from pos_tag() into Morphy tags, and then you can:
>>> from nltk.stem import WordNetLemmatizer
>>> wnl = WordNetLemmatizer()
>>> sent = "He got up in bed at 8am."
>>> [(token, penn2morphy(tag)) for token, tag in pos_tag(word_tokenize(sent))]
[('He', ''), ('got', 'v'), ('up', ''), ('in', ''), ('bed', 'n'), ('at', ''), ('8am', ''), ('.', '')]
>>> [wnl.lemmatize(token, pos=penn2morphy(tag, default_to_noun=True)) for token, tag in pos_tag(word_tokenize(sent))]
['He', 'get', 'up', 'in', 'bed', 'at', '8am', '.']
For convenience, you can also try the pywsd lemmatizer.
>>> from pywsd.utils import lemmatize_sentence
Warming up PyWSD (takes ~10 secs)... took 7.196984529495239 secs.
>>> sent = "He got up in bed at 8am."
>>> lemmatize_sentence(sent)
['he', 'get', 'up', 'in', 'bed', 'at', '8am', '.']