I have code that does POS tagging with NLTK's averaged perceptron tagger:
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from nltk.tokenize import word_tokenize
string = 'dogs runs fast'
tokens = word_tokenize(string)
tokensPOS = pos_tag(tokens)
print(tokensPOS)
Result:
[('dogs', 'NNS'), ('runs', 'VBZ'), ('fast', 'RB')]
I tried to loop through the tagged tokens and lemmatize each one with the WordNet lemmatizer:
lemmatizedWords = []
for w in tokensPOS:
    lemmatizedWords.append(WordNetLemmatizer().lemmatize(w))
print(lemmatizedWords)
This results in an error:
Traceback (most recent call last):
  File "<ipython-input-30-462d7c3bdbb7>", line 15, in <module>
    lemmatizedWords = WordNetLemmatizer().lemmatize(w)
  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\stem\wordnet.py", line 40, in lemmatize
    lemmas = wordnet._morphy(word, pos)
  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1712, in _morphy
    forms = apply_rules([form])
  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1692, in apply_rules
    for form in forms
  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1694, in <listcomp>
    if form.endswith(old)]
AttributeError: 'tuple' object has no attribute 'endswith'
I think I have two problems: how do I loop through the tagged tokens correctly so I don't get this error (I can't reproduce it outside of the code above), and how do I carry the POS tags into the lemmatization step to avoid errors like these?
Answer 0 (score: 2)
The Python interpreter is telling you clearly:
AttributeError: 'tuple' object has no attribute 'endswith'
tokensPOS is a list of tuples, so you can't pass its elements directly to the lemmatize() method (look at the code of the class WordNetLemmatizer here). Only string objects have the method endswith(), so you need to pass the first element of each tuple from tokensPOS, like this:
lemmatizedWords = []
for w in tokensPOS:
    lemmatizedWords.append(WordNetLemmatizer().lemmatize(w[0]))
The method lemmatize() uses wordnet.NOUN as the default POS. Unfortunately, WordNet uses different tags than the other nltk corpora, so you have to translate them manually (as in the link you provided) and pass the appropriate tag as the second argument to lemmatize(). The full script, using the method get_wordnet_pos() from this answer:
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from nltk.tokenize import word_tokenize
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        # Fall back to the lemmatizer's default; returning '' would
        # make lemmatize() fail on unmapped tags
        return wordnet.NOUN
string = 'dogs runs fast'
tokens = word_tokenize(string)
tokensPOS = pos_tag(tokens)
print(tokensPOS)
lemmatizedWords = []
for w in tokensPOS:
    lemmatizedWords.append(WordNetLemmatizer().lemmatize(w[0], get_wordnet_pos(w[1])))
print(lemmatizedWords)