Converting Averaged Perceptron Tagger POS tags to WordNet POS tags and avoiding tuple errors

Date: 2017-06-28 16:08:42

Tags: python python-3.x nlp nltk pos-tagger

I have code that POS tags a string using NLTK's averaged perceptron tagger:

from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from nltk.tokenize import word_tokenize

string = 'dogs runs fast'

tokens = word_tokenize(string)  # split the string into word tokens
tokensPOS = pos_tag(tokens)     # tag each token with a Penn Treebank POS tag
print(tokensPOS)

Result:

[('dogs', 'NNS'), ('runs', 'VBZ'), ('fast', 'RB')]

I then tried to use the WordNet lemmatizer to loop through the tagged tokens and lemmatize each one:

lemmatizedWords = []
for w in tokensPOS:
    lemmatizedWords.append(WordNetLemmatizer().lemmatize(w))

print(lemmatizedWords)

This results in an error:

Traceback (most recent call last):

  File "<ipython-input-30-462d7c3bdbb7>", line 15, in <module>
    lemmatizedWords = WordNetLemmatizer().lemmatize(w)

  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\stem\wordnet.py", line 40, in lemmatize
    lemmas = wordnet._morphy(word, pos)

  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1712, in _morphy
    forms = apply_rules([form])

  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1692, in apply_rules
    for form in forms

  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1694, in <listcomp>
    if form.endswith(old)]

AttributeError: 'tuple' object has no attribute 'endswith'

I think I have two problems:

  1. The POS tags are not being converted to tags that WordNet understands (I tried implementing something similar to this answer, wordnet lemmatization and pos tagging in python, with no success)
  2. The data structure is not formed correctly for looping through each tuple (I can't find much on this error outside of os-related code)

How can I follow through with lemmatization after POS tagging to avoid these errors?

1 Answer:

Answer 0 (score: 2):

The Python interpreter is telling you exactly what is wrong:

AttributeError: 'tuple' object has no attribute 'endswith'
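
A minimal sketch (plain Python, no NLTK involved) reproduces the failure:

pair = ('dogs', 'NNS')       # a (token, tag) tuple like those in tokensPOS
print('dogs'.endswith('s'))  # True: str objects provide endswith()
print(pair.endswith('s'))    # AttributeError: tuples have no endswith() method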

tokensPOS is a list of tuples, so you can't pass its elements directly to the lemmatize() method (have a look at the code of the class WordNetLemmatizer here). Only objects of type str have an endswith() method, so you need to pass the first element of each tuple from tokensPOS, like this:

lemmatizedWords = []
for w in tokensPOS:
    lemmatizedWords.append(WordNetLemmatizer().lemmatize(w[0]))  # w[0] is the token string; w[1] is its POS tag
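
As an aside, building the lemmatizer once and reusing it avoids constructing a new WordNetLemmatizer on every iteration; a minimal sketch of the equivalent loop:

lemmatizer = WordNetLemmatizer()  # create one instance instead of one per token
lemmatizedWords = [lemmatizer.lemmatize(w[0]) for w in tokensPOS]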

The lemmatize() method uses wordnet.NOUN as the default POS. Unfortunately, WordNet uses different tags than the other nltk corpora, so you have to translate them manually (as in the link you provided) and pass the appropriate tag as the second argument to lemmatize(). Here is the complete script, using the method get_wordnet_pos() from this answer:

from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from nltk.tokenize import word_tokenize

def get_wordnet_pos(treebank_tag):

    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        # default to NOUN: lemmatize() needs a valid WordNet POS, and returning
        # an empty string here would raise a KeyError for tags such as 'DT'
        return wordnet.NOUN

string = 'dogs runs fast'

tokens = word_tokenize(string)
tokensPOS = pos_tag(tokens)
print(tokensPOS)

lemmatizedWords = []
for w in tokensPOS:
    lemmatizedWords.append(WordNetLemmatizer().lemmatize(w[0], get_wordnet_pos(w[1])))  # lemmatize with the mapped WordNet POS

print(lemmatizedWords)
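
With the example string 'dogs runs fast', the script should print something like:

[('dogs', 'NNS'), ('runs', 'VBZ'), ('fast', 'RB')]
['dog', 'run', 'fast']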