Converting Averaged Perceptron Tagger POS tags to WordNet POS tags and avoiding tuple errors

Date: 2017-06-28 16:08:42

Tags: python python-3.x nlp nltk pos-tagger

I have code that POS tags a string using NLTK's averaged perceptron tagger:

from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from nltk.tokenize import word_tokenize

string = 'dogs runs fast'

tokens = word_tokenize(string)  # split the string into word tokens
tokensPOS = pos_tag(tokens)     # tag each token with a Penn Treebank POS tag
print(tokensPOS)

Result:

[('dogs', 'NNS'), ('runs', 'VBZ'), ('fast', 'RB')]

I then tried to use the WordNet lemmatizer to loop through the tagged tokens and lemmatize each one:

lemmatizedWords = []
for w in tokensPOS:
    lemmatizedWords.append(WordNetLemmatizer().lemmatize(w))

print(lemmatizedWords)

This results in an error:

Traceback (most recent call last):

  File "<ipython-input-30-462d7c3bdbb7>", line 15, in <module>
    lemmatizedWords = WordNetLemmatizer().lemmatize(w)

  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\stem\wordnet.py", line 40, in lemmatize
    lemmas = wordnet._morphy(word, pos)

  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1712, in _morphy
    forms = apply_rules([form])

  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1692, in apply_rules
    for form in forms

  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1694, in <listcomp>
    if form.endswith(old)]

AttributeError: 'tuple' object has no attribute 'endswith'

I think I have two problems:

  1. The POS tags are not being converted to tags that WordNet understands (I tried implementing something similar to this answer, wordnet lemmatization and pos tagging in python, with no success)
  2. The data structure is not formed correctly for looping through each tuple (I can't find much on this error outside of os-related code)

How can I follow through with lemmatization after POS tagging to avoid these errors?

1 Answer:

Answer 0 (score: 2):

The Python interpreter is telling you exactly what is wrong:

AttributeError: 'tuple' object has no attribute 'endswith'
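
A minimal sketch (plain Python, no NLTK involved) reproduces the failure:

pair = ('dogs', 'NNS')       # a (token, tag) tuple like those in tokensPOS
print('dogs'.endswith('s'))  # True: str objects provide endswith()
print(pair.endswith('s'))    # AttributeError: tuples have no endswith() method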

tokensPOS is a list of tuples, so you can't pass its elements directly to the lemmatize() method (have a look at the code of the class WordNetLemmatizer here). Only objects of type str have an endswith() method, so you need to pass the first element of each tuple from tokensPOS, like this:

lemmatizedWords = []
for w in tokensPOS:
    lemmatizedWords.append(WordNetLemmatizer().lemmatize(w[0]))  # w[0] is the token string; w[1] is its POS tag
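
As an aside, building the lemmatizer once and reusing it avoids constructing a new WordNetLemmatizer on every iteration; a minimal sketch of the equivalent loop:

lemmatizer = WordNetLemmatizer()  # create one instance instead of one per token
lemmatizedWords = [lemmatizer.lemmatize(w[0]) for w in tokensPOS]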

The lemmatize() method uses wordnet.NOUN as the default POS. Unfortunately, WordNet uses different tags than the other nltk corpora, so you have to translate them manually (as in the link you provided) and pass the appropriate tag as the second argument to lemmatize(). Here is the complete script, using the method get_wordnet_pos() from this answer:

from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from nltk.tokenize import word_tokenize

def get_wordnet_pos(treebank_tag):

    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        # default to NOUN: lemmatize() needs a valid WordNet POS, and returning
        # an empty string here would raise a KeyError for tags such as 'DT'
        return wordnet.NOUN

string = 'dogs runs fast'

tokens = word_tokenize(string)
tokensPOS = pos_tag(tokens)
print(tokensPOS)

lemmatizedWords = []
for w in tokensPOS:
    lemmatizedWords.append(WordNetLemmatizer().lemmatize(w[0], get_wordnet_pos(w[1])))  # lemmatize with the mapped WordNet POS

print(lemmatizedWords)
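
With the example string 'dogs runs fast', the script should print something like:

[('dogs', 'NNS'), ('runs', 'VBZ'), ('fast', 'RB')]
['dog', 'run', 'fast']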