I've been trying to implement a simple POS tagger using HMMs and came up with the following code.
import nltk
from nltk.corpus import treebank
# Train on the first 3000 tagged sentences of the Penn Treebank sample
train_data = treebank.tagged_sents()[:3000]
print(train_data[0])
# [(u'Pierre', u'NNP'), (u'Vinken', u'NNP'), (u',', u','), (u'61', u'CD'), (u'years', u'NNS'), (u'old', u'JJ'), (u',', u','), ... ]
from nltk.tag import hmm
trainer = hmm.HiddenMarkovModelTrainer()
# Supervised training with the trainer's default settings
tagger = trainer.train_supervised(train_data)
print(tagger)
print(tagger.tag("Alex was born in Connecticut .".split()))
# [('Alex', u'NNP'), ('was', u'NNP'), ('born', u'NNP'), ('in', u'NNP'), ('Connecticut', u'NNP'), ('.', u'NNP')]
print(tagger.tag("Joe met Joanne in Delhi .".split()))
# [('Joe', u'NNP'), ('met', u'VBD'), ('Joanne', u'NNP'), ('in', u'IN'), ('Delhi', u'NNP'), ('.', u'NNP')]
print(tagger.tag("Chicago is the birthplace of Ginny".split()))
# [('Chicago', u'NNP'), ('is', u'VBZ'), ('the', u'DT'), ('birthplace', u'NNP'), ('of', u'NNP'), ('Ginny', u'NNP')]
As you can see, (many of) the tags are almost the same. Why is that? I thought the training set was large enough :| ...?
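One thing that might be worth checking is whether the words in my test sentences occur in the training data at all. My guess (not verified) is that a word the model has never seen could throw off the tags from that point on. Here is a minimal sketch of that check. The names `unseen_words` and `toy_train` are made up for illustration, and the toy sentences are hypothetical stand-ins for `treebank.tagged_sents()[:3000]`:

```python
# Hypothetical toy data standing in for treebank.tagged_sents()[:3000];
# these sentences are invented, not taken from the real corpus.
toy_train = [
    [("Joe", "NNP"), ("met", "VBD"), ("Joanne", "NNP"),
     ("in", "IN"), ("Delhi", "NNP"), (".", ".")],
    [("Chicago", "NNP"), ("is", "VBZ"), ("the", "DT"), ("capital", "NN"),
     ("of", "IN"), ("Illinois", "NNP"), (".", ".")],
]

def unseen_words(tagged_sents, tokens):
    """Return the tokens (in order) that never occur as a word in tagged_sents."""
    # Build the training vocabulary from the (word, tag) pairs.
    vocab = set(word for sent in tagged_sents for (word, _tag) in sent)
    return [tok for tok in tokens if tok not in vocab]

missing = unseen_words(toy_train, "Chicago is the birthplace of Ginny".split())
print(missing)
```

With the real data, the same call would be `unseen_words(train_data, "Alex was born in Connecticut .".split())`. If the first word of a sentence turns out to be missing from the training set, that would at least be consistent with every tag in that sentence coming out as NNP.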
Also, when I run tagger.evaluate(treebank.tagged_sents()[3000:])
Also posted here: