I'm getting started with NLTK and want to tag a Dutch sentence, but I'm having trouble specifying the corpus.
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
from nltk.corpus import alpino
pos_tag(word_tokenize("Python is een goede data science taal."), tagset = 'alpino')
This gives:
[('Python', 'UNK'),
('is', 'UNK'),
('een', 'UNK'),
('goede', 'UNK'),
('data', 'UNK'),
('science', 'UNK'),
('taal', 'UNK'),
('.', 'UNK')]
Clearly I'm not specifying the corpus correctly. I have downloaded the alpino corpus. Can anyone help me figure out how to specify it properly?
Answer 0 (score: 5)
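The default nltk.pos_tag is trained on English text, so you have to train a new tagger on the alpino corpus to roll your own Dutch tagger. The tagset argument only asks pos_tag to map its English (Penn Treebank) tags onto another tag set; it does not pick a language or a training corpus, and with no mapping to 'alpino' available every tag comes back as 'UNK'.
Such a model could look like this, using UnigramTagger and BigramTagger: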
>>> from nltk.corpus import alpino as alp
>>> from nltk.tag import UnigramTagger, BigramTagger
>>> training_corpus = alp.tagged_sents()
>>> unitagger = UnigramTagger(training_corpus)
>>> bitagger = BigramTagger(training_corpus, backoff=unitagger)
>>> pos_tag = bitagger.tag
>>> sent = 'NLTK is een goeda taal voor NLP'.split()
>>> pos_tag(sent)
[('NLTK', None), ('is', u'verb'), ('een', u'det'), ('goeda', None), ('taal', u'noun'), ('voor', u'prep'), ('NLP', None)]
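The None tags in that output belong to words the n-gram taggers never saw during training ('NLTK', 'goeda', 'NLP'). One common remedy is to end the backoff chain with a default tag; in NLTK that role is played by nltk.tag.DefaultTagger passed as the backoff of the UnigramTagger. The idea itself can be sketched in plain Python. Everything below is illustrative: the function names, the toy training sentences and the choice of "noun" as the default are assumptions, not part of NLTK or of the alpino corpus.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_sents):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def tag_with_default(words, unigram_model, default_tag="noun"):
    """Look each word up in the model; unseen words get default_tag, not None."""
    return [(w, unigram_model.get(w, default_tag)) for w in words]


# Toy training data, made up for illustration (NOT taken from alpino).
toy_train = [
    [("dit", "det"), ("is", "verb"), ("een", "det"), ("taal", "noun")],
    [("een", "det"), ("taal", "noun"), ("is", "verb"), ("mooi", "adj")],
]

model = train_unigram(toy_train)
print(tag_with_default("NLTK is een goeda taal".split(), model))
```

Guessing "noun" for every unknown word is crude, but it keeps downstream code from having to handle None.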
Using PerceptronTagger:
>>> from nltk.tag import PerceptronTagger
>>> from nltk.corpus import alpino as alp
>>> training_corpus = list(alp.tagged_sents())
>>> # load=False starts from an empty model instead of the pretrained English one
>>> tagger = PerceptronTagger(load=False)
>>> tagger.train(training_corpus)
>>> sent = 'NLTK is een goeda taal voor het leren over NLP'.split()
>>> tagger.tag(sent)
[('NLTK', u'noun'), ('is', u'verb'), ('een', u'det'), ('goeda', u'adj'), ('taal', u'noun'), ('voor', u'prep'), ('het', u'det'), ('leren', u'noun'), ('over', u'prep'), ('NLP', u'noun')]
As @WasiAhmed pointed out, here is another good example: https://github.com/evanmiltenburg/Dutch-tagger. And as @evanmiltenburg notes on GitHub, try to use a faster tagger in production.
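Training takes a while, so outside of experiments you probably don't want to retrain on every start. A minimal sketch of saving and reloading a trained tagger with the standard pickle module follows. It assumes the tagger object is picklable; the helper names and the file name are made up, and a plain dict stands in for the tagger so the snippet runs without NLTK.

```python
import os
import pickle
import tempfile


def save_tagger(tagger, path):
    """Serialize a trained tagger object to disk."""
    with open(path, "wb") as f:
        pickle.dump(tagger, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_tagger(path):
    """Load a tagger back. Only unpickle files you created yourself,
    because unpickling untrusted data can execute arbitrary code."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in for a trained tagger (assumption: the real object pickles the same way).
stand_in = {"is": "verb", "een": "det"}

path = os.path.join(tempfile.mkdtemp(), "dutch_tagger.pickle")
save_tagger(stand_in, path)
restored = load_tagger(path)
print(restored == stand_in)
```

With a real tagger you would call save_tagger(tagger, path) once after training and load_tagger(path) in the code that does the tagging.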
To evaluate the tagger, you can hold out a test_set:
>>> from nltk.tag import PerceptronTagger
>>> from nltk.corpus import alpino as alp
>>> alp_tagged_sents = list(alp.tagged_sents())
>>> len(alp_tagged_sents)
7136
>>> last_train_sent = int(len(alp_tagged_sents) / 10 * 9)
>>> train_set = alp_tagged_sents[:last_train_sent]
>>> test_set = alp_tagged_sents[last_train_sent:]
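The slice above keeps the last 10% of the sentences as the test set. If the corpus happens to be ordered by source or topic, that tail may not be representative, so a shuffled split with a fixed seed is a reasonable alternative. This is only a sketch: shuffled_split, its parameters and the toy data are hypothetical names for illustration.

```python
import random


def shuffled_split(sentences, test_fraction=0.1, seed=0):
    """Shuffle a copy of the sentences and split off a held-out test set.

    A fixed seed makes the split reproducible between runs.
    """
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Toy data: 20 one-word "sentences" in the same shape as alp.tagged_sents().
toy = [[("w%d" % i, "noun")] for i in range(20)]
train_set, test_set = shuffled_split(toy)
print(len(train_set), len(test_set))
```

On the real corpus you would pass alp_tagged_sents instead of toy; the result has the same train_set / test_set shape as the slices above.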
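Then use the tagger.evaluate() function to get the accuracy. The input to .evaluate() is the same as the input to .train(), i.e. a list of sentences, where each sentence is a list of ('word', 'tag') tuples: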
>>> tagger = PerceptronTagger(load=False)
>>> tagger.train(train_set)
>>> tagger.evaluate(test_set)
0.927672285043738
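The number returned here is token-level accuracy: the fraction of test tokens whose predicted tag equals the gold tag. As far as I know, recent NLTK releases expose the same measure as tagger.accuracy() and mark evaluate() as deprecated. To make clear what is being measured, here is a hand-rolled sketch; token_accuracy, always_noun and the gold data are made-up names and values for illustration.

```python
def token_accuracy(tag_fn, gold_sents):
    """Fraction of tokens for which tag_fn predicts the gold tag.

    tag_fn takes a list of words and returns a list of (word, tag) pairs,
    which is the shape tagger.tag produces.
    """
    correct = total = 0
    for gold in gold_sents:
        words = [word for word, _ in gold]
        predicted = tag_fn(words)
        for (_, pred_tag), (_, gold_tag) in zip(predicted, gold):
            correct += pred_tag == gold_tag
            total += 1
    return correct / total if total else 0.0


def always_noun(words):
    """Toy tagger that labels everything 'noun' (a baseline, not a real model)."""
    return [(w, "noun") for w in words]


gold = [
    [("een", "det"), ("taal", "noun")],
    [("taal", "noun"), ("is", "verb")],
]
print(token_accuracy(always_noun, gold))
```

With the trained tagger, token_accuracy(tagger.tag, test_set) should land on the same kind of figure as tagger.evaluate(test_set). Comparing against a trivial baseline like always_noun also helps put a score such as 0.93 in perspective.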
Answer 1 (score: 2)
You can use this tagger (https://github.com/evanmiltenburg/Dutch-tagger) to tag Dutch sentences. Its reported accuracy is 97%.
Example (using PerceptronTagger):
from nltk.tag.perceptron import PerceptronTagger
# This may take a few minutes. (But once loaded, the tagger is really fast!)
tagger = PerceptronTagger(load=False)
tagger.load('model.perc.dutch_tagger_small.pickle')
# Tag a sentence.
tagger.tag('Alle vogels zijn nesten begonnen , behalve ik en jij .'.split())
Output:
[('Alle', 'det__indef'), ('vogels', 'nounpl'), ('zijn', 'verbprespl'), ('nesten', 'nounpl'), ('begonnen', 'verbpapa'), (',', 'punc'), ('behalve', 'conjsubo'), ('ik', 'pronpers'), ('en', 'conjcoord'), ('jij', 'pronpers'), ('.', '$.')]