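Tagging a Dutch sentence with NLTK

Date: 2016-10-24 07:31:16

Tags: python nltk

I'm getting started with NLTK and want to tag a Dutch sentence, but I'm having trouble specifying the corpus.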

from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
from nltk.corpus import alpino

pos_tag(word_tokenize("Python is een goede data science taal."), tagset = 'alpino')

This gives:

[('Python', 'UNK'),
 ('is', 'UNK'),
 ('een', 'UNK'),
 ('goede', 'UNK'),
 ('data', 'UNK'),
 ('science', 'UNK'),
 ('taal', 'UNK'),
 ('.', 'UNK')]

Clearly, I'm not specifying the corpus correctly. I have downloaded the alpino corpus. Can anyone help me figure out how to specify it correctly?

2 answers:

Answer 0 (score: 5)

The default nltk.pos_tag is trained on English text. You have to train a new tagger on the alpino corpus to roll your own Dutch tagger.

But note that the model will only be as good as:

  • the data it is trained on
  • the algorithm used to train it
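
On the all-'UNK' output in the question: as far as I can tell, the tagset argument of pos_tag does not choose a corpus or a model. It only asks NLTK to translate the English tagger's labels into another label set, and a target name that has no mapping table falls through to 'UNK'. The sketch below imitates that lookup with plain dictionaries. The table contents and the map_tag helper here are stand-ins written for illustration, not NLTK's own code, and the source tags in the sample are invented.

```python
from collections import defaultdict

# Hypothetical miniature mapping table:
# source tagset -> target tagset -> source tag -> target tag.
# Any missing entry resolves to "UNK".
mappings = defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: "UNK")))
mappings["en-ptb"]["universal"].update({"NNP": "NOUN", "VBZ": "VERB", "DT": "DET"})

def map_tag(source, target, tag):
    """Translate one tag from the source tagset into the target tagset."""
    return mappings[source][target][tag]

# Invented output of an English tagger (the tags are for illustration only).
english_tags = [("Python", "NNP"), ("is", "VBZ"), ("een", "DT")]

# A target with a mapping table gets translated tags.
to_universal = [(w, map_tag("en-ptb", "universal", t)) for w, t in english_tags]
# A target without a mapping table yields "UNK" for every token.
to_alpino = [(w, map_tag("en-ptb", "alpino", t)) for w, t in english_tags]

print(to_universal)
print(to_alpino)
```

If that reading is right, no value of tagset can make the English model produce Dutch tags, which is why a tagger trained on alpino is needed.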

From the UnigramTagger and BigramTagger example:

>>> from nltk.corpus import alpino as alp
>>> from nltk.tag import UnigramTagger, BigramTagger
>>> training_corpus = alp.tagged_sents()
>>> unitagger = UnigramTagger(training_corpus)
>>> bitagger = BigramTagger(training_corpus, backoff=unitagger)
>>> pos_tag = bitagger.tag
>>> sent = 'NLTK is een goeda taal voor NLP'.split()
>>> pos_tag(sent)
[('NLTK', None), ('is', u'verb'), ('een', u'det'), ('goeda', None), ('taal', u'noun'), ('voor', u'prep'), ('NLP', None)]
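
The None tags above are words that neither tagger saw during training. The backoff idea can be sketched in plain Python: look up the (previous tag, word) pair first, then the word alone, then fall back to a default tag. This is an illustration on a made-up toy corpus, not NLTK's implementation; train_tables, tag_with_backoff, and the 'noun' default are all assumptions chosen for the example.

```python
from collections import Counter, defaultdict

def train_tables(tagged_sents):
    """Build most-frequent-tag tables for single words and (previous tag, word) pairs."""
    uni_counts = defaultdict(Counter)
    bi_counts = defaultdict(Counter)
    for sent in tagged_sents:
        prev_tag = None  # None marks the start of a sentence
        for word, tag in sent:
            uni_counts[word][tag] += 1
            bi_counts[(prev_tag, word)][tag] += 1
            prev_tag = tag
    unigram = {w: c.most_common(1)[0][0] for w, c in uni_counts.items()}
    bigram = {k: c.most_common(1)[0][0] for k, c in bi_counts.items()}
    return unigram, bigram

def tag_with_backoff(words, unigram, bigram, default_tag="noun"):
    """Try the bigram table, then the unigram table, then the default tag."""
    tagged = []
    prev_tag = None
    for word in words:
        tag = bigram.get((prev_tag, word))
        if tag is None:
            tag = unigram.get(word, default_tag)
        tagged.append((word, tag))
        prev_tag = tag
    return tagged

# Toy training data; the tags are illustrative, not taken from alpino.
toy_corpus = [
    [("dit", "det"), ("is", "verb"), ("een", "det"), ("taal", "noun")],
    [("een", "det"), ("goede", "adj"), ("taal", "noun")],
]
unigram, bigram = train_tables(toy_corpus)
result = tag_with_backoff("NLTK is een goede taal".split(), unigram, bigram)
print(result)
```

In NLTK itself, a DefaultTagger placed at the end of the backoff chain plays the role of default_tag, so unseen words get a fallback tag instead of None.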

Using the PerceptronTagger:

>>> from nltk.tag import PerceptronTagger
>>> from nltk.corpus import alpino as alp
>>> training_corpus = list(alp.tagged_sents()) 
>>> tagger = PerceptronTagger(load=False)  # start from an empty model, not the pretrained English one
>>> tagger.train(training_corpus)
>>> sent = 'NLTK is een goeda taal voor het leren over NLP'.split()
>>> tagger.tag(sent)
[('NLTK', u'noun'), ('is', u'verb'), ('een', u'det'), ('goeda', u'adj'), ('taal', u'noun'), ('voor', u'prep'), ('het', u'det'), ('leren', u'noun'), ('over', u'prep'), ('NLP', u'noun')]

As @WasiAhmed pointed out, here is another good example: https://github.com/evanmiltenburg/Dutch-tagger. And as @evanmiltenburg notes on GitHub, try to use faster taggers in production.

EDITED

To evaluate the tagger, you can hold out a test_set:

>>> from nltk.tag import PerceptronTagger
>>> from nltk.corpus import alpino as alp
>>> alp_tagged_sents = list(alp.tagged_sents())
>>> len(alp_tagged_sents)
7136
>>> last_train_sent = int(len(alp_tagged_sents) / 10 * 9)
>>> train_set = alp_tagged_sents[:last_train_sent]
>>> test_set = alp_tagged_sents[last_train_sent:]
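
The same 90/10 hold-out can be wrapped in a small helper. This sketch uses integer arithmetic so the cut point does not depend on float rounding, and it can optionally shuffle with a fixed seed in case the corpus is ordered by topic or source. The name split_train_test and its parameters are hypothetical, and plain integers stand in for the 7136 tagged sentences.

```python
import random

def split_train_test(sents, train_tenths=9, seed=None):
    """Return (train, test), keeping train_tenths/10 of the items for training."""
    items = list(sents)
    if seed is not None:
        # Shuffle reproducibly so the split does not follow corpus order.
        random.Random(seed).shuffle(items)
    cut = len(items) * train_tenths // 10
    return items[:cut], items[cut:]

# Stand-in for list(alp.tagged_sents()), which has 7136 sentences.
fake_sents = list(range(7136))

train_set, test_set = split_train_test(fake_sents)
print(len(train_set), len(test_set))

# Shuffled variant: same sizes, different membership.
shuffled_train, shuffled_test = split_train_test(fake_sents, seed=0)
```

Without a seed the helper reproduces the slice above: the first 6422 sentences for training and the remaining 714 for testing.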

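Then use the tagger.evaluate() function to get the accuracy. The input to .evaluate() is the same as the input to .train(): a list of sentences, where each sentence is a list of ('word', 'tag') tuples: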

>>> tagger = PerceptronTagger(load=False)
>>> tagger.train(train_set)
>>> tagger.evaluate(test_set)
0.927672285043738
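
The number returned is token-level accuracy: the share of test tokens whose predicted tag matches the gold tag. A minimal sketch of that calculation follows, assuming any function that maps a list of words to (word, tag) pairs. The names token_accuracy and toy_tag and the toy data are made up for illustration. Newer NLTK releases, as far as I know, expose the same measurement as tagger.accuracy() and mark evaluate() as deprecated.

```python
def token_accuracy(tag_fn, gold_sents):
    """Fraction of tokens for which tag_fn predicts the gold tag."""
    correct = 0
    total = 0
    for gold in gold_sents:
        words = [word for word, _ in gold]
        predicted = tag_fn(words)
        for (_, predicted_tag), (_, gold_tag) in zip(predicted, gold):
            if predicted_tag == gold_tag:
                correct += 1
            total += 1
    return correct / total if total else 0.0

# Toy tagger: dictionary lookup with 'noun' as the fallback tag.
lookup = {"een": "det", "taal": "noun", "is": "verb"}

def toy_tag(words):
    return [(word, lookup.get(word, "noun")) for word in words]

# Toy gold standard in the same shape that .train() and .evaluate() expect.
gold = [
    [("een", "det"), ("taal", "noun")],
    [("is", "verb"), ("goed", "adj")],
]
score = token_accuracy(toy_tag, gold)
print(score)
```

Here three of the four tokens match, because the unseen word "goed" falls back to 'noun' while the gold tag is 'adj'.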

Answer 1 (score: 2)

You can use this tagger (https://github.com/evanmiltenburg/Dutch-tagger) to tag Dutch sentences. Its accuracy is 97%.

Example (using PerceptronTagger):

from nltk.tag.perceptron import PerceptronTagger

# This may take a few minutes. (But once loaded, the tagger is really fast!)
tagger = PerceptronTagger(load=False)
tagger.load('model.perc.dutch_tagger_small.pickle')

# Tag a sentence.
tagger.tag('Alle vogels zijn nesten begonnen , behalve ik en jij .'.split())

Output:

[('Alle', 'det__indef'), ('vogels', 'nounpl'), ('zijn', 'verbprespl'), ('nesten', 'nounpl'), ('begonnen', 'verbpapa'), (',', 'punc'), ('behalve', 'conjsubo'), ('ik', 'pronpers'), ('en', 'conjcoord'), ('jij', 'pronpers'), ('.', '$.')]