NLTK and tagging idioms

Time: 2016-12-23 11:03:33

Tags: nltk

I am learning Spanish and started out with two corpora:

cess_esp and conll2002/esp

I noticed that if I run the following against the cess_esp corpus:

>>> from nltk.corpus import cess_esp as cess
>>> symbols = list(set(w[0].lower() for s in cess.tagged_sents() for w in s))
>>> list(filter(lambda x: "cabo" in x, symbols))
['llevaron_a_cabo', 'llevaran_a_cabo', 'cabo', 'llevó_a_cabo', 'llevado_a_cabo', 'llevarán_a_cabo', 'lleva_a_cabo', 'al_cabo_de', 'llevan_a_cabo', 'al_cabo', 'llevadas_a_cabo', 'llevará_a_cabo', 'llevarse_a_cabo', 'llevar_a_cabo', 'al_fin_y_al_cabo']

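The corpus appears to represent idioms as single tokens, with underscores joining the words. Does NLTK have any facility for handling these underscore-joined tokens when training a tagger?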
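For reference, the underscore-joined entries can be split back into their component words with plain string handling. This is only a minimal sketch using a few of the tokens from the output above; the helper name `split_mwe` is made up for illustration and is not part of NLTK.

```python
# A few tokens copied from the cess_esp output above.
tokens = ['llevar_a_cabo', 'cabo', 'al_cabo_de', 'al_fin_y_al_cabo']

def split_mwe(token):
    """Return the component words of an underscore-joined token as a tuple."""
    return tuple(token.split('_'))

# Keep only the multi-word entries, each as a tuple of its words.
mwes = [split_mwe(t) for t in tokens if '_' in t]
print(mwes)
```

A list of word tuples like this is the shape that a multi-word-expression lookup would typically need.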

Here is my POS tagger. How can I update it so that it can tag idioms?

import nltk
from nltk.corpus import cess_esp as cess
from nltk.tokenize import word_tokenize

spanish_sentence_tokenizer = nltk.data.load('tokenizers/punkt/spanish.pickle')

def trigram_tagger(training_set):
    # Back-off chain: trigram -> bigram -> unigram -> default tag.
    default_tagger = nltk.DefaultTagger('NN')
    unigram_tagger = nltk.UnigramTagger(training_set, backoff=default_tagger)
    bigram_tagger = nltk.BigramTagger(training_set, backoff=unigram_tagger)
    return nltk.TrigramTagger(training_set, backoff=bigram_tagger)

def tag_sentences(tagger, sentence_tokenizer, sentences):
    # Split the raw text into sentences, then tag the words of each sentence.
    tokenized_sentences = sentence_tokenizer.tokenize(sentences)
    return [tagger.tag(word_tokenize(s)) for s in tokenized_sentences]

# `sentences` is the raw Spanish text to tag; it is not defined in this snippet.
tagger = trigram_tagger(cess.tagged_sents())
print(tag_sentences(tagger, spanish_sentence_tokenizer, sentences))
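One possible direction, offered as a sketch rather than a confirmed solution: NLTK ships `nltk.tokenize.MWETokenizer`, which merges known multi-word expressions in an already tokenized sentence into single tokens using a separator. With `separator='_'` the merged tokens take the same form as the cess_esp entries, so the tagger would see `llevar_a_cabo` instead of three separate words. The expression list and the example sentence below are hand-written assumptions for illustration; in practice the list would be derived from the underscore-joined tokens in the corpus.

```python
from nltk.tokenize import MWETokenizer

# Multi-word expressions as tuples of words (illustrative subset,
# based on underscore-joined tokens seen in cess_esp).
mwes = [
    ('llevar', 'a', 'cabo'),
    ('al', 'cabo', 'de'),
    ('al', 'fin', 'y', 'al', 'cabo'),
]

# Join matched expressions with '_' to mirror the corpus convention.
mwe_tokenizer = MWETokenizer(mwes, separator='_')

# Input must already be split into words (hand-written example sentence).
words = ['Van', 'a', 'llevar', 'a', 'cabo', 'el', 'plan']
merged = mwe_tokenizer.tokenize(words)
print(merged)
```

In `tag_sentences`, the output of `word_tokenize(s)` could be passed through `mwe_tokenizer.tokenize(...)` before calling `tagger.tag(...)`. Matching is exact and case-sensitive, so each inflected form (`llevó a cabo`, `llevan a cabo`, and so on) needs its own entry, and whether this actually improves tagging would have to be checked on real text.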

0 answers:

No answers