Tagging Spanish words with NLTK using a corpus

Date: 2013-02-06 15:19:25

Tags: python nltk

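I am trying to learn how to tag Spanish words using NLTK.

Following the NLTK book, it is quite easy to tag English words using their examples. Because I am new to NLTK and to language processing in general, I am quite confused about how to proceed.

I have downloaded the cess_esp corpus. Is there a way to specify a corpus in nltk.pos_tag? I looked at the pos_tag documentation and did not see anything suggesting that I could. I feel like I am missing some key concept. Do I have to manually tag the words in my text against the cess_esp corpus? (By manually I mean tokenize my sentence and run it against the corpus.) Or am I off the mark entirely? Thank you.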

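4 Answers:

Answer 0 (score: 15)

First, you need to read the tagged sentences from a corpus. NLTK provides a nice interface so that you do not have to bother with the different formats of different corpora; you can simply import the corpus and use the corpus object's functions to access the data. See http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml

Then you have to choose a tagger and train it. There are fancier options, but you can start with the N-gram taggers.

Then you can use the tagger to tag the sentence you want. Here is some example code: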

from nltk.corpus import cess_esp as cess
from nltk import UnigramTagger as ut
from nltk import BigramTagger as bt

# Read the corpus into a list, 
# each entry in the list is one sentence.
cess_sents = cess.tagged_sents()

# Train the unigram tagger
uni_tag = ut(cess_sents)

sentence = "Hola , esta foo bar ."

# Tagger reads a list of tokens.
uni_tag.tag(sentence.split(" "))

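# Split corpus into training and testing set.
train = int(len(cess_sents)*90/100) # 90%

# Train a bigram tagger with only training data.
bi_tag = bt(cess_sents[:train])

# Evaluate on the remaining 10%. The slice starts at `train`
# (not `train+1`) so that no sentence is skipped.
# Note: newer NLTK releases deprecate evaluate() in favor of accuracy().
bi_tag.evaluate(cess_sents[train:])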

# Using the tagger.
bi_tag.tag(sentence.split(" "))
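
To make the N-gram idea a little more concrete: a unigram tagger essentially remembers, for every word seen in training, the tag it carried most often. The following is a minimal pure-Python sketch of that idea, not NLTK's actual implementation. The names `train_unigram` and `tag_tokens` and the toy sentences are made up for illustration. Unknown words get `None` here, which is also what NLTK's N-gram taggers return when no backoff tagger is configured.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_tokens(model, tokens, default=None):
    """Look up each token; fall back to `default` for unseen words."""
    return [(tok, model.get(tok, default)) for tok in tokens]

# Toy training data (invented for this sketch, not taken from cess_esp).
toy_sents = [
    [("la", "da0000"), ("casa", "nc0s000"), (".", "Fp")],
    [("la", "pp000000"), ("veo", "vmip000"), (".", "Fp")],
    [("la", "da0000"), ("tos", "nc0s000"), (".", "Fp")],
]

model = train_unigram(toy_sents)
print(tag_tokens(model, "la casa foo .".split()))
```

Passing a `default` tag plays the role that a backoff tagger plays in NLTK: it supplies a tag whenever the lookup has nothing to offer.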

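Training a tagger on a large corpus may take a significant amount of time. Instead of training a tagger every time we need one, it is convenient to save a trained tagger to a file for later reuse.

Please look at the Storing Taggers section in http://nltk.googlecode.com/svn/trunk/doc/book/ch05.html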
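
As a rough sketch of what storing a tagger can look like, the helpers below use Python's standard `pickle` module. The names `save_tagger` and `load_tagger` and the file name are assumptions made for this example. A plain dictionary stands in for the trained tagger so that the snippet runs without NLTK or a corpus; a trained tagger object such as `uni_tag` could be passed instead.

```python
import os
import pickle
import tempfile

def save_tagger(tagger, path):
    """Serialize a trained tagger (or any picklable object) to `path`."""
    with open(path, "wb") as f:
        pickle.dump(tagger, f, protocol=pickle.HIGHEST_PROTOCOL)

def load_tagger(path):
    """Read back an object written by save_tagger."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Stand-in for a trained tagger (illustrative data only); in practice this
# would be, for example, the uni_tag object from the code above.
stand_in_tagger = {"la": "da0000", "casa": "nc0s000"}

path = os.path.join(tempfile.mkdtemp(), "cess_unigram.tagger")
save_tagger(stand_in_tagger, path)
restored = load_tagger(path)
print(restored == stand_in_tagger)
```

Keep in mind that unpickling executes code from the file, so only load tagger files that you created yourself or otherwise trust.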

Answer 1 (score: 7)

Building on the tutorial in the previous answer, here is a more object-oriented approach from the spaghetti tagger: https://github.com/alvations/spaghetti-tagger

# -*- coding: utf-8 -*-

from pickle import dump, load

from nltk import UnigramTagger as ut
from nltk import BigramTagger as bt

def loadtagger(taggerfilename):
    # Load a pickled tagger from disk.
    with open(taggerfilename, 'rb') as infile:
        return load(infile)

def traintag(corpusname, corpus):
    # Function to save tagger.
    def savetagger(tagfilename, tagger):
        with open(tagfilename, 'wb') as outfile:
            dump(tagger, outfile, -1)
    # Training UnigramTagger.
    uni_tag = ut(corpus)
    savetagger(corpusname + '_unigram.tagger', uni_tag)
    # Training BigramTagger.
    bi_tag = bt(corpus)
    savetagger(corpusname + '_bigram.tagger', bi_tag)
    print("Tagger trained with", corpusname,
          "using UnigramTagger and BigramTagger.")

# Function to unchunk corpus: split multiword expressions (joined
# with "_") into separate tokens and drop the tags.
def unchunk(corpus):
    nomwe_corpus = []
    for i in corpus:
        nomwe = " ".join([j[0].replace("_", " ") for j in i])
        nomwe_corpus.append(nomwe.split())
    return nomwe_corpus

class cesstag():
    def __init__(self, mwe=True):
        self.mwe = mwe
        # Train tagger if it's used for the first time.
        try:
            loadtagger('cess_unigram.tagger').tag(['estoy'])
            loadtagger('cess_bigram.tagger').tag(['estoy'])
        except IOError:
            print("*** First-time use of cess tagger ***")
            print("Training tagger ...")
            from nltk.corpus import cess_esp as cess
            cess_sents = cess.tagged_sents()
            traintag('cess', cess_sents)
            # Trains the tagger with no MWE.
            cess_nomwe = unchunk(cess.tagged_sents())
            tagged_cess_nomwe = batch_pos_tag(cess_nomwe)
            traintag('cess_nomwe', tagged_cess_nomwe)
            print()
        # Load tagger.
        if self.mwe:
            self.uni = loadtagger('cess_unigram.tagger')
            self.bi = loadtagger('cess_bigram.tagger')
        else:
            self.uni = loadtagger('cess_nomwe_unigram.tagger')
            self.bi = loadtagger('cess_nomwe_bigram.tagger')

def pos_tag(tokens, mmwe=True):
    tagger = cesstag(mmwe)
    return tagger.uni.tag(tokens)

def batch_pos_tag(sentences, mmwe=True):
    tagger = cesstag(mmwe)
    # tag_sents() is the NLTK 3 name for what older releases called batch_tag().
    return tagger.uni.tag_sents(sentences)

tagger = cesstag()
print(tagger.uni.tag('Mi colega me ayuda a programar cosas .'.split()))
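
To illustrate what the `unchunk` step in this answer does, here is a small standalone sketch of the same idea applied to a single toy sentence. The sentence and its tags are invented for the example and are not taken from cess_esp. Multiword expressions joined with underscores are split into separate tokens, and the original tags are discarded, which is why the answer re-tags the result before training the "no MWE" taggers.

```python
def unchunk(tagged_sents):
    """Split underscore-joined multiword tokens into words, dropping tags."""
    return [
        " ".join(word.replace("_", " ") for word, _tag in sent).split()
        for sent in tagged_sents
    ]

# Toy tagged sentence containing one multiword expression (made-up data).
toy = [[("El", "da0000"), ("Tribunal_Supremo", "np00000"),
        ("decide", "vmip000"), (".", "Fp")]]

print(unchunk(toy))
```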

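Answer 2 (score: 5)

I ended up here while searching for POS taggers for languages other than English. Another option for your problem is the spaCy library, which offers POS tagging for multiple languages such as Dutch, German, French, Portuguese, Spanish, Norwegian, Italian, Greek and Lithuanian.

From the spaCy documentation: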

import es_core_news_sm
nlp = es_core_news_sm.load()
doc = nlp("El copal se usa principalmente para sahumar en distintas ocasiones como lo son las fiestas religiosas.")
print([(w.text, w.pos_) for w in doc])

which results in:

[('El', 'DET'), ('copal', 'NOUN'), ('se', 'PRON'), ('usa', 'VERB'), ('principalmente', 'ADV'), ('para', 'ADP'), ('sahumar', 'VERB'), ('en', 'ADP'), ('distintas', 'DET'), ('ocasiones', 'NOUN'), ('como', 'SCONJ'), ('lo', 'PRON'), ('son', 'AUX'), ('las', 'DET'), ('fiestas', 'NOUN'), ('religiosas', 'ADJ'), ('.', 'PUNCT')]

and to visualize it in a notebook:

from spacy import displacy
displacy.render(doc, style='dep', jupyter = True, options = {'distance': 120})


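Answer 3 (score: 1)

The following script gives you a quick way to get a "bag of words" from Spanish sentences. Note that if you want to do it properly, you must tokenize the sentences before tagging them, so that 'religiosas.' is split into two tokens: 'religiosas' and '.'.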

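#-*- coding: utf8 -*-

# about the tagger: http://nlp.stanford.edu/software/tagger.shtml
# about the tagset: nlp.lsi.upc.edu/freeling/doc/tagsets/tagset-es.html

# Written for Python 2 and an older NLTK release; NLTK 3 exposes this
# class as StanfordPOSTagger.
from nltk.tag.stanford import POSTagger

def isNoun(tag):
    # Assumed helper: the original answer does not show its definition.
    # In the tagset linked above, noun tags start with 'n' (nc..., np...),
    # which matches the words collected in the output below.
    return tag.startswith('n')

spanish_postagger = POSTagger('models/spanish.tagger', 'stanford-postagger.jar', encoding='utf8')

sentences = ['El copal se usa principalmente para sahumar en distintas ocasiones como lo son las fiestas religiosas.','Las flores, hojas y frutos se usan para aliviar la tos y también se emplea como sedante.']

for sent in sentences:
    # Naive whitespace split: punctuation stays attached to the words.
    words = sent.split()
    tagged_words = spanish_postagger.tag(words)
    nouns = []
    for (word, tag) in tagged_words:
        print(word+' '+tag).encode('utf8')
        if isNoun(tag): nouns.append(word)
    print(nouns)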

which gives:

El da0000
copal nc0s000
se p0000000
usa vmip000
principalmente rg
para sp000
sahumar vmn0000
en sp000
distintas di0000
ocasiones nc0p000
como cs
lo pp000000
son vsip000
las da0000
fiestas nc0p000
religiosas. np00000
[u'copal', u'ocasiones', u'fiestas', u'religiosas.']
Las da0000
flores, np00000
hojas nc0p000
y cc
frutos nc0p000
se p0000000
usan vmip000
para sp000
aliviar vmn0000
la da0000
tos nc0s000
y cc
también rg
se p0000000
emplea vmip000
como cs
sedante. nc0s000
[u'flores,', u'hojas', u'frutos', u'tos', u'sedante.']
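
As this answer points out, the output above still has punctuation glued to words ('religiosas.', 'flores,') because the sentences were split on whitespace only. A minimal sketch of a tokenizer that separates punctuation before tagging could look like the following. It uses only a regular expression, the name `simple_tokenize` is hypothetical, and it is far cruder than a real tokenizer such as NLTK's `word_tokenize`.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters/digits (Unicode-aware in Python 3);
    # [^\w\s] matches one character that is neither word nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

sentence = ("El copal se usa principalmente para sahumar en distintas "
            "ocasiones como lo son las fiestas religiosas.")
print(simple_tokenize(sentence))
```

The resulting token list can then be passed to the tagger's `tag` method in place of `sent.split()`.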