Spanish word tokeniser

Date: 2016-12-26 23:46:47

Tags: python-3.x nltk

I want to tokenize Spanish sentences into words. Is the following the correct approach, or is there a better way?

import nltk
from nltk.tokenize import word_tokenize

def spanish_word_tokenize(s):
    # word_tokenize leaves the Spanish inverted marks (¿, ¡) attached
    # to the following word, so split them off as separate tokens.
    for w in word_tokenize(s):
        if w[0] in ("¿", "¡"):
            yield w[0]   # the inverted mark on its own
            yield w[1:]  # the rest of the token
        else:
            yield w

sentences = "¿Quién eres tú? ¡Hola! ¿Dónde estoy?"

spanish_sentence_tokenizer = nltk.data.load('tokenizers/punkt/spanish.pickle')

sentences = spanish_sentence_tokenizer.tokenize(sentences)
for s in sentences:
    print(list(spanish_word_tokenize(s)))
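
For what it's worth, newer NLTK releases also accept a language argument to word_tokenize, but as far as I know it only selects the Punkt model used for the internal sentence split; the word-level rules are unchanged, so ¿ and ¡ still come back attached and the wrapper above is still needed. A minimal sketch:

from nltk.tokenize import word_tokenize

# language='spanish' only affects the internal sentence splitting;
# ¿ and ¡ still arrive attached to the following word.
print(word_tokenize("¿Quién eres tú?", language='spanish'))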

2 Answers:

Answer 0 (score: 2):

Cf. NLTK GitHub issue #1214; there are quite a few alternative tokenizers in NLTK =)

E.g., using the NLTK port of @jonsafari's toktok tokenizer:

>>> import nltk
>>> nltk.download('perluniprops')
[nltk_data] Downloading package perluniprops to
[nltk_data]     /Users/liling.tan/nltk_data...
[nltk_data]   Package perluniprops is already up-to-date!
True
>>> nltk.download('nonbreaking_prefixes')
[nltk_data] Downloading package nonbreaking_prefixes to
[nltk_data]     /Users/liling.tan/nltk_data...
[nltk_data]   Package nonbreaking_prefixes is already up-to-date!
True
>>> from nltk.tokenize.toktok import ToktokTokenizer
>>> toktok = ToktokTokenizer()
>>> sent = "¿Quién eres tú? ¡Hola! ¿Dónde estoy?"
>>> toktok.tokenize(sent)
['¿', 'Quién', 'eres', 'tú', '?', '¡Hola', '!', '¿', 'Dónde', 'estoy', '?']
>>> print(" ".join(toktok.tokenize(sent)))
¿ Quién eres tú ? ¡Hola ! ¿ Dónde estoy ?

>>> from nltk import sent_tokenize
>>> sentences = "¿Quién eres tú? ¡Hola! ¿Dónde estoy?"
>>> [toktok.tokenize(sent) for sent in sent_tokenize(sentences, language='spanish')]
[['¿', 'Quién', 'eres', 'tú', '?'], ['¡Hola', '!'], ['¿', 'Dónde', 'estoy', '?']]

>>> print('\n'.join(' '.join(toktok.tokenize(sent)) for sent in sent_tokenize(sentences, language='spanish')))
¿ Quién eres tú ?
¡Hola !
¿ Dónde estoy ?

If you hack the code a little and add u'\xa1' (the ¡ character) to the punctuation set at https://github.com/nltk/nltk/blob/develop/nltk/tokenize/toktok.py#L51, you should be able to get:

[['¿', 'Quién', 'eres', 'tú', '?'], ['¡', 'Hola', '!'], ['¿', 'Dónde', 'estoy', '?']]
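
If you'd rather not edit the installed toktok.py, an equivalent workaround (a sketch, not part of NLTK's API) is to pad ¡ with spaces yourself before calling the tokenizer:

import re
from nltk.tokenize.toktok import ToktokTokenizer

toktok = ToktokTokenizer()

# '\xa1' is the inverted exclamation mark (¡); padding it with spaces
# before tokenizing mirrors the suggested one-character patch without
# editing the installed toktok.py.
INVERTED_EXCLAMATION = re.compile('([\xa1])')

def tokenize_es(text):  # hypothetical helper, not part of NLTK
    return toktok.tokenize(INVERTED_EXCLAMATION.sub(r' \1 ', text))

print(tokenize_es("¡Hola! ¿Dónde estoy?"))
# expected: ['¡', 'Hola', '!', '¿', 'Dónde', 'estoy', '?']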

Answer 1 (score: 2):

There is a simpler solution using spaCy. It only works, however, if you have downloaded the spaCy data first: python -m spacy download es

import spacy

nlp = spacy.load('es')  # requires: python -m spacy download es
sentences = "¿Quién eres tú? ¡Hola! ¿Dónde estoy?"
doc = nlp(sentences)
tokens = [token for token in doc]  # spaCy Token objects
print(tokens)

which gives the correct answer:

[¿, Quién, eres, tú, ?, ¡, Hola, !, ¿, Dónde, estoy, ?]
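
Since the same 'es' model also provides sentence boundaries, you can get per-sentence token lists from a single pass; a small sketch building on the code above:

import spacy

# Sketch (same 'es' model as above): the parser also marks sentence
# boundaries, so one nlp() call yields both sentences and tokens.
nlp = spacy.load('es')
doc = nlp("¿Quién eres tú? ¡Hola! ¿Dónde estoy?")
for sent in doc.sents:
    print([token.text for token in sent])  # plain strings, not Tokens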

I wouldn't recommend NLTK's ToktokTokenizer, since according to the documentation "the input must be one sentence per line; thus only the final period is tokenized", so you have to worry about sentence segmentation first.