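I am using NLTK to tag some Arabic text.
However, I end up with results such as
(u'an Arabic character/word', '``') or (u'an Arabic character/word', ':')
However, the documentation does not describe the `` or : tags.
So I would like to know what they mean.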
import nltk
from nltk.tokenize.punkt import PunktWordTokenizer

z = "أنا تسلق شجرة"
# Instantiate the tokenizer, split the text into tokens, then tag the tokens
tkn = PunktWordTokenizer()
sent = tkn.tokenize(z)
tokens = nltk.pos_tag(sent)
print(tokens)
Answer 0 (score: 3)
The default NLTK POS tagger is trained on English text and is presumably meant for processing English text; see http://www.nltk.org/_modules/nltk/tag.html. From the documentation:
An off-the-shelf tagger is available. It uses the Penn Treebank tagset:
>>> from nltk.tag import pos_tag # doctest: +SKIP
>>> from nltk.tokenize import word_tokenize # doctest: +SKIP
>>> pos_tag(word_tokenize("John's big idea isn't all that bad.")) # doctest: +SKIP
[('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is',
'VBZ'), ("n't", 'RB'), ('all', 'DT'), ('that', 'DT'), ('bad', 'JJ'),
('.', '.')]
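Since the off-the-shelf tagger uses the Penn Treebank tagset, tags such as `` and : are most likely that tagset's punctuation tags, which the English model assigns wrongly to Arabic words. The lookup below is a hand-written summary of those punctuation tags, not something loaded from NLTK; the names PTB_PUNCTUATION_TAGS, describe_tag and punctuation_tagged are hypothetical helpers introduced only for illustration.

```python
# Hand-written summary of the Penn Treebank punctuation tags.
# This table is an illustration and is not read from NLTK itself.
PTB_PUNCTUATION_TAGS = {
    "``": "opening quotation mark",
    "''": "closing quotation mark",
    ":": "mid-sentence punctuation such as a colon, semicolon, or ellipsis",
    ",": "comma",
    ".": "sentence-final punctuation",
    "(": "opening bracket",
    ")": "closing bracket",
    "$": "dollar sign",
    "#": "pound sign",
}


def describe_tag(tag):
    """Return a description for a punctuation tag, or None for any other tag."""
    return PTB_PUNCTUATION_TAGS.get(tag)


def punctuation_tagged(tagged_tokens):
    """Return the (token, tag) pairs whose tag is a punctuation tag."""
    return [(tok, tag) for tok, tag in tagged_tokens
            if tag in PTB_PUNCTUATION_TAGS]
```

Running punctuation_tagged over the tagger's output is one way to spot words that received a punctuation tag, which is a hint that the model does not fit the language of the input.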
The code for pos_tag:
from nltk.data import load
# Standard treebank POS tagger
_POS_TAGGER = 'taggers/maxent_treebank_pos_tagger/english.pickle'
def pos_tag(tokens):
    """
    Use NLTK's currently recommended part of speech tagger to
    tag the given list of tokens.

        >>> from nltk.tag import pos_tag # doctest: +SKIP
        >>> from nltk.tokenize import word_tokenize # doctest: +SKIP
        >>> pos_tag(word_tokenize("John's big idea isn't all that bad.")) # doctest: +SKIP
        [('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is',
        'VBZ'), ("n't", 'RB'), ('all', 'DT'), ('that', 'DT'), ('bad', 'JJ'),
        ('.', '.')]

    :param tokens: Sequence of tokens to be tagged
    :type tokens: list(str)
    :return: The tagged tokens
    :rtype: list(tuple(str, str))
    """
    tagger = load(_POS_TAGGER)
    return tagger.tag(tokens)
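Because pos_tag always loads the English model, whatever the input, a caller may want to check the script of the tokens before trusting its tags. This is a minimal sketch, assuming Python 3. is_arabic_char and arabic_tokens are hypothetical helpers, not part of NLTK, and the check covers only the basic Arabic Unicode block (U+0600 to U+06FF).

```python
def is_arabic_char(ch):
    """True if ch lies in the basic Arabic Unicode block (U+0600 to U+06FF)."""
    return "\u0600" <= ch <= "\u06ff"


def arabic_tokens(tokens):
    """Return the tokens that contain at least one Arabic character."""
    return [tok for tok in tokens if any(is_arabic_char(c) for c in tok)]


tokens = ["John", "شجرة", "."]
# Tokens listed here would be better served by a model trained on Arabic.
print(arabic_tokens(tokens))
```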
The following worked for me to get the Stanford tools working with Python on Ubuntu 14.04.1:
$ cd ~
$ wget http://nlp.stanford.edu/software/stanford-postagger-full-2015-01-29.zip
$ unzip stanford-postagger-full-2015-01-29.zip
$ wget http://nlp.stanford.edu/software/stanford-segmenter-2015-01-29.zip
$ unzip stanford-segmenter-2015-01-29.zip
$ python
Then:
from nltk.tag.stanford import POSTagger
path_to_model= '/home/alvas/stanford-postagger-full-2015-01-30/models/arabic.tagger'
path_to_jar = '/home/alvas/stanford-postagger-full-2015-01-30/stanford-postagger-3.5.1.jar'
artagger = POSTagger(path_to_model, path_to_jar, encoding='utf8')
artagger._SEPARATOR = '/'
tagged_sent = artagger.tag(u"أنا تسلق شجرة")
print(tagged_sent)
[OUT]:
$ python3 test.py
[('أ', 'NN'), ('ن', 'NN'), ('ا', 'NN'), ('ت', 'NN'), ('س', 'RP'), ('ل', 'IN'), ('ق', 'NN'), ('ش', 'NN'), ('ج', 'NN'), ('ر', 'NN'), ('ة', 'PRP')]
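The output above has one tag per character rather than one per word, which is consistent with the tagger iterating over the raw string it was given instead of over a list of tokens. The sketch below uses plain Python only (no Stanford tools) to show the difference between the two inputs. Treating whitespace splitting as adequate tokenization for Arabic is an assumption; the Stanford segmenter downloaded earlier may be the better choice. The helper names are hypothetical.

```python
def as_characters(text):
    """What iterating over a raw string yields: one item per non-space character."""
    return [ch for ch in text if not ch.isspace()]


def as_word_tokens(text):
    """A naive whitespace tokenization: one item per word."""
    return text.split()


sentence = "أنا تسلق شجرة"
chars = as_characters(sentence)
words = as_word_tokens(sentence)

# The character count matches the number of tagged items in the output above.
print(len(chars), len(words))
```

Passing a token list such as words to the tagger, rather than the string itself, would be the variant to try for word-level tags.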
If you run into Java problems when using the Stanford POS tagger, see the DELPH-IN wiki: http://moin.delph-in.net/ZhongPreprocessing