I have the following code:
import nltk
from nltk.tokenize import word_tokenize
sent = 'El gato está bajo la mesa de cristal.'
nltk.pos_tag(word_tokenize(sent), lang='spa')
But the output is not accurate at all:
[('El', 'NNP'),
('gato', 'NN'),
('está', 'NN'),
('bajo', 'NN'),
('la', 'FW'),
('mesa', 'FW'),
('de', 'FW'),
('cristal', 'NN'),
('.', '.')]
For example, está should be classified as a verb.
If I try the same thing with an English sentence:
import nltk
from nltk.tokenize import word_tokenize
sent = 'The cat is under the cristal table.'
nltk.pos_tag(word_tokenize(sent), lang='spa')
It works fine:
[('The', 'DT'),
('cat', 'NN'),
('is', 'VBZ'),
('under', 'IN'),
('the', 'DT'),
('cristal', 'NN'),
('table', 'NN'),
('.', '.')]
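Note that I have already downloaded all the NLTK resources. Can you tell me what I am missing here, and why part-of-speech tagging does not work for Spanish?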
Answer 0 (score: 1)
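A likely explanation for the output in the question, offered here as an assumption rather than something stated in the thread: `nltk.pos_tag` only ships tagger models for English (`'eng'`) and Russian (`'rus'`). Older NLTK releases fall back to the English tagger for any other `lang` value, while newer ones raise `NotImplementedError`. The sketch below shows one way to fail loudly instead of getting English tags for Spanish text. The names `SUPPORTED_POS_TAG_LANGS` and `safe_pos_tag` are hypothetical helpers, not part of NLTK.

```python
# Assumption: nltk.pos_tag only has models for these two language codes.
SUPPORTED_POS_TAG_LANGS = {"eng", "rus"}


def safe_pos_tag(tokens, lang, tagger):
    """Call `tagger` only when `lang` is in the assumed supported set.

    `tagger` is any callable with the shape tagger(tokens, lang=...),
    for example nltk.pos_tag. It is passed in so this sketch runs
    without NLTK installed.
    """
    if lang not in SUPPORTED_POS_TAG_LANGS:
        raise ValueError(
            f"No POS model assumed for lang={lang!r}; "
            f"supported: {sorted(SUPPORTED_POS_TAG_LANGS)}"
        )
    return tagger(tokens, lang=lang)


# Usage with NLTK (commented out so the block has no external dependency):
# import nltk
# safe_pos_tag(nltk.word_tokenize("El gato está bajo la mesa."), "spa", nltk.pos_tag)
# would raise ValueError instead of returning English tags.
```

For Spanish, a tagger trained on Spanish data is needed, such as the Stanford tagger used below.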
import os
from nltk.tag import StanfordPOSTagger

# Paths to the Stanford POS tagger jar and its Spanish model
jar = 'D:/Downloads/stanford-postagger-full-2018-10-16/stanford-postagger-3.9.2.jar'
model = 'D:/Downloads/stanford-postagger-full-2018-10-16/models/spanish.tagger'

# Tell NLTK where to find the Java executable
java_path = "C:/Program Files/Java/jre1.8.0_191/bin/java.exe"
os.environ['JAVAHOME'] = java_path

pos_tagger = StanfordPOSTagger(model, jar, encoding='utf8')
pos_tagger.tag('El gato está bajo la mesa de cristal'.split())
Result:
[('El', 'da0000'),
('gato', 'nc0s000'),
('está', 'vmip000'),
('bajo', 'sp000'),
('la', 'da0000'),
('mesa', 'nc0s000'),
('de', 'sp000'),
('cristal', 'nc0s000')]
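The tags above are not Penn Treebank tags. They appear to follow the EAGLES-style scheme used for Spanish, in which the first letter gives the coarse category (`d` determiner, `n` noun, `v` verb, `s` adposition). As a minimal sketch under that assumption, the hypothetical helper `coarse_category` below maps each tag to a readable label; the mapping table is illustrative and not taken from the tagger's documentation.

```python
# Assumed mapping from the first letter of an EAGLES-style tag
# to a coarse part-of-speech label.
EAGLES_CATEGORY = {
    "a": "adjective",
    "c": "conjunction",
    "d": "determiner",
    "f": "punctuation",
    "i": "interjection",
    "n": "noun",
    "p": "pronoun",
    "r": "adverb",
    "s": "adposition",
    "v": "verb",
    "w": "date",
    "z": "number",
}


def coarse_category(tag):
    """Return a coarse label for a tag, or 'unknown' if unrecognized."""
    return EAGLES_CATEGORY.get(tag[:1].lower(), "unknown")


# Tagger output copied from the result above
tagged = [
    ("El", "da0000"), ("gato", "nc0s000"), ("está", "vmip000"),
    ("bajo", "sp000"), ("la", "da0000"), ("mesa", "nc0s000"),
    ("de", "sp000"), ("cristal", "nc0s000"),
]

for word, tag in tagged:
    print(word, tag, coarse_category(tag))
```

Under this reading, está (`vmip000`) comes out as a verb, which is what the question expected.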
Answer 1 (score: 0)
Try this:
import stanfordnlp
MODELS_DIR = '.'
stanfordnlp.download('es', MODELS_DIR)  # Download the Spanish models
# Build the pipeline and set the part-of-speech processor's batch size
nlp = stanfordnlp.Pipeline(processors='tokenize,pos', models_dir=MODELS_DIR,
                           treebank='es_ancora', use_gpu=True, pos_batch_size=3000)
doc = nlp("Tu frase en español.")  # Run the pipeline on the input text
doc.sentences[0].print_tokens()  # Look at the result