NLTK tags parentheses incorrectly

Date: 2013-09-13 21:09:39

Tags: tags nltk parentheses

I am POS-tagging text in order to find the nouns and adjectives:

text = u"""Developed at the Vaccine and Gene Therapy Institute at the Oregon Health and Science University (OHSU), the vaccine proved successful in about fifty percent of the subjects tested and could lead to a human vaccine preventing the onset of HIV/AIDS and even cure patients currently on anti-retroviral drugs."""
nltk.pos_tag(nltk.word_tokenize(text))

This results in:

  

[('Developed', 'NNP'), ('at', 'IN'), ('the', 'DT'), ('Vaccine', 'NNP'),
 ('and', 'CC'), ('Gene', 'NNP'), ('Therapy', 'NNP'), ('Institute', 'NNP'),
 ('at', 'IN'), ('the', 'DT'), ('Oregon', 'NNP'), ('Health', 'NNP'),
 ('and', 'CC'), ('Science', 'NNP'), ('University', 'NNP'), ('(', 'NNP'),
 ('OHSU', 'NNP'), (')', 'NNP'), (',', ','), ('the', 'DT'), ('vaccine', 'NN'),
 ('proved', 'VBD'), ('successful', 'JJ'), ('in', 'IN'), ('about', 'IN'),
 ('fifty', 'JJ'), ('percent', 'NN'), ('of', 'IN'), ('the', 'DT'),
 ('subjects', 'NNS'), ('tested', 'VBD'), ('and', 'CC'), ('could', 'MD'),
 ('lead', 'VB'), ('to', 'TO'), ('a', 'DT'), ('human', 'NN'), ('vaccine', 'NN'),
 ('preventing', 'VBG'), ('the', 'DT'), ('onset', 'NN'), ('of', 'IN'),
 ('HIV/AIDS', 'NNS'), ('and', 'CC'), ('even', 'RB'), ('cure', 'NN'),
 ('patients', 'NNS'), ('currently', 'RB'), ('on', 'IN'),
 ('anti-retroviral', 'JJ'), ('drugs', 'NNS'), ('.', '.')]

Note that both parentheses are tagged 'NNP', so they show up as proper nouns.

Is there a built-in way to get parentheses detected correctly when tagging a sentence?

1 answer:

Answer 0: (score: 2)

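If you know which tag you want returned for parentheses, you can use a RegexpTagger to match the parentheses and fall back to your preferred tagger for every other token.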

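import nltk
from nltk.data import load

# Load the pickled tagger that nltk.pos_tag uses
_POS_TAGGER = 'taggers/maxent_treebank_pos_tagger/english.pickle'
tagger = load(_POS_TAGGER)  # same tagger as using nltk.pos_tag

# Tokens matching the regexp get the tag '--'; all other tokens go to the backoff tagger
regexp_tagger = nltk.tag.RegexpTagger([(r'\(|\)', '--')], backoff=tagger)

# `text` is the string defined in the question
regexp_tagger.tag(nltk.word_tokenize(text))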

Result:

  

[(u'Developed', 'NNP'), (u'at', 'IN'), (u'the', 'DT'), (u'Vaccine', 'NNP'),
 (u'and', 'CC'), (u'Gene', 'NNP'), (u'Therapy', 'NNP'), (u'Institute', 'NNP'),
 (u'at', 'IN'), (u'the', 'DT'), (u'Oregon', 'NNP'), (u'Health', 'NNP'),
 (u'and', 'CC'), (u'Science', 'NNP'), (u'University', 'NNP'), (u'(', '--'),
 (u'OHSU', 'NNP'), (u')', '--'), (u',', ','), (u'the', 'DT'),
 (u'vaccine', 'NN'), (u'proved', 'VBD'), (u'successful', 'JJ'), (u'in', 'IN'),
 (u'about', 'IN'), (u'fifty', 'JJ'), (u'percent', 'NN'), (u'of', 'IN'),
 (u'the', 'DT'), (u'subjects', 'NNS'), (u'tested', 'VBD'), (u'and', 'CC'),
 (u'could', 'MD'), (u'lead', 'VB'), (u'to', 'TO'), (u'a', 'DT'),
 (u'human', 'NN'), (u'vaccine', 'NN'), (u'preventing', 'VBG'), (u'the', 'DT'),
 (u'onset', 'NN'), (u'of', 'IN'), (u'HIV/AIDS', 'NNS'), (u'and', 'CC'),
 (u'even', 'RB'), (u'cure', 'NN'), (u'patients', 'NNS'), (u'currently', 'RB'),
 (u'on', 'IN'), (u'anti-retroviral', 'JJ'), (u'drugs', 'NNS'), (u'.', '.')]
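
If you would rather not wrap the tagger at all, a different route is to correct the tags after the fact. The following is only a minimal sketch of that post-processing idea, not the RegexpTagger approach itself: `retag_brackets` and `BRACKET_TAGS` are hypothetical names that are not part of NLTK, and the `'--'` tag is the same placeholder used above. It works on the list of `(token, tag)` pairs that any tagger returns, so it needs no NLTK data to try out.

```python
# Hypothetical helper, not part of NLTK: override the tags of bracket tokens
# in the (token, tag) pairs returned by any tagger.
BRACKET_TAGS = {"(": "--", ")": "--"}  # placeholder tag, same as in the answer


def retag_brackets(tagged, bracket_tags=BRACKET_TAGS):
    """Return a new list where bracket tokens carry the tag from bracket_tags."""
    return [(token, bracket_tags.get(token, tag)) for token, tag in tagged]


# A slice of the mis-tagged output from the question.
tagged = [("University", "NNP"), ("(", "NNP"), ("OHSU", "NNP"),
          (")", "NNP"), (",", ",")]
fixed = retag_brackets(tagged)

# Keep only nouns (NN*) and adjectives (JJ*), which is the question's goal.
nouns_and_adjectives = [tok for tok, tag in fixed
                        if tag.startswith(("NN", "JJ"))]
print(fixed)
print(nouns_and_adjectives)
```

Once the parentheses carry a tag that starts with neither `NN` nor `JJ`, they drop out of the noun and adjective filter. The trade-off is that the tagger still sees the parentheses as ordinary tokens while tagging, so this only fixes their own tags and not any effect they may have had on neighbouring tokens.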