NLTK: from a string to a tree with "slash" word/POS leaves?

Time: 2019-08-10 17:38:15

Tags: python tree nltk

The pretty-printer of the nltk.Tree class prints a tree in the following format:

print spacy2tree(nlp(u'Williams is a defensive coach') )
(S
  (SUBJ Williams/NNP)
  (PRED is/VBZ test/VBN)
  a/DT
  defensive/JJ
  coach/NN)

As a Tree object, the leaves are (word, tag) tuples:

 spacy2tree(nlp(u'Williams is a defensive coach') )
 Tree('S', [Tree('SUBJ', [(u'Williams', u'NNP')]), 
    Tree('PRED', [(u'is', u'VBZ'), ('test', 'VBN')]), (u'a', u'DT'), (u'defensive', u'JJ'), (u'coach', u'NN')])

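But Tree.fromstring does not read the printed form back correctly. Each leaf comes back as a single 'word/TAG' string instead of a tuple: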

tfs =  spacy2tree(nlp(u'Williams is a defensive coach') ).pformat()

Tree.fromstring(tfs)
Tree('S', [Tree('SUBJ', ['Williams/NNP']), 
   Tree('PRED', ['is/VBZ', 'test/VBN']), 'a/DT', 'defensive/JJ', 'coach/NN'])

Example:

                  correct                                      incorrect
('SUBJ', [(u'Williams', u'NNP')])             =vs=>  ('SUBJ', ['Williams/NNP'])
('PRED', [(u'is', u'VBZ'), ('test', 'VBN')])  =vs=>  ('PRED', ['is/VBZ', 'test/VBN'])

Is there a utility that reads the tree back from the string correctly?

1 answer:

Answer 0 (score: 0)

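It seems I figured it out. Tree.fromstring accepts a read_leaf callback that converts each leaf string, so splitting on the slash restores the tuples: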

Tree.fromstring(tfs, read_leaf=lambda s: tuple(s.split('/')))
Tree('S', [Tree('SUBJ', [(u'Williams', u'NNP')]),
    Tree('PRED', [(u'is', u'VBZ'), (u'test', u'VBN')]), (u'a', u'DT'), (u'defensive', u'JJ'), (u'coach', u'NN')])

So now this works as well:

tree2conlltags(Tree.fromstring(tfs, read_leaf=lambda s: tuple(s.split('/'))))
 [(u'Williams', u'NNP', u'B-SUBJ'),
  (u'is', u'VBZ', u'B-PRED'),
  (u'test', u'VBN', u'I-PRED'),
  (u'a', u'DT', u'O'),
  (u'defensive', u'JJ', u'O'),
  (u'coach', u'NN', u'O')]
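
One caveat with s.split('/'): a token that itself contains a slash, such as 1/2/CD, splits into three parts instead of two. Below is a minimal sketch of a leaf reader that splits only on the last slash. The names read_tagged_leaf and parse_tagged_tree are made up for illustration and are not part of NLTK. The sketch assumes Python 3 and that every leaf is printed as word/TAG.

```python
def read_tagged_leaf(token, sep="/"):
    """Turn 'word/TAG' into (word, TAG), splitting on the last separator only."""
    word, found, tag = token.rpartition(sep)
    if not found:
        # No separator at all: keep the word and leave the tag unset.
        return (token, None)
    return (word, tag)


def parse_tagged_tree(text):
    """Parse a bracketed tree string whose leaves look like 'word/TAG'."""
    # Imported here so that read_tagged_leaf itself has no dependencies.
    from nltk import Tree
    return Tree.fromstring(text, read_leaf=read_tagged_leaf)


print(read_tagged_leaf("Williams/NNP"))  # ('Williams', 'NNP')
print(read_tagged_leaf("1/2/CD"))        # ('1/2', 'CD')
```

For the sentence above, parse_tagged_tree(tfs) should give the same tree as the lambda version, since none of its tokens contains an extra slash. NLTK also ships nltk.tag.str2tuple, which likewise splits on the last separator, but note that it uppercases the tag.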