I am trying to get the leaf values of a Tree object as a string, where the tree object is the output of the Stanford parser.
Here is my code:
from nltk.parse import stanford
Parser = stanford.StanfordParser("path")
example = "Selected variables by univariate/multivariate analysis, constructed logistic regression, calibrated the low defaults portfolio to benchmark ratings, performed back"
sentences = Parser.raw_parse(example)
for line in sentences:
    for sentence in line:
        tree = sentence
This is how I extract the leaves of the VPs (verb phrases):
VP = []
VP_tree = list(tree.subtrees(filter=lambda x: x.label()=='VP'))
for i in VP_tree:
    VP.append(' '.join(i.flatten()))
This is what i.flatten() looks like (it returns the parsed words as a list):
(VP
  constructed
  logistic
  regression
  ,
  calibrated
  the
  low
  defaults
  portfolio
  to
  benchmark
  ratings)
Since I can only get them as a list of parsed words, I joined them with ' '. As a result, there is a space between 'regression' and ','.
In [33]: VP
Out[33]: [u'constructed logistic regression , calibrated the low defaults portfolio to benchmark ratings']
I would like to get the verb phrase as a single string (rather than a list of parsed words) without having to join it with ' '.
I have looked through the methods of the Tree class (http://www.nltk.org/_modules/nltk/tree.html), but no luck so far.
Answer 0 (score: 2)
In short:
Use the Tree.leaves() function to access the strings of the subtrees in a parsed sentence, i.e.:
VPs_str = [" ".join(vp.leaves()) for vp in list(parsed_sent.subtrees(filter=lambda x: x.label()=='VP'))]
There is no proper way to get the VP strings exactly as they appear in the input, because the Stanford parser tokenizes the text before parsing and the NLTK API does not keep the string offsets =(
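As a rough workaround (this is not part of the NLTK API), you can try to re-locate the tokens in the original input yourself. A minimal sketch, assuming the tokens appear verbatim in the input and differ only in spacing; find_original_span is a hypothetical helper, and it will break when the tokenizer normalizes tokens (e.g. brackets to -LRB-):

import re

def find_original_span(leaves, text):
    # Hypothetical helper: join the escaped tokens with optional-whitespace
    # gaps and search the untokenized input for the resulting pattern.
    pattern = r'\s*'.join(re.escape(tok) for tok in leaves)
    match = re.search(pattern, text)
    return match.group(0) if match else None

# e.g. recovers 'constructed logistic regression, calibrated ...' with the
# original spacing around the comma.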
In long:
The long answer is here so that other NLTK users can get to the Tree objects by using the NLTK API to the Stanford parser; it might not be as trivial as the question makes it look =)
First, set the environment variables for NLTK to access the Stanford tools.
TL;DR:
$ cd
$ wget http://nlp.stanford.edu/software/stanford-parser-full-2015-12-09.zip
$ unzip stanford-parser-full-2015-12-09.zip
$ export STANFORDTOOLSDIR=$HOME
$ export CLASSPATH=$STANFORDTOOLSDIR/stanford-parser-full-2015-12-09/stanford-parser.jar:$STANFORDTOOLSDIR/stanford-parser-full-2015-12-09/stanford-parser-3.6.0-models.jar
Then apply the hack for the 2015-12-09 Stanford Parser release (this hack will become obsolete with newer versions, see https://github.com/nltk/nltk/pull/1280/files):
>>> from nltk.internals import find_jars_within_path
>>> from nltk.parse.stanford import StanfordParser
>>> parser=StanfordParser(model_path="edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz")
>>> stanford_dir = parser._classpath[0].rpartition('/')[0]
>>> parser._classpath = tuple(find_jars_within_path(stanford_dir))
Now, on to the phrase extraction.
First, we parse the sentence:
>>> sent = "Selected variables by univariate/multivariate analysis, constructed logistic regression, calibrated the low defaults portfolio to benchmark ratings, performed back"
>>> parsed_sent = list(parser.raw_parse(sent))[0]
>>> parsed_sent
Tree('ROOT', [Tree('S', [Tree('NP', [Tree('NP', [Tree('JJ', ['Selected']), Tree('NNS', ['variables'])]), Tree('PP', [Tree('IN', ['by']), Tree('NP', [Tree('JJ', ['univariate/multivariate']), Tree('NN', ['analysis'])])]), Tree(',', [',']), Tree('VP', [Tree('VBN', ['constructed']), Tree('NP', [Tree('NP', [Tree('JJ', ['logistic']), Tree('NN', ['regression'])]), Tree(',', [',']), Tree('ADJP', [Tree('VBN', ['calibrated']), Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['low']), Tree('NNS', ['defaults']), Tree('NN', ['portfolio'])]), Tree('PP', [Tree('TO', ['to']), Tree('NP', [Tree('JJ', ['benchmark']), Tree('NNS', ['ratings'])])])])])])]), Tree(',', [','])]), Tree('VP', [Tree('VBD', ['performed']), Tree('ADVP', [Tree('RB', ['back'])])])])])
Then we traverse the tree and look for VPs, as you have done:
>>> VP_tree = list(tree.subtrees(filter=lambda x: x.label()=='VP'))
Then we simply use the subtree leaves to get the VPs:
>>> for vp in VPs:
... print " ".join(vp.leaves())
...
constructed logistic regression , calibrated the low defaults portfolio to benchmark ratings
performed back
So, to get the VP strings:
>>> VPs_str = [" ".join(vp.leaves()) for vp in list(parsed_sent.subtrees(filter=lambda x: x.label()=='VP'))]
>>> VPs_str
[u'constructed logistic regression , calibrated the low defaults portfolio to benchmark ratings', u'performed back']
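The same pattern generalizes to any phrase label; a small convenience sketch (the extract_phrases name is just for illustration, not an NLTK function):

def extract_phrases(tree, label):
    # Collect the space-joined token strings of every subtree
    # whose node label matches the requested one.
    return [' '.join(subtree.leaves())
            for subtree in tree.subtrees(filter=lambda t: t.label() == label)]

extract_phrases(parsed_sent, 'VP')  # the two VP strings above
extract_phrases(parsed_sent, 'NP')  # noun phrases, same mechanics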
Alternatively, I personally like to use a chunker instead of the full parser to extract phrases.
Using the nltk_cli tool (https://github.com/alvations/nltk_cli):
alvas@ubi:~/git/nltk_cli$ echo "Selected variables by univariate/multivariate analysis, constructed logistic regression, calibrated the low defaults portfolio to benchmark ratings, performed back" > input-doneyo.txt
alvas@ubi:~/git/nltk_cli$ python senna.py --chunk VP input-doneyo.txt
calibrated|to benchmark|performed
alvas@ubi:~/git/nltk_cli$ python senna.py --vp input-doneyo.txt
calibrated|to benchmark|performed
alvas@ubi:~/git/nltk_cli$ python senna.py --chunk2 VP+NP input-doneyo.txt
calibrated the low defaults portfolio|to benchmark ratings
The VP chunks in the output are separated by |, i.e. the output:
calibrated|to benchmark|performed
represents the three VP chunks.
The VP+NP chunk output is also |-separated, and within each pair the VP and the NP are separated by \t, i.e. the output:
calibrated the low defaults portfolio|to benchmark ratings
represents the (VP + NP) pairs.
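If you are consuming that output from Python rather than the shell, splitting it back apart is straightforward; a tiny sketch, assuming the | and \t separators described above:

# VP chunks are '|'-separated:
vp_line = 'calibrated|to benchmark|performed'
vps = vp_line.split('|')            # ['calibrated', 'to benchmark', 'performed']

# In the VP+NP output, pairs are '|'-separated and each pair is tab-separated:
pair_line = 'calibrated the low defaults portfolio\tto benchmark ratings'
verb_part, noun_part = pair_line.split('\t')  # one (VP, NP) pair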
Answer 1 (score: 2)
To retrieve strings according to their position in the input, you should consider using https://github.com/smilli/py-corenlp to access the Stanford tools, instead of the NLTK API.
First, you have to download, install and set up Stanford CoreNLP, see http://stanfordnlp.github.io/CoreNLP/corenlp-server.html#getting-started
Then install the Python wrapper for CoreNLP, https://github.com/smilli/py-corenlp
Then, after starting the server (many people miss this step!), you can do this in Python:
>>> from pycorenlp import StanfordCoreNLP
>>> stanford = StanfordCoreNLP('http://localhost:9000')
>>> text = ("Selected variables by univariate/multivariate analysis, constructed logistic regression, calibrated the low defaults portfolio to benchmark ratings, performed back")
>>> output = stanford.annotate(text, properties={'annotators': 'tokenize,ssplit,pos,depparse,parse', 'outputFormat': 'json'})
>>> print(output['sentences'][0]['parse'])
(ROOT
  (SINV
    (VP (VBN Selected)
      (NP (NNS variables))
      (PP (IN by)
        (NP
          (NP (JJ univariate/multivariate) (NN analysis))
          (, ,)
          (VP (VBN constructed)
            (NP (JJ logistic) (NN regression)))
          (, ,))))
    (VP (VBD calibrated))
    (NP
      (NP
        (NP (DT the) (JJ low) (NNS defaults) (NN portfolio))
        (PP (TO to)
          (NP (JJ benchmark) (NNS ratings))))
      (, ,)
      (VP (VBN performed)
        (ADVP (RB back))))))
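If you still want an NLTK Tree to traverse, you can read the bracketed parse back in with nltk's Tree.fromstring; a small sketch, assuming nltk is installed alongside pycorenlp:

from nltk import Tree

ptree = Tree.fromstring(output['sentences'][0]['parse'])
# Same mechanics as before: join the leaves of every VP subtree.
vps = [' '.join(vp.leaves())
       for vp in ptree.subtrees(filter=lambda t: t.label() == 'VP')]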
To retrieve the VP strings according to the input string, you have to traverse the JSON output using characterOffsetBegin and characterOffsetEnd:
>>> output['sentences'][0]
{u'tokens': [{u'index': 1, u'word': u'Selected', u'after': u' ', u'pos': u'VBN', u'characterOffsetEnd': 8, u'characterOffsetBegin': 0, u'originalText': u'Selected', u'before': u''}, {u'index': 2, u'word': u'variables', u'after': u' ', u'pos': u'NNS', u'characterOffsetEnd': 18, u'characterOffsetBegin': 9, u'originalText': u'variables', u'before': u' '}, {u'index': 3, u'word': u'by', u'after': u' ', u'pos': u'IN', u'characterOffsetEnd': 21, u'characterOffsetBegin': 19, u'originalText': u'by', u'before': u' '}, {u'index': 4, u'word': u'univariate/multivariate', u'after': u' ', u'pos': u'JJ', u'characterOffsetEnd': 45, u'characterOffsetBegin': 22, u'originalText': u'univariate/multivariate', u'before': u' '}, {u'index': 5, u'word': u'analysis', u'after': u'', u'pos': u'NN', u'characterOffsetEnd': 54, u'characterOffsetBegin': 46, u'originalText': u'analysis', u'before': u' '}, {u'index': 6, u'word': u',', u'after': u' ', u'pos': u',', u'characterOffsetEnd': 55, u'characterOffsetBegin': 54, u'originalText': u',', u'before': u''}, {u'index': 7, u'word': u'constructed', u'after': u' ', u'pos': u'VBN', u'characterOffsetEnd': 67, u'characterOffsetBegin': 56, u'originalText': u'constructed', u'before': u' '}, {u'index': 8, u'word': u'logistic', u'after': u' ', u'pos': u'JJ', u'characterOffsetEnd': 76, u'characterOffsetBegin': 68, u'originalText': u'logistic', u'before': u' '}, {u'index': 9, u'word': u'regression', u'after': u'', u'pos': u'NN', u'characterOffsetEnd': 87, u'characterOffsetBegin': 77, u'originalText': u'regression', u'before': u' '}, {u'index': 10, u'word': u',', u'after': u' ', u'pos': u',', u'characterOffsetEnd': 88, u'characterOffsetBegin': 87, u'originalText': u',', u'before': u''}, {u'index': 11, u'word': u'calibrated', u'after': u' ', u'pos': u'VBD', u'characterOffsetEnd': 99, u'characterOffsetBegin': 89, u'originalText': u'calibrated', u'before': u' '}, {u'index': 12, u'word': u'the', u'after': u' ', u'pos': u'DT', u'characterOffsetEnd': 103, u'characterOffsetBegin': 100, u'originalText': u'the', u'before': u' '}, {u'index': 13, u'word': u'low', u'after': u' ', u'pos': u'JJ', u'characterOffsetEnd': 107, u'characterOffsetBegin': 104, u'originalText': u'low', u'before': u' '}, {u'index': 14, u'word': u'defaults', u'after': u' ', u'pos': u'NNS', u'characterOffsetEnd': 116, u'characterOffsetBegin': 108, u'originalText': u'defaults', u'before': u' '}, {u'index': 15, u'word': u'portfolio', u'after': u' ', u'pos': u'NN', u'characterOffsetEnd': 126, u'characterOffsetBegin': 117, u'originalText': u'portfolio', u'before': u' '}, {u'index': 16, u'word': u'to', u'after': u' ', u'pos': u'TO', u'characterOffsetEnd': 129, u'characterOffsetBegin': 127, u'originalText': u'to', u'before': u' '}, {u'index': 17, u'word': u'benchmark', u'after': u' ', u'pos': u'JJ', u'characterOffsetEnd': 139, u'characterOffsetBegin': 130, u'originalText': u'benchmark', u'before': u' '}, {u'index': 18, u'word': u'ratings', u'after': u'', u'pos': u'NNS', u'characterOffsetEnd': 147, u'characterOffsetBegin': 140, u'originalText': u'ratings', u'before': u' '}, {u'index': 19, u'word': u',', u'after': u' ', u'pos': u',', u'characterOffsetEnd': 148, u'characterOffsetBegin': 147, u'originalText': u',', u'before': u''}, {u'index': 20, u'word': u'performed', u'after': u' ', u'pos': u'VBN', u'characterOffsetEnd': 158, u'characterOffsetBegin': 149, u'originalText': u'performed', u'before': u' '}, {u'index': 21, u'word': u'back', u'after': u'', u'pos': u'RB', u'characterOffsetEnd': 163, u'characterOffsetBegin': 159, 
u'originalText': u'back', u'before': u' '}], u'index': 0, u'basic-dependencies': [{u'dep': u'ROOT', u'dependent': 1, u'governorGloss': u'ROOT', u'governor': 0, u'dependentGloss': u'Selected'}, {u'dep': u'dobj', u'dependent': 2, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'variables'}, {u'dep': u'case', u'dependent': 3, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u'by'}, {u'dep': u'amod', u'dependent': 4, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u'univariate/multivariate'}, {u'dep': u'nmod', u'dependent': 5, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'analysis'}, {u'dep': u'punct', u'dependent': 6, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u','}, {u'dep': u'acl', u'dependent': 7, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u'constructed'}, {u'dep': u'amod', u'dependent': 8, u'governorGloss': u'regression', u'governor': 9, u'dependentGloss': u'logistic'}, {u'dep': u'dobj', u'dependent': 9, u'governorGloss': u'constructed', u'governor': 7, u'dependentGloss': u'regression'}, {u'dep': u'punct', u'dependent': 10, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u','}, {u'dep': u'dep', u'dependent': 11, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'calibrated'}, {u'dep': u'det', u'dependent': 12, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'the'}, {u'dep': u'amod', u'dependent': 13, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'low'}, {u'dep': u'compound', u'dependent': 14, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'defaults'}, {u'dep': u'nsubj', u'dependent': 15, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'portfolio'}, {u'dep': u'case', u'dependent': 16, u'governorGloss': u'ratings', u'governor': 18, u'dependentGloss': u'to'}, {u'dep': u'amod', u'dependent': 17, u'governorGloss': u'ratings', u'governor': 18, u'dependentGloss': u'benchmark'}, {u'dep': u'nmod', u'dependent': 18, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'ratings'}, {u'dep': u'punct', u'dependent': 19, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u','}, {u'dep': u'acl', u'dependent': 20, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'performed'}, {u'dep': u'advmod', u'dependent': 21, u'governorGloss': u'performed', u'governor': 20, u'dependentGloss': u'back'}], u'parse': u'(ROOT\n (SINV\n (VP (VBN Selected)\n (NP (NNS variables))\n (PP (IN by)\n (NP\n (NP (JJ univariate/multivariate) (NN analysis))\n (, ,)\n (VP (VBN constructed)\n (NP (JJ logistic) (NN regression)))\n (, ,))))\n (VP (VBD calibrated))\n (NP\n (NP\n (NP (DT the) (JJ low) (NNS defaults) (NN portfolio))\n (PP (TO to)\n (NP (JJ benchmark) (NNS ratings))))\n (, ,)\n (VP (VBN performed)\n (ADVP (RB back))))))', u'collapsed-dependencies': [{u'dep': u'ROOT', u'dependent': 1, u'governorGloss': u'ROOT', u'governor': 0, u'dependentGloss': u'Selected'}, {u'dep': u'dobj', u'dependent': 2, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'variables'}, {u'dep': u'case', u'dependent': 3, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u'by'}, {u'dep': u'amod', u'dependent': 4, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u'univariate/multivariate'}, {u'dep': u'nmod:by', u'dependent': 5, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'analysis'}, {u'dep': 
u'punct', u'dependent': 6, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u','}, {u'dep': u'acl', u'dependent': 7, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u'constructed'}, {u'dep': u'amod', u'dependent': 8, u'governorGloss': u'regression', u'governor': 9, u'dependentGloss': u'logistic'}, {u'dep': u'dobj', u'dependent': 9, u'governorGloss': u'constructed', u'governor': 7, u'dependentGloss': u'regression'}, {u'dep': u'punct', u'dependent': 10, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u','}, {u'dep': u'dep', u'dependent': 11, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'calibrated'}, {u'dep': u'det', u'dependent': 12, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'the'}, {u'dep': u'amod', u'dependent': 13, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'low'}, {u'dep': u'compound', u'dependent': 14, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'defaults'}, {u'dep': u'nsubj', u'dependent': 15, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'portfolio'}, {u'dep': u'case', u'dependent': 16, u'governorGloss': u'ratings', u'governor': 18, u'dependentGloss': u'to'}, {u'dep': u'amod', u'dependent': 17, u'governorGloss': u'ratings', u'governor': 18, u'dependentGloss': u'benchmark'}, {u'dep': u'nmod:to', u'dependent': 18, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'ratings'}, {u'dep': u'punct', u'dependent': 19, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u','}, {u'dep': u'acl', u'dependent': 20, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'performed'}, {u'dep': u'advmod', u'dependent': 21, u'governorGloss': u'performed', u'governor': 20, u'dependentGloss': u'back'}], u'collapsed-ccprocessed-dependencies': [{u'dep': u'ROOT', u'dependent': 1, u'governorGloss': u'ROOT', u'governor': 0, u'dependentGloss': u'Selected'}, {u'dep': u'dobj', u'dependent': 2, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'variables'}, {u'dep': u'case', u'dependent': 3, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u'by'}, {u'dep': u'amod', u'dependent': 4, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u'univariate/multivariate'}, {u'dep': u'nmod:by', u'dependent': 5, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'analysis'}, {u'dep': u'punct', u'dependent': 6, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u','}, {u'dep': u'acl', u'dependent': 7, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u'constructed'}, {u'dep': u'amod', u'dependent': 8, u'governorGloss': u'regression', u'governor': 9, u'dependentGloss': u'logistic'}, {u'dep': u'dobj', u'dependent': 9, u'governorGloss': u'constructed', u'governor': 7, u'dependentGloss': u'regression'}, {u'dep': u'punct', u'dependent': 10, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u','}, {u'dep': u'dep', u'dependent': 11, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'calibrated'}, {u'dep': u'det', u'dependent': 12, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'the'}, {u'dep': u'amod', u'dependent': 13, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'low'}, {u'dep': u'compound', u'dependent': 14, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'defaults'}, {u'dep': u'nsubj', u'dependent': 15, 
u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'portfolio'}, {u'dep': u'case', u'dependent': 16, u'governorGloss': u'ratings', u'governor': 18, u'dependentGloss': u'to'}, {u'dep': u'amod', u'dependent': 17, u'governorGloss': u'ratings', u'governor': 18, u'dependentGloss': u'benchmark'}, {u'dep': u'nmod:to', u'dependent': 18, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'ratings'}, {u'dep': u'punct', u'dependent': 19, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u','}, {u'dep': u'acl', u'dependent': 20, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'performed'}, {u'dep': u'advmod', u'dependent': 21, u'governorGloss': u'performed', u'governor': 20, u'dependentGloss': u'back'}]}
But since the parse tree is not directly linked to the offsets, it does not seem to be an output that is easy to parse for character offsets. Only the dependency triples contain word IDs that link back to the offsets.
To access the tokens and their 'before' and 'after' keys, use output['sentences'][0]['tokens'] (which, sadly, is also not directly linked to the parse tree):
>>> tokens = output['sentences'][0]['tokens']
>>> tokens
[{u'index': 1, u'word': u'Selected', u'after': u' ', u'pos': u'VBN', u'characterOffsetEnd': 8, u'characterOffsetBegin': 0, u'originalText': u'Selected', u'before': u''}, {u'index': 2, u'word': u'variables', u'after': u' ', u'pos': u'NNS', u'characterOffsetEnd': 18, u'characterOffsetBegin': 9, u'originalText': u'variables', u'before': u' '}, {u'index': 3, u'word': u'by', u'after': u' ', u'pos': u'IN', u'characterOffsetEnd': 21, u'characterOffsetBegin': 19, u'originalText': u'by', u'before': u' '}, {u'index': 4, u'word': u'univariate/multivariate', u'after': u' ', u'pos': u'JJ', u'characterOffsetEnd': 45, u'characterOffsetBegin': 22, u'originalText': u'univariate/multivariate', u'before': u' '}, {u'index': 5, u'word': u'analysis', u'after': u'', u'pos': u'NN', u'characterOffsetEnd': 54, u'characterOffsetBegin': 46, u'originalText': u'analysis', u'before': u' '}, {u'index': 6, u'word': u',', u'after': u' ', u'pos': u',', u'characterOffsetEnd': 55, u'characterOffsetBegin': 54, u'originalText': u',', u'before': u''}, {u'index': 7, u'word': u'constructed', u'after': u' ', u'pos': u'VBN', u'characterOffsetEnd': 67, u'characterOffsetBegin': 56, u'originalText': u'constructed', u'before': u' '}, {u'index': 8, u'word': u'logistic', u'after': u' ', u'pos': u'JJ', u'characterOffsetEnd': 76, u'characterOffsetBegin': 68, u'originalText': u'logistic', u'before': u' '}, {u'index': 9, u'word': u'regression', u'after': u'', u'pos': u'NN', u'characterOffsetEnd': 87, u'characterOffsetBegin': 77, u'originalText': u'regression', u'before': u' '}, {u'index': 10, u'word': u',', u'after': u' ', u'pos': u',', u'characterOffsetEnd': 88, u'characterOffsetBegin': 87, u'originalText': u',', u'before': u''}, {u'index': 11, u'word': u'calibrated', u'after': u' ', u'pos': u'VBD', u'characterOffsetEnd': 99, u'characterOffsetBegin': 89, u'originalText': u'calibrated', u'before': u' '}, {u'index': 12, u'word': u'the', u'after': u' ', u'pos': u'DT', u'characterOffsetEnd': 103, u'characterOffsetBegin': 100, u'originalText': u'the', u'before': u' '}, {u'index': 13, u'word': u'low', u'after': u' ', u'pos': u'JJ', u'characterOffsetEnd': 107, u'characterOffsetBegin': 104, u'originalText': u'low', u'before': u' '}, {u'index': 14, u'word': u'defaults', u'after': u' ', u'pos': u'NNS', u'characterOffsetEnd': 116, u'characterOffsetBegin': 108, u'originalText': u'defaults', u'before': u' '}, {u'index': 15, u'word': u'portfolio', u'after': u' ', u'pos': u'NN', u'characterOffsetEnd': 126, u'characterOffsetBegin': 117, u'originalText': u'portfolio', u'before': u' '}, {u'index': 16, u'word': u'to', u'after': u' ', u'pos': u'TO', u'characterOffsetEnd': 129, u'characterOffsetBegin': 127, u'originalText': u'to', u'before': u' '}, {u'index': 17, u'word': u'benchmark', u'after': u' ', u'pos': u'JJ', u'characterOffsetEnd': 139, u'characterOffsetBegin': 130, u'originalText': u'benchmark', u'before': u' '}, {u'index': 18, u'word': u'ratings', u'after': u'', u'pos': u'NNS', u'characterOffsetEnd': 147, u'characterOffsetBegin': 140, u'originalText': u'ratings', u'before': u' '}, {u'index': 19, u'word': u',', u'after': u' ', u'pos': u',', u'characterOffsetEnd': 148, u'characterOffsetBegin': 147, u'originalText': u',', u'before': u''}, {u'index': 20, u'word': u'performed', u'after': u' ', u'pos': u'VBN', u'characterOffsetEnd': 158, u'characterOffsetBegin': 149, u'originalText': u'performed', u'before': u' '}, {u'index': 21, u'word': u'back', u'after': u'', u'pos': u'RB', u'characterOffsetEnd': 163, u'characterOffsetBegin': 159, u'originalText': u'back', 
u'before': u' '}]
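Since each token carries characterOffsetBegin/characterOffsetEnd, you can at least slice the exact surface string for any token span out of the original input. A minimal sketch (span_string is a hypothetical helper; mapping a parse-tree VP onto a token index range is still left to you):

def span_string(text, tokens, first, last):
    # Slice the original input using the character offsets of an
    # inclusive token index range (CoreNLP token indices are 1-based).
    begin = tokens[first - 1]['characterOffsetBegin']
    end = tokens[last - 1]['characterOffsetEnd']
    return text[begin:end]

# Tokens 7..9 cover 'constructed logistic regression' with its original
# spacing (no space before the comma that follows it).
print(span_string(text, tokens, 7, 9))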
Answer 2 (score: 1)
Unrelated to NLTK or StanfordParser, one way to get naturally readable text back is to "detokenize" the output with the detokenizer script from Moses SMT (https://github.com/moses-smt/mosesdecoder), e.g.:
alvas@ubi:~$ wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/tokenizer/detokenizer.perl
--2016-02-13 21:27:12-- https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/tokenizer/detokenizer.perl
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 23.235.43.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|23.235.43.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12473 (12K) [text/plain]
Saving to: ‘detokenizer.perl’
100%[===============================================================================================================================>] 12,473 --.-K/s in 0s
2016-02-13 21:27:12 (150 MB/s) - ‘detokenizer.perl’ saved [12473/12473]
alvas@ubi:~$ echo "constructed logistic regression , calibrated the low defaults portfolio to benchmark ratings" | perl detokenizer.perl -l en 2> /tmp/null
constructed logistic regression, calibrated the low defaults portfolio to benchmark ratings
Note that the output MIGHT be different from the input, but for English it will, most of the time, be converted back into the normal text we read/write.
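To call the same script from Python, a minimal subprocess sketch (assumes detokenizer.perl sits in the working directory, perl is on the PATH, and Python 3 is used):

import subprocess

def detokenize(text, lang='en'):
    # Pipe tokenized text through the Moses detokenizer script,
    # discarding its version banner on stderr.
    result = subprocess.run(
        ['perl', 'detokenizer.perl', '-l', lang],
        input=text.encode('utf-8'),
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
    )
    return result.stdout.decode('utf-8').strip()

detokenize('constructed logistic regression , calibrated the low defaults portfolio to benchmark ratings')
# -> 'constructed logistic regression, calibrated the low defaults portfolio to benchmark ratings'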
I am working on getting a detokenizer into NLTK, but it will take us a while to code it, test it and push it to the repository, so we ask for your patience (see https://github.com/moses-smt/mosesdecoder)