The following applies to NLTK 3.2 with stanford-parser-full-2015-12-09, running under Python 2.7.6 (and JDK 8) on Ubuntu 14.04 LTS. First, a little background...
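I want to keep punctuation in the output of StanfordDependencyParser, so I tried corenlp_options='-keepPunct', which did not work. I then found that when running java from the command line, the way to do this is -outputFormatOptions "includePunctuationDependencies".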
from nltk.parse.stanford import StanfordDependencyParser as SDP
dp = SDP(corenlp_options='-outputFormatOptions includePunctuationDependencies')
But when I pass that to corenlp_options, everything looks fine until I actually try to parse something, and then I get an OSError:
print [parse.tree() for parse in dp.raw_parse('The quick brown fox jumps over the lazy dog.')]
WARNING! lexparser.Options: Unknown option ignored: -outputFormatOptions includePunctuationDependencies
[main] INFO edu.stanford.nlp.parser.lexparser.LexicalizedParser - Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ...
done [0.4 sec].
Error loading parser, exiting...
Exception in thread "main" java.lang.IllegalArgumentException: Unknown option: -outputFormatOptions includePunctuationDependencies
    at edu.stanford.nlp.parser.lexparser.Options.setOption(Options.java:175)
    at edu.stanford.nlp.parser.lexparser.Options.setOptions(Options.java:68)
    at edu.stanford.nlp.parser.lexparser.Options.setOptions(Options.java:49)
    at edu.stanford.nlp.parser.lexparser.LexicalizedParser.setOptionFlags(LexicalizedParser.java:1007)
    at edu.stanford.nlp.parser.lexparser.LexicalizedParser.loadModel(LexicalizedParser.java:188)
    at edu.stanford.nlp.parser.lexparser.LexicalizedParser.main(LexicalizedParser.java:1412)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/nltk/parse/stanford.py", line 132, in raw_parse
    return next(self.raw_parse_sents([sentence], verbose))
  File "/usr/local/lib/python2.7/dist-packages/nltk/parse/stanford.py", line 150, in raw_parse_sents
    return self._parse_trees_output(self._execute(cmd, '\n'.join(sentences), verbose))
  File "/usr/local/lib/python2.7/dist-packages/nltk/parse/stanford.py", line 216, in _execute
    stdout=PIPE, stderr=PIPE)
  File "/usr/local/lib/python2.7/dist-packages/nltk/internals.py", line 134, in java
    raise OSError('Java command failed : ' + str(cmd))
OSError: Java command failed : ['/usr/lib/jvm/java-8-oracle/bin/java', u'-mx1000m', '-cp', '/home/dbl/stanford/stanford-english-corenlp-2016-10-31-models.jar:/home/dbl/stanford/stanford-parser-full-2015-12-09/stanford-parser-3.6.0-sources.jar:/home/dbl/stanford/stanford-parser-full-2015-12-09/slf4j-api.jar:/home/dbl/stanford/stanford-parser-full-2015-12-09/stanford-parser-3.6.0-models.jar:/home/dbl/stanford/stanford-parser-full-2015-12-09/stanford-parser.jar:/home/dbl/stanford/stanford-parser-full-2015-12-09/ejml-0.23.jar:/home/dbl/stanford/stanford-parser-full-2015-12-09/slf4j-simple.jar:/home/dbl/stanford/stanford-parser-full-2015-12-09/stanford-parser-3.6.0-javadoc.jar', u'edu.stanford.nlp.parser.lexparser.LexicalizedParser', u'-model', u'edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz', u'-sentences', u'newline', u'-outputFormat', u'conll2007', u'-encoding', u'utf8', '-outputFormatOptions includePunctuationDependencies', '/tmp/tmpbJ349q']
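Of course, if I join that list with spaces and paste it at a shell prompt, it runs fine. The problem is that NLTK's java function uses Popen, which is not happy with the space inside corenlp_options. Short of rewriting the handling of corenlp_options so that the cmd list is extended with the result of splitting the string (appending a single string that contains a space is what breaks Popen), do I have any good options?
Here is the relevant snippet from nltk.parse.stanford.GenericStanfordParser (which the dependency parser inherits from):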
    def _execute(self, cmd, input_, verbose=False):
        encoding = self._encoding
        cmd.extend(['-encoding', encoding])
        if self.corenlp_options:
            cmd.append(self.corenlp_options)
        ...
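The split-based rewrite mentioned in the question could look roughly like the sketch below. It is not NLTK code: `extend_with_options` is a hypothetical standalone helper that mirrors the snippet above, and the base command list is shortened for illustration. It assumes the options string uses shell-style quoting, which `shlex.split` understands.

```python
import shlex


def extend_with_options(cmd, encoding, corenlp_options=""):
    """Hypothetical helper: return a copy of cmd with the encoding flag
    and the individual words of corenlp_options appended."""
    cmd = list(cmd)  # copy so the caller's list is not mutated
    cmd.extend(["-encoding", encoding])
    if corenlp_options:
        # Split like a shell would, so each flag and each value becomes
        # its own list element instead of one string containing a space.
        cmd.extend(shlex.split(corenlp_options))
    return cmd


# Shortened, illustrative base command (not the full command from the log).
base = ["java", "edu.stanford.nlp.parser.lexparser.LexicalizedParser",
        "-outputFormat", "conll2007"]
full = extend_with_options(
    base, "utf8", '-outputFormatOptions "includePunctuationDependencies"')
print(full[-4:])
```

Using `extend` with the split result rather than `append` with the raw string is the whole change; quoted values such as `"includePunctuationDependencies"` lose their quotes during the split, just as they would in a shell.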
Answer 0 (score: 1)
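Pass the option and its value as separate elements of the argument list, just like all the other options in the command you found, for example: ..., u'-encoding', u'utf8', .... So just write '-outputFormatOptions', 'includePunctuationDependencies' and it should work.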
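The difference between one list element and two can be checked without Java or NLTK. The sketch below is only an illustration: it launches a child Python process (standing in for the Java parser) that reports the arguments it received. The names `CHILD` and `argv_seen_by_child` are made up for this example.

```python
from __future__ import print_function

import json
import subprocess
import sys

# Child program: print its own arguments as JSON so they can be inspected.
CHILD = "import json, sys; print(json.dumps(sys.argv[1:]))"


def argv_seen_by_child(args):
    """Run a child Python process with args and return the argv it received."""
    out = subprocess.check_output([sys.executable, "-c", CHILD] + list(args))
    return json.loads(out.decode("utf8"))


# One list element containing a space arrives as ONE argument, which is
# why the parser reports a single unknown option.
joined = argv_seen_by_child(
    ["-outputFormatOptions includePunctuationDependencies"])

# Two list elements arrive as TWO arguments: a flag followed by its value.
separate = argv_seen_by_child(
    ["-outputFormatOptions", "includePunctuationDependencies"])

print(len(joined), joined)
print(len(separate), separate)
```

When a list is passed to Popen without a shell, no word splitting happens, so each element reaches the child process exactly as written. Pasting the space-joined command into a shell works only because the shell does that splitting first.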