An option string containing a space causes a Java error in the nltk.parse.stanford parsers

Time: 2016-11-23 21:20:22

Tags: python-2.7 nltk stanford-nlp

The following applies to NLTK 3.2 with stanford-parser-full-2015-12-09, running on Python 2.7.6 (and JDK 8) on Ubuntu 14.04 LTS. First, a little background...

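I want to keep punctuation in the output of StanfordDependencyParser, so I tried corenlp_options='-keepPunct', which did not work. I then found that, when running java from the command line, the way to do this is -outputFormatOptions "includePunctuationDependencies".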

from nltk.parse.stanford import StanfordDependencyParser as SDP
dp = SDP(corenlp_options='-outputFormatOptions includePunctuationDependencies')

But when I pass that to corenlp_options, everything looks fine until I actually try to parse something, and then I get an OSError:

print [parse.tree() for parse in dp.raw_parse('The quick brown fox jumps over the lazy dog.')]

WARNING! lexparser.Options: Unknown option ignored: -outputFormatOptions includePunctuationDependencies
[main] INFO edu.stanford.nlp.parser.lexparser.LexicalizedParser - Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ...
 done [0.4 sec].
Error loading parser, exiting...
Exception in thread "main" java.lang.IllegalArgumentException: Unknown option: -outputFormatOptions includePunctuationDependencies
        at edu.stanford.nlp.parser.lexparser.Options.setOption(Options.java:175)
        at edu.stanford.nlp.parser.lexparser.Options.setOptions(Options.java:68)
        at edu.stanford.nlp.parser.lexparser.Options.setOptions(Options.java:49)
        at edu.stanford.nlp.parser.lexparser.LexicalizedParser.setOptionFlags(LexicalizedParser.java:1007)
        at edu.stanford.nlp.parser.lexparser.LexicalizedParser.loadModel(LexicalizedParser.java:188)
        at edu.stanford.nlp.parser.lexparser.LexicalizedParser.main(LexicalizedParser.java:1412)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/nltk/parse/stanford.py", line 132, in raw_parse
    return next(self.raw_parse_sents([sentence], verbose))
  File "/usr/local/lib/python2.7/dist-packages/nltk/parse/stanford.py", line 150, in raw_parse_sents
    return self._parse_trees_output(self._execute(cmd, '\n'.join(sentences), verbose))
  File "/usr/local/lib/python2.7/dist-packages/nltk/parse/stanford.py", line 216, in _execute
    stdout=PIPE, stderr=PIPE)
  File "/usr/local/lib/python2.7/dist-packages/nltk/internals.py", line 134, in java
    raise OSError('Java command failed : ' + str(cmd))
OSError: Java command failed : ['/usr/lib/jvm/java-8-oracle/bin/java', u'-mx1000m', '-cp', '/home/dbl/stanford/stanford-english-corenlp-2016-10-31-models.jar:/home/dbl/stanford/stanford-parser-full-2015-12-09/stanford-parser-3.6.0-sources.jar:/home/dbl/stanford/stanford-parser-full-2015-12-09/slf4j-api.jar:/home/dbl/stanford/stanford-parser-full-2015-12-09/stanford-parser-3.6.0-models.jar:/home/dbl/stanford/stanford-parser-full-2015-12-09/stanford-parser.jar:/home/dbl/stanford/stanford-parser-full-2015-12-09/ejml-0.23.jar:/home/dbl/stanford/stanford-parser-full-2015-12-09/slf4j-simple.jar:/home/dbl/stanford/stanford-parser-full-2015-12-09/stanford-parser-3.6.0-javadoc.jar', u'edu.stanford.nlp.parser.lexparser.LexicalizedParser', u'-model', u'edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz', u'-sentences', u'newline', u'-outputFormat', u'conll2007', u'-encoding', u'utf8', '-outputFormatOptions includePunctuationDependencies', '/tmp/tmpbJ349q']

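Of course, if I join that list with spaces and paste it at a shell prompt, it runs fine. The problem is that NLTK's java function uses Popen, which hands each list element to the JVM as a single argument, so the space inside corenlp_options turns the flag and its value into one unknown option. Other than rewriting how corenlp_options is handled so that the string is split and used to extend the cmd list (since appending a string that contains a space is what breaks the Popen call), do I have any good options?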
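To make the difference concrete, here is a minimal sketch (not NLTK code) that launches a child Python process in place of the JVM and counts the arguments it receives. The helper name count_args is made up for illustration; the only assumption is that Popen-style calls given a list pass each element through as exactly one argument, with no shell splitting.

```python
import subprocess
import sys

# Child process that stands in for the JVM: it just reports how many
# arguments it was given (sys.argv[0] is '-c', so subtract one).
CHILD = "import sys; print(len(sys.argv) - 1)"

def count_args(extra_args):
    """Hypothetical helper: run the child with extra_args and return the count it saw."""
    out = subprocess.check_output([sys.executable, "-c", CHILD] + extra_args)
    return int(out.decode("ascii").strip())

# One list element containing a space: the child sees a single argument.
one_element = ["-outputFormatOptions includePunctuationDependencies"]

# Flag and value as separate list elements: the child sees two arguments.
two_elements = ["-outputFormatOptions", "includePunctuationDependencies"]

print(count_args(one_element))
print(count_args(two_elements))
```

A shell splits the pasted command line on whitespace before the program starts, which is why the same text works at the prompt but not as a single list element.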

Here is the relevant snippet from nltk.parse.stanford.GenericStanfordParser (which the dependency parser inherits from):

def _execute(self, cmd, input_, verbose=False):
    encoding = self._encoding
    cmd.extend(['-encoding', encoding])
    if self.corenlp_options:
        cmd.append(self.corenlp_options)

...

1 answer:

Answer 0 (score: 1)

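Pass the option and its value as separate arguments in the argument list, just like all the other options you can see there, for example: ..., u'-encoding', u'utf8', .... So simply writing '-outputFormatOptions', 'includePunctuationDependencies' should do it.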
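One way to get separate arguments out of a single options string is to split it before extending the command list. The sketch below is a standalone illustration, not the NLTK implementation: extend_with_options is a hypothetical helper that mirrors the first lines of _execute from the question, with cmd.append replaced by cmd.extend over shlex.split.

```python
import shlex

def extend_with_options(cmd, encoding, corenlp_options):
    """Hypothetical helper: build the argument list with the options string split up.

    Mirrors the start of _execute shown in the question, except that the
    options string is split into separate arguments instead of being
    appended as one element.
    """
    cmd = list(cmd)  # copy so the caller's list is not modified
    cmd.extend(['-encoding', encoding])
    if corenlp_options:
        # shlex.split follows shell-like quoting rules, so quoted values
        # that contain spaces stay together as one argument.
        cmd.extend(shlex.split(corenlp_options))
    return cmd

args = extend_with_options(
    ['java'], 'utf8',
    '-outputFormatOptions includePunctuationDependencies')
print(args)
```

Using shlex.split rather than str.split is a design choice here: it keeps a quoted value such as "a b" as a single argument, which a plain whitespace split would break apart.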