Question

I am using the latest version of CoreNLP.

My task is to parse text and get output in CoNLL format using CollapsedCCProcessedDependenciesAnnotation.

I run the following command:

time java -cp $CoreNLP/javanlp-core.jar edu.stanford.nlp.pipeline.StanfordCoreNLP -props $CoreNLP/config.properties -file 12309959  -outputFormat conll


depparse.model = english_SD.gz

The question is how to get CollapsedCCProcessedDependenciesAnnotation.

I tried using depparse.extradependencies in config.properties.

But according to the following Javadoc page, there is no option for CCProcessedDependenciesAnnotation: http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/trees/GrammaticalStructure.Extras.html#REF_ONLY_COLLAPSED

Can you suggest any solution that would let me get CoNLL output based on CollapsedCCProcessedDependenciesAnnotation?

Answer 1

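You can retrieve the CC-processed dependencies programmatically.

This question should be a good example (see the code in that example that uses CollapsedCCProcessedDependenciesAnnotation).

Gabor's answer on the mailing list explains this behavior well (that is, why you cannot output the collapsed dependencies directly):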

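Note that, in general, the collapsed CC-processed dependencies cannot be output losslessly to CoNLL, because the format expects a tree (every word has a unique parent), whereas a word in the collapsed dependencies can have multiple heads.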
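To make the multiple-heads point concrete, here is a minimal sketch in plain Java that does not use CoreNLP at all. The `Edge` class, the `MultiHeadCheck` name, and the edge list for "Bell makes and distributes products" are hypothetical and written by hand for illustration; they are not actual parser output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MultiHeadCheck {
    /** One dependency edge: governor index, relation label, dependent index (0 = ROOT). */
    static final class Edge {
        final int governor;
        final String relation;
        final int dependent;

        Edge(int governor, String relation, int dependent) {
            this.governor = governor;
            this.relation = relation;
            this.dependent = dependent;
        }
    }

    /** Groups the governor indices by dependent index, in edge order. */
    static Map<Integer, List<Integer>> headsByDependent(List<Edge> edges) {
        Map<Integer, List<Integer>> heads = new TreeMap<>();
        for (Edge e : edges) {
            heads.computeIfAbsent(e.dependent, k -> new ArrayList<>()).add(e.governor);
        }
        return heads;
    }

    /** Hand-written, illustrative edges for: 1=Bell 2=makes 3=and 4=distributes 5=products. */
    static List<Edge> exampleEdges() {
        return Arrays.asList(
                new Edge(0, "root", 2),
                new Edge(2, "nsubj", 1),
                new Edge(4, "nsubj", 1),   // subject propagated to the second conjunct
                new Edge(2, "conj_and", 4),
                new Edge(2, "dobj", 5));
    }

    public static void main(String[] args) {
        // Report every token that has more than one head and so cannot fit a single HEAD column.
        for (Map.Entry<Integer, List<Integer>> entry : headsByDependent(exampleEdges()).entrySet()) {
            if (entry.getValue().size() > 1) {
                System.out.println("token " + entry.getKey() + " has heads " + entry.getValue());
            }
        }
    }
}
```

In this hand-built graph, token 1 ("Bell") has two heads, and token 3 ("and") has none because it was folded into the `conj_and` relation. Neither case fits a format that expects exactly one parent per word.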

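Therefore, the output formatter only uses the basic dependencies: https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/pipeline/CoNLLOutputter.java#L118. This can be changed in the code without crashing anything, but the serialized tree will then be missing some edges, and ties over which edge to include will be broken somewhat arbitrarily. You would likely be better off writing your own dumping logic to suit your particular use case (you can copy most of our CoNLL output code from the link above).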
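As a rough sketch of what such custom dumping logic could look like, the plain-Java example below writes one tab-separated row per token and joins multiple heads with `|`. The `CollapsedDump` and `Edge` names and the column layout are assumptions made for illustration; the layout is an ad hoc convention, not standard CoNLL and not what `CoNLLOutputter` produces. In a real pipeline the edges would presumably be read from the `SemanticGraph` returned by `sentence.get(SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation.class)`, for example by iterating `edgeIterable()` and reading `getGovernor().index()`, `getRelation()`, and `getDependent().index()`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CollapsedDump {
    /** Hypothetical edge holder: governor index, relation label, dependent index (0 = ROOT). */
    static final class Edge {
        final int governor;
        final String relation;
        final int dependent;

        Edge(int governor, String relation, int dependent) {
            this.governor = governor;
            this.relation = relation;
            this.dependent = dependent;
        }
    }

    /**
     * Writes one row per token: index, word, heads, relations (tab-separated).
     * Multiple heads are joined with '|'; tokens without any head get '_'.
     */
    static String dump(List<String> words, List<Edge> edges) {
        StringBuilder out = new StringBuilder();
        for (int i = 1; i <= words.size(); i++) {
            List<String> heads = new ArrayList<>();
            List<String> relations = new ArrayList<>();
            for (Edge e : edges) {
                if (e.dependent == i) {
                    heads.add(Integer.toString(e.governor));
                    relations.add(e.relation);
                }
            }
            String headColumn = heads.isEmpty() ? "_" : String.join("|", heads);
            String relationColumn = relations.isEmpty() ? "_" : String.join("|", relations);
            out.append(i).append('\t')
               .append(words.get(i - 1)).append('\t')
               .append(headColumn).append('\t')
               .append(relationColumn).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hand-written, illustrative input; not actual parser output.
        List<String> words = Arrays.asList("Bell", "makes", "and", "distributes", "products");
        List<Edge> edges = Arrays.asList(
                new Edge(0, "root", 2),
                new Edge(2, "nsubj", 1),
                new Edge(4, "nsubj", 1),
                new Edge(2, "conj_and", 4),
                new Edge(2, "dobj", 5));
        System.out.print(dump(words, edges));
    }
}
```

Keeping every head in the row avoids the arbitrary tie-breaking described above, at the cost of output that standard CoNLL readers will not accept without adaptation.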

CoreNLP CoNLL format with CollapsedCCProcessedDependenciesAnnotation
