I am currently using Stanford CoreNLP's OpenIE system through its Java command-line interface:
java -mx32g -cp stanford-corenlp-3.8.0.jar:stanford-corenlp-3.8.0-models.jar:CoreNLP-to-HTML.xsl:slf4j-api.jar:slf4j-simple.jar edu.stanford.nlp.naturalli.OpenIE test_file.txt -threads 8 -resolve_coref true
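My test file contains 50,000 sentences, one per line.
The OpenIE output is a single list of tuples for all sentences. Is there a flag I can set that preserves the correspondence between each tuple and the sentence it came from? (For example, some sentences may have no extractions and some may have more than one. How can I tell which is which?)
My current workaround is to create 50,000 files with one sentence each. But this is very slow, because the models have to be reloaded for every file.
Thanks.
Edit:
I realized that the -filelist flag makes processing much faster, which is good. Unfortunately, the output still does not distinguish between the different files.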
Answer 0 (score: 1)
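If you output in ReVerb format (-format reverb), you should be able to get the sentence information. Also, I expect you will want to force the tokenizer to split sentences on newlines (-ssplit.newlineIsSentenceBreak always). For example, the following command, adapted from your example, should work: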
java -mx8g -cp stanford-corenlp-3.8.0.jar:stanford-corenlp-3.8.0-models.jar:CoreNLP-to-HTML.xsl:slf4j-api.jar:slf4j-simple.jar \
edu.stanford.nlp.naturalli.OpenIE \
-threads 8 -resolve_coref true \
-ssplit.newlineIsSentenceBreak always \
-format reverb \
input.txt
For the following input file:
George Bush was born in Texas
Obama was born in Hawaii
I get the following output on stdout (you can redirect it to a file with the -output <filename> flag):
input.txt 0 George Bush was born 0 2 2 3 3 4 1.000 George Bush was born in Texas NNP NNP VBD VBN IN NNP George Bush be bear
input.txt 0 George Bush was born in Texas 0 2 2 5 5 1.000 George Bush was born in Texas NNP NNP VBD VBN IN NNP George Bush be bear in Texas
input.txt 1 Obama was born in Hawaii 0 1 1 4 4 5 1.000 Obama was born in Hawaii NNP VBD VBN IN NNP Obama be bear in Hawaii
input.txt 1 Obama was born 0 1 1 2 2 3 1.000 Obama was born in Hawaii NNP VBD VBN IN NNP Obama be bear
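To map tuples back to sentences, the output can be post-processed with a short script. The following is only a minimal sketch in Python, not part of CoreNLP: it assumes the ReVerb-style output is tab-separated and that the first five fields are file name, sentence index, argument 1, relation, and argument 2, as the sample above suggests. The function name group_by_sentence and the sample lines (hand-trimmed to those five fields) are illustrative.

```python
from collections import defaultdict


def group_by_sentence(reverb_lines):
    """Map (file name, sentence index) -> list of (arg1, relation, arg2).

    Assumes tab-separated lines whose first five fields are:
    file name, sentence index, arg1, relation, arg2.
    """
    groups = defaultdict(list)
    for line in reverb_lines:
        fields = line.rstrip("\n").split("\t")
        # Skip lines that do not match the assumed layout.
        if len(fields) < 5 or not fields[1].isdigit():
            continue
        key = (fields[0], int(fields[1]))
        groups[key].append((fields[2], fields[3], fields[4]))
    return dict(groups)


# Illustrative lines, trimmed to the first five fields of the output above.
sample = [
    "input.txt\t0\tGeorge Bush\twas born in\tTexas\n",
    "input.txt\t1\tObama\twas born in\tHawaii\n",
    "input.txt\t1\tObama\twas\tborn\n",
]

groups = group_by_sentence(sample)

# Sentences with no extraction are the indices missing from the map.
# This assumes 0-based indices that line up with input lines, which
# should hold with -ssplit.newlineIsSentenceBreak always and no blank lines.
num_sentences = 3  # hypothetical number of lines in input.txt
missing = [i for i in range(num_sentences) if ("input.txt", i) not in groups]
```

In practice you would pass an open file handle for the saved output instead of the sample list. The same grouping works when -filelist is used, because the file name is part of the key.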
The second column is the sentence index; the full list of tab-separated columns is documented in the ReVerb README: