I am trying to parse a series of news stories with Petrarch. According to its official documentation:
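The main input format for PETRARCH is an XML document, with each entry in the document being the sentence or story to be parsed. The inputs can be either individual sentences or entire stories. Additionally, the inputs can contain pre-parsed information from StanfordNLP or just the plain text, with the Stanford parse left up to TABARI. Whether the inputs are parsed or not is indicated using the -P flag in the PETRARCH command-line arguments.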
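In other words, Petrarch uses StanfordNLP as part of its parsing toolchain.
My news documents are all in a single txt file with no XML structure (so there are no sentence attributes or ids, but there are dates). Before converting everything, I wanted to try a sample text to see whether it works; if it does, I will reprogram my texts into the corresponding format. Here is the sample: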
<document>
<Sentences>
<Sentence sentence = "Boolean" id = "1" date = "20151026">
<Text>China, Japan and South Korea will hold a summit in South Korea when Chinese Premier Li Keqiang visits.</Text>
</Sentence>
<Sentence sentence = "Boolean" id = "2" date = "20151027">
<Text>It is the first China-Japan-South Korea meeting since they were discontinued in 2012 amid tension dating back to World War Two.</Text>
</Sentence>
<Sentence sentence = "Boolean" id = "3" date = "20151027">
<Text>Marry has a happy life.</Text>
</Sentence>
</Sentences>
</document>
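For the conversion step, this is roughly what I have in mind: a minimal Python 3 sketch that wraps (date, text) pairs in the same layout as the sample above. The function names `build_petrarch_xml` and `read_records` are my own, and the one-entry-per-line, tab-separated `YYYYMMDD<TAB>text` layout is only an assumption for illustration, not how my file is necessarily organized. The `sentence` attribute value just mirrors the sample and may not be what PETRARCH expects.

```python
import xml.etree.ElementTree as ET


def build_petrarch_xml(records, sentence_flag="Boolean"):
    """Build an XML string laid out like the sample above.

    records: iterable of (date, text) pairs, date as a YYYYMMDD string.
    sentence_flag: value for the `sentence` attribute; the default only
    mirrors the sample and is not confirmed to be what PETRARCH expects.
    """
    document = ET.Element("document")
    sentences = ET.SubElement(document, "Sentences")
    for index, (date, text) in enumerate(records, start=1):
        sentence = ET.SubElement(
            sentences,
            "Sentence",
            {"sentence": sentence_flag, "id": str(index), "date": date},
        )
        # ElementTree escapes characters such as & and < when serializing.
        ET.SubElement(sentence, "Text").text = text.strip()
    return ET.tostring(document, encoding="unicode")


def read_records(path):
    """Hypothetical reader: assumes one 'YYYYMMDD<TAB>text' entry per line."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines between entries
            date, text = line.split("\t", 1)
            yield date, text
```

With a file in that assumed layout (say a hypothetical `news.txt`), the string returned by `build_petrarch_xml(read_records("news.txt"))` could be written to `reuter1025.xml` and passed to Petrarch.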
Petrarch accepts the format and the program runs without errors, but it produces no output. Here are the commands I ran:
cd
virtualenv venv
source venv/bin/activate
petrarch parse -i reuter1025.xml -o output.txt
Here is the log I copied from the terminal:
(venv)d-172-26-7-114:~ Carl$ petrarch parse -i reuter1025.xml -o output.txt
new_actor_length = 0
stop_on_error = False
write_actor_root = False
write_actor_text = False
require_dyad = True
code-by-sentence True
pause_by_sentence False
pause_by_story False
Comma-delimited clause elimination:
Initial : deactivated
Internal: min = 2 max = 8
Terminal: min = 2 max = 8
Verb dictionary: CAMEO.verbpatterns.150430.txt
Actor dictionaries: [u'Phoenix.Countries.actors.txt', u'Phoenix.International.actors.txt', u'Phoenix.MilNonState.actors.txt']
Agent dictionary: Phoenix.agents.txt
Discard dictionary: Phoenix.discards.txt
Issues dictionary: Phoenix.IssueCoding.txt
Setting up StanfordNLP. The program isn't dead. Promise.
Stanford setup complete. Starting parse of 3 stories...
Done with StanfordNLP parse...
Discard sentence: CHINA FIRST 2012
Discard sentence: CHINA FIRST 2012
Summary:
Stories read: 0 Sentences coded: 0 Events generated: 0
Discards: Sentence 2 Story 0 Sentences without events: 0
Coding time: 5.003469944
Finished
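The problem seems to be that StanfordNLP is ignoring all of my sentences. For those with experience, is there anything wrong with my input format? I would really like to get this working, and any ideas would be appreciated!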