Question

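I am working with Stanford NLP from Python. I have a function that takes some text files and converts them to XML files (generated by Stanford CoreNLP). Now I want to write another function that takes these XML files and outputs corresponding files containing the same text, but with each named entity replaced by its label, the end of each sentence marked with the word "STOP", and punctuation removed. Each file should also begin with the word "STOP". The function that produces the XML files is: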

import subprocess

def generate_xml(input,output):
    # Note: the input and output arguments are not used yet; the file list
    # and the output directory are hard-coded in the command below.
    p = subprocess.Popen('java -cp stanford-corenlp-2012-07-09.jar:stanford-corenlp-2012-07-06-models.jar:xom.jar:joda-time.jar -Xmx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -filelist /Users/akritibahal/Downloads/stanford-corenlp-2012-07-09/myfile_list.txt -outputDirectory /Users/akritibahal/Downloads/stanford-corenlp-2012-07-09', shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, universal_newlines=True)
    # Echo CoreNLP's combined stdout/stderr while it runs.
    for line in p.stdout:
        print(line.rstrip())

    retval = p.wait()
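
As written, generate_xml ignores both of its arguments. A minimal sketch of a parameterized variant follows, assuming the same jars and the same -annotators, -filelist and -outputDirectory options as the command above. The helper name build_corenlp_command and the parameter names filelist and output_dir are hypothetical, introduced only for illustration.

```python
import subprocess


def build_corenlp_command(filelist, output_dir):
    """Build the CoreNLP command as an argument list, so no shell is needed."""
    # Jar names are copied from the command in the question; ":" is the
    # Unix classpath separator, as in the original command.
    classpath = ":".join([
        "stanford-corenlp-2012-07-09.jar",
        "stanford-corenlp-2012-07-06-models.jar",
        "xom.jar",
        "joda-time.jar",
    ])
    return [
        "java", "-cp", classpath, "-Xmx3g",
        "edu.stanford.nlp.pipeline.StanfordCoreNLP",
        "-annotators", "tokenize,ssplit,pos,lemma,ner",
        "-filelist", filelist,
        "-outputDirectory", output_dir,
    ]


def generate_xml(filelist, output_dir):
    """Run CoreNLP on the files listed in filelist and return its exit code."""
    # Assumes java is on the PATH and the jars are in the current directory.
    return subprocess.call(build_corenlp_command(filelist, output_dir))
```

Passing a list instead of a single shell string avoids quoting problems when a path contains spaces.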

The function that should produce the output file with named entity labels is:

def process_file(input_xml,output_file):

Can someone help me work out how to produce such output files with the named entity labels?

Answer 1

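I parsed CoreNLP's output with minidom. Below is some starter code you may want to use, but you may also want to look at https://github.com/dasmith/stanford-corenlp-python

Note that you need to use the tokenization produced by Stanford CoreNLP, because the data it returns is based on sentence and token offsets.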

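from xml.dom import minidom

# raw_xml_data: the contents of one CoreNLP output .xml file
xmldoc = minidom.parseString(raw_xml_data)
dependencies = []
for sentence_xml in xmldoc.getElementsByTagName('sentences')[0].getElementsByTagName('sentence'):
    # `parser` is not defined in this snippet: it stands for whatever tree parser
    # you use to read the bracketed parse string. The <parse> element is only
    # present if CoreNLP was run with the parse annotator.
    parse = parser.parse(sentence_xml.getElementsByTagName('parse')[0].firstChild.nodeValue)
    # pair each <token> element with the corresponding leaf of the parse tree
    tokens = [(i,j) for i,j in zip(sentence_xml.getElementsByTagName('tokens')[0].getElementsByTagName('token'),parse.get_leaves())]
    # example for processing dependencies
    elements = sentence_xml.getElementsByTagName('dependencies')
    for element in elements:
        if element.getAttribute('type')=="collapsed-ccprocessed-dependencies":
            dependencies += [i for i in element.getElementsByTagName('dep')]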
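
For the specific output asked for in the question, only the token elements are needed. The following is a rough sketch rather than a definitive implementation. It assumes each token element in the CoreNLP XML has word and NER children, that non-entity tokens carry the NER value O, and that a run of consecutive tokens with the same label should be collapsed into a single label. The helper names (replace_entities, _child_text, _is_punctuation) and the punctuation rule are my own choices, not part of CoreNLP.

```python
from xml.dom import minidom

# PTB-style bracket tokens that CoreNLP may emit in place of ( ) [ ] { }
BRACKET_TOKENS = {"-LRB-", "-RRB-", "-LSB-", "-RSB-", "-LCB-", "-RCB-"}


def _child_text(token, tag, default=""):
    """Return the text of the first <tag> element under token, or default."""
    nodes = token.getElementsByTagName(tag)
    if nodes and nodes[0].firstChild is not None:
        return nodes[0].firstChild.nodeValue
    return default


def _is_punctuation(word):
    """True for bracket tokens and for tokens with no letter or digit."""
    return word in BRACKET_TOKENS or not any(ch.isalnum() for ch in word)


def replace_entities(xml_data):
    """Convert CoreNLP XML (str or bytes) to 'STOP w1 w2 ... STOP ...'."""
    xmldoc = minidom.parseString(xml_data)
    out = ["STOP"]  # the file starts with STOP
    containers = xmldoc.getElementsByTagName("sentences")
    if not containers:
        return " ".join(out)
    for sentence in containers[0].getElementsByTagName("sentence"):
        previous_label = "O"
        for token in sentence.getElementsByTagName("token"):
            word = _child_text(token, "word")
            label = _child_text(token, "NER", "O")
            if label != "O":
                # Emit the label once per run of identically labelled tokens.
                if label != previous_label:
                    out.append(label)
            elif not _is_punctuation(word):
                out.append(word)
            previous_label = label
        out.append("STOP")  # mark the end of the sentence
    return " ".join(out)


def process_file(input_xml, output_file):
    """Read one CoreNLP XML file and write the entity-replaced text."""
    # Read bytes so the XML parser can honour the file's encoding declaration.
    with open(input_xml, "rb") as f:
        text = replace_entities(f.read())
    with open(output_file, "w", encoding="utf-8") as f:
        f.write(text + "\n")
```

With these assumptions, a sentence tokenized as Barack/PERSON Obama/PERSON visited/O Paris/LOCATION ./O comes out as STOP PERSON visited LOCATION STOP. If you would rather emit one label per token (PERSON PERSON), drop the previous_label check.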
