Question

I'm having trouble parsing the XML file below. Here is the file, followed by what I have tried:

<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<corpus name="P4P" version="1.0" lng="en" xmlns="http://clic.ub.edu/mbertran/formats/paraphrase-corpus"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://clic.ub.edu/mbertran/
formats/paraphrase-corpus http://clic.ub.edu/mbertran/formats/paraphrase-corpus.xsd">
    <snippets>
        <snippet id="16488"    source_description="type:plagiarism;plagiarism_reference:00061;
        offset:47727;length:182;source:P4P;wd_count:37">
        All art is imitation of nature.
        </snippet>

    </snippets>
</corpus>

import xml.etree.ElementTree
#root=xml.etree.ElementTree.parse("C:\\Users\\P4P_corpus\\P4P_corpus_v1.xml").getroot()
source=root.findall('snippets/snippet')
for details in source.findall:
    print details.get('source_description')
    print details.findtext

My output is empty.

The output I want:

"type:plagiarism;plagiarism_reference:00061;
        offset:47727;length:182;source:P4P;wd_count:37"

and: All art is imitation of nature.

I would really appreciate your suggestions.

Answer 1

You need to prefix the element names with the XML namespace. If you print root after parsing, you will see:

   <Element '{http://clic.ub.edu/mbertran/formats/paraphrase-corpus}corpus' at 0x7ff7891f6390>
            ^       this part here is the full name                       ^

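So to iterate over the 'snippet' elements, you first find the 'snippets' element and then the 'snippet' elements inside it, using the fully qualified name each time: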

for snippets in root.findall('{http://clic.ub.edu/mbertran/formats/paraphrase-corpus}snippets'):
    for s in snippets.findall('{http://clic.ub.edu/mbertran/formats/paraphrase-corpus}snippet'):
        print s.get('source_description')
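
The loop above prints only the attribute, while the question also asks for the element text. Here is a minimal self-contained sketch that returns both. It is written for Python 3 and parses an inline, shortened copy of the XML from the question rather than the file on disk. The schema attributes are omitted and the attribute value is joined onto one line. The `NS` constant and the `snippet_details` helper are names made up for this example.

```python
import xml.etree.ElementTree as ET

# Shortened inline copy of the corpus from the question (assumption: the real
# file is read with ET.parse(path).getroot() instead).
XML = """<corpus name="P4P" version="1.0" lng="en" xmlns="http://clic.ub.edu/mbertran/formats/paraphrase-corpus">
    <snippets>
        <snippet id="16488" source_description="type:plagiarism;plagiarism_reference:00061;offset:47727;length:182;source:P4P;wd_count:37">
        All art is imitation of nature.
        </snippet>
    </snippets>
</corpus>"""

# Namespace URI wrapped in braces, as it appears in each element's tag.
NS = "{http://clic.ub.edu/mbertran/formats/paraphrase-corpus}"


def snippet_details(root):
    """Return (source_description, text) pairs for every snippet element."""
    results = []
    for snippets in root.findall(NS + "snippets"):
        for s in snippets.findall(NS + "snippet"):
            # s.text holds the text inside the element; strip the
            # surrounding whitespace and newlines.
            results.append((s.get("source_description"), (s.text or "").strip()))
    return results


root = ET.fromstring(XML)
for description, text in snippet_details(root):
    print(description)
    print(text)
```

Note that `findtext` in the question is a method and has to be called; reading `s.text` directly is the simpler route for the element's own text.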

You can read more about handling namespaces at https://docs.python.org/2/library/xml.etree.elementtree.html#parsing-xml-with-namespaces
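
The approach described on that page can be sketched as follows: pass a prefix-to-URI mapping as the second argument to `findall`, so the path does not have to repeat the full URI. The `p4p` prefix is an arbitrary choice for this example and does not need to appear in the document itself. The inline XML is a shortened copy of the file from the question, and the code is written for Python 3.

```python
import xml.etree.ElementTree as ET

# Shortened inline copy of the corpus from the question.
XML = """<corpus name="P4P" version="1.0" lng="en" xmlns="http://clic.ub.edu/mbertran/formats/paraphrase-corpus">
    <snippets>
        <snippet id="16488" source_description="type:plagiarism;plagiarism_reference:00061;offset:47727;length:182;source:P4P;wd_count:37">
        All art is imitation of nature.
        </snippet>
    </snippets>
</corpus>"""

# Map a prefix of our choosing to the document's default namespace URI.
ns = {"p4p": "http://clic.ub.edu/mbertran/formats/paraphrase-corpus"}

root = ET.fromstring(XML)

# Each step of the path carries the prefix; ElementTree expands it to the URI.
found = root.findall("p4p:snippets/p4p:snippet", ns)
for s in found:
    print(s.get("source_description"))
    print(s.text.strip())
```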
