Question

<root>
  <article>
    <front>
      <body>
        <back>
          <sec id="sec7" sec-type="funding">
            <title>Funding</title>
            <p>This work was supported by the NIH</p>
          </sec>
        </back>

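I have an XML file of scientific journal metadata and am trying to extract only the funding information for each article. I need the information contained in the p tag. While the "sec id" varies between articles, the "sec-type" is always "funding".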

I have been trying to do this in Python 3 using ElementTree.

import xml.etree.ElementTree as ET

tree = ET.parse('journals.xml')
root = tree.getroot()
for title in root.iter("title"):
    ET.dump(title)

Any help would be greatly appreciated!

Answer 1

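You can use findall with an XPath expression to extract the values you want. I extrapolated a bit from your sample data in order to complete the document and include two p elements: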

<root>
  <article>
    <front>
      <body>
        <back>
          <sec id="sec7" sec-type="funding">
            <title>Funding</title>
            <p>This work was supported by the NIH</p>
          </sec>
          <sec id="sec8" sec-type="funding">
            <title>Funding</title>
            <p>I'm a little teapot</p>
          </sec>
        </back>
      </body>
    </front>
  </article>
</root>

The following extracts the text content of all p nodes that sit directly under sec nodes with sec-type="funding":

import xml.etree.ElementTree as ET

doc = ET.parse('journals.xml')
print([p.text for p in doc.findall('.//sec[@sec-type="funding"]/p')])

Result:

['This work was supported by the NIH', "I'm a little teapot"]
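Since the question asks for the funding information of each article, a variation that groups the paragraphs per article may be useful. The following is only a sketch under assumptions: the second article, the inline <italic> element, and the helper name funding_by_article are made up for illustration and do not come from the original data. It uses itertext() rather than p.text, because p.text stops at the first child element and would drop any text inside or after inline markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data modelled on the document above; the <italic>
# element and the second <article> are assumptions added for illustration.
SAMPLE = """
<root>
  <article>
    <back>
      <sec id="sec7" sec-type="funding">
        <title>Funding</title>
        <p>This work was supported by the <italic>NIH</italic>.</p>
      </sec>
    </back>
  </article>
  <article>
    <back>
      <sec id="sec3" sec-type="funding">
        <title>Funding</title>
        <p>I'm a little teapot</p>
      </sec>
    </back>
  </article>
</root>
"""


def funding_by_article(root):
    """Return one list of funding paragraphs per <article> element."""
    results = []
    for article in root.iter("article"):
        paragraphs = [
            # itertext() walks nested elements, so inline markup is not lost
            "".join(p.itertext()).strip()
            for p in article.findall('.//sec[@sec-type="funding"]/p')
        ]
        results.append(paragraphs)
    return results


funding = funding_by_article(ET.fromstring(SAMPLE))
print(funding)
```

To run this against a file instead of the inline string, pass ET.parse('journals.xml').getroot() to the helper.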
