Question

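My question is a follow-up to the one here, but I did not want to use the answer section there to ask an additional question.

If I have a portion of an XML file like this: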

  <eligibility>
    <criteria>
      <textblock>
        Inclusion Criteria:

          -  women undergoing cesarean section for any indication

          -  literate in german language

        Exclusion Criteria:

          -  history of keloids

          -  previous transversal suprapubic scars

          -  known patient hypersensitivity to any of the suture materials used in the protocol

          -  a medical disorder that could affect wound healing (eg, diabetes mellitus, chronic
             corticosteroid use)
      </textblock>
    </criteria>
    <gender>Female</gender>
    <minimum_age>18 Years</minimum_age>
    <maximum_age>45 Years</maximum_age>
    <healthy_volunteers>No</healthy_volunteers>
  </eligibility>

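I want to extract all the strings in this eligibility section (i.e. the strings in the textblock part as well as those in the gender, minimum_age, maximum_age and healthy_volunteers parts).

Using the code above, I wrote this: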

import sys
from bs4 import BeautifulSoup

soup = BeautifulSoup(open(sys.argv[1], 'r'), 'lxml')
eligibi = []

for eligibility in soup.find_all('eligibility'):
    d = {'other_name':eligibility.criteria.textblock.string, 'gender':eligibility.gender.string}
    eligibi.append(d)

print(eligibi)

My problem is that I have a lot of files. Sometimes the structure of the XML file may be:

eligibility -> criteria -> textblock -> text
eligibility -> other things (e.g. gender as above) -> text
eligibility -> text

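In other words, is there a way to take all of the sub-tags and their text?

So in the example above, the list/dictionary would contain: {criteria textblock: inclusion and exclusion criteria, gender: xxx, minimum_age: xxx, maximum_age: xxx, healthy_volunteers: xxx}

My problem is that, in practice, I will not know all the specific sub-tags of the eligibility tag, because each trial may be different (e.g. some may say 'pregnant women accepted', 'drug history of XXX accepted', etc.).

So what I want is this: if I give it a tag name, it gives me all the sub-tags and the text of those sub-tags in a dictionary.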

Extended XML in response to a comment:

<brief_title>Subcutaneous Adaption and Cosmetic Outcome Following Caesarean Delivery</brief_title>
<source>Klinikum Klagenfurt am Wörthersee</source>

...followed by the eligibility XML section above.

Answer 1

Since you have lxml installed, you could try the following (this code assumes that the leaf element names within a given eligibility element are unique):

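import sys
from lxml import etree

# Parse the XML file named on the command line.
tree = etree.parse(sys.argv[1])
root = tree.getroot()

eligibi = []

for eligibility in root.xpath('//eligibility'):
    d = {}
    # Select every leaf element (an element with no child elements).
    for e in eligibility.xpath('.//*[not(*)]'):
        d[e.tag] = e.text
    eligibi.append(d)

print(eligibi)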

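XPath explanation:

.//* : finds all elements inside the current eligibility element, regardless of depth (//) and tag name (*)
[not(*)] : filters the elements found by the previous step down to those that have no child elements, i.e. leaf elements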
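
The same leaf-element idea can also be sketched with only the standard library. This is not the answerer's code: xml.etree.ElementTree does not support the not(*) predicate, so the sketch below tests len(elem) == 0 instead. The helper name leaf_texts, the SAMPLE string and its clinical_study wrapper element are assumptions made for illustration. Repeated leaf names are collected into lists rather than overwritten, and because iter() also yields the starting element, an eligibility element that directly holds text should be covered as well.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample; the <clinical_study> wrapper is an assumption.
SAMPLE = """<clinical_study>
  <brief_title>Example title</brief_title>
  <eligibility>
    <criteria>
      <textblock>
        Inclusion Criteria:
          -  literate in german language
      </textblock>
    </criteria>
    <gender>Female</gender>
    <minimum_age>18 Years</minimum_age>
  </eligibility>
</clinical_study>"""


def leaf_texts(root, tag_name):
    """Return one {leaf tag: [text, ...]} dict per <tag_name> element under root."""
    results = []
    for parent in root.iter(tag_name):
        leaves = defaultdict(list)
        # iter() walks the element itself and all of its descendants.
        for elem in parent.iter():
            # A leaf has no child elements; skip empty or whitespace-only text.
            if len(elem) == 0 and elem.text and elem.text.strip():
                leaves[elem.tag].append(elem.text.strip())
        results.append(dict(leaves))
    return results


root = ET.fromstring(SAMPLE)
for record in leaf_texts(root, "eligibility"):
    print(record)
```

For a file on disk, ET.parse(path).getroot() could be passed in place of ET.fromstring(SAMPLE).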

Iterating over all sub-tags and strings of an XML tag in Python without specifying the sub-tag names
