How to iterate over multiple child nodes of an XML item using Python 2.7

Time: 2015-06-05 18:40:47

Tags: python xml

I am trying to parse imperfectly structured XML data from the USPTO, of the form:

<parent>
 <child>
  <child-text>text
  <child-text>more text</child-text>
  <child-text>more text</child-text>
  </child-text>
 </child>
</parent>

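I am trying to capture all of the text in the child-text nodes. As you can see, though, the first child-text tag is not closed until after all of the remaining tags have been closed. The following excerpt is an example: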

<claims id="claims">
  <claim id="CLM-00001" num="00001">
    <claim-text>1. An all-solid-state electrochromic device comprising:
    <claim-text>a transparent base material; and</claim-text>
    <claim-text>an electrochromic multilayer-stack structure formed on the transparent base material, the electrochromic multilayer-stack structure comprising:
    <claim-text>a first transparent-conductive film;</claim-text>
    <claim-text>an ion-storage layer formed on the first transparent-conductive film;</claim-text>
    <claim-text>a solid-electrolyte layer formed on the ion-storage layer; and</claim-text>
    <claim-text>an electrochromic layer formed on the solid-electrolyte layer, the electrochromic layer comprising a reflection-controllable electrochromic layer comprising an antimony-based alloy comprising Sb<sub>x</sub>CoLi<sub>y </sub>in which 0.5&#x2266;x&#x2266;10, and 0.1&#x2266;y&#x2266;10.</claim-text>
    </claim-text>
    </claim-text>
  </claim>
<claim id="CLM-00002" num="00002">
<claim-text>2. The all-solid-state electrochromic device according to <claim-ref idref="CLM-00001">claim 1</claim-ref>, wherein 3&#x2266;x&#x2266;5 and 0.1&#x2266;y&#x2266;3.</claim-text>
</claim>
</claims>

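My current approach captures only the content of the first tag and does not fully capture the content of the child elements (as in the example above):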

claims = self.xml.claim
for i, claim in enumerate(claims):
    data = {}
    data['text'] = claim.contents_of('claim_text', as_string=True, upper=False)

Despite the inconsistent structure, how can I iterate over all of the <claim-text> tags and the <claim-ref> child tags?
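
To make the goal concrete, here is a minimal sketch of what walking this nesting could look like. It assumes the standard-library xml.etree.ElementTree rather than the self.xml wrapper used above (whose API is not shown here), and it assumes the input is well-formed, as the excerpt is. SAMPLE is a shortened copy of the excerpt, and claim_texts is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the excerpt above: same nesting, abbreviated text.
SAMPLE = """<claims id="claims">
  <claim id="CLM-00001" num="00001">
    <claim-text>1. A device comprising:
    <claim-text>a base material; and</claim-text>
    <claim-text>a stack comprising:
    <claim-text>a first film;</claim-text>
    <claim-text>Sb<sub>x</sub>CoLi<sub>y </sub>in which 0.5&#x2266;x&#x2266;10.</claim-text>
    </claim-text>
    </claim-text>
  </claim>
  <claim id="CLM-00002" num="00002">
    <claim-text>2. The device according to <claim-ref idref="CLM-00001">claim 1</claim-ref>, wherein 3&#x2266;x&#x2266;5.</claim-text>
  </claim>
</claims>"""


def claim_texts(xml_string):
    """Return one dict per <claim> with all nested text flattened (hypothetical helper)."""
    root = ET.fromstring(xml_string)
    results = []
    for claim in root.iter('claim'):
        # itertext() yields the text of the element and of every descendant
        # in document order, so nested <claim-text>, <sub> and <claim-ref>
        # content is all included.
        raw = ''.join(claim.itertext())
        text = ' '.join(raw.split())  # collapse newlines and indentation
        # Collect the ids that any <claim-ref> descendants point to.
        refs = [ref.get('idref') for ref in claim.iter('claim-ref')]
        results.append({'id': claim.get('id'), 'text': text, 'refs': refs})
    return results


if __name__ == '__main__':
    for entry in claim_texts(SAMPLE):
        print(entry['id'], entry['refs'])
```

Both Element.iter() and Element.itertext() are available from Python 2.7 onward. Flattening per claim loses the nesting depth of each <claim-text>, so this only fits if the combined text is what is wanted.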

1 answer:

Answer 0 (score: 0)

I had a similar problem with an XML document. What I did was:

xml_document[xml_document.find("<claim-text>") + len("<claim-text>"):xml_document.find("</claim-text>")]

This returns the content inside the XML tag.

Then use an if statement to remove any extra tags left inside that content.

On each iteration, remove the already-parsed part of xml_document by slicing it off by index.
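
The slicing loop described in this answer could be sketched roughly as follows. This is only one reading of the description: the function name slice_claim_texts, the regular expression used to drop leftover tags, and the whitespace cleanup are assumptions, not part of the original answer.

```python
import re

OPEN_TAG = '<claim-text>'
CLOSE_TAG = '</claim-text>'


def slice_claim_texts(xml_document):
    """Collect text between successive <claim-text> and </claim-text> markers."""
    segments = []
    while True:
        start = xml_document.find(OPEN_TAG)
        if start == -1:
            break  # no opening tag left
        start += len(OPEN_TAG)
        end = xml_document.find(CLOSE_TAG, start)
        if end == -1:
            break  # unbalanced input: stop rather than guess
        content = xml_document[start:end]
        # A nested <claim-text> (or <claim-ref>, <sub>, ...) may be left
        # inside the slice, so drop any remaining tags and keep the text.
        if '<' in content:
            content = re.sub(r'<[^>]+>', '', content)
        content = ' '.join(content.split())  # collapse whitespace
        if content:
            segments.append(content)
        # Remove the parsed part so the next find() moves forward.
        xml_document = xml_document[end + len(CLOSE_TAG):]
    return segments


if __name__ == '__main__':
    sample = ('<claim><claim-text>1. A device comprising:\n'
              '<claim-text>a base material; and</claim-text>\n'
              '<claim-text>a first film.</claim-text>\n'
              '</claim-text></claim>')
    for segment in slice_claim_texts(sample):
        print(segment)
```

Because find() stops at the first closing tag, a parent's lead-in text ends up in the same segment as its first nested child. Character references such as &#x2266; are also left undecoded, and an opening tag that carries attributes would not match OPEN_TAG, so this string-based approach is best treated as a fallback for input that an XML parser rejects.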