I am trying to parse imperfectly structured XML data from the USPTO that takes the following form:
<parent>
<child>
<child-text>text
<child-text>more text</child-text>
<child-text>more text</child-text>
</child-text>
</child>
</parent>
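I am trying to capture all of the text in the child-text nodes. As you can see, however, the first child-text tag is not closed until after all of the remaining tags have closed. The following excerpt is an example: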
<claims id="claims">
<claim id="CLM-00001" num="00001">
<claim-text>1. An all-solid-state electrochromic device comprising:
<claim-text>a transparent base material; and</claim-text>
<claim-text>an electrochromic multilayer-stack structure formed on the transparent base material, the electrochromic multilayer-stack structure comprising:
<claim-text>a first transparent-conductive film;</claim-text>
<claim-text>an ion-storage layer formed on the first transparent-conductive film;</claim-text>
<claim-text>a solid-electrolyte layer formed on the ion-storage layer; and</claim-text>
<claim-text>an electrochromic layer formed on the solid-electrolyte layer, the electrochromic layer comprising a reflection-controllable electrochromic layer comprising an antimony-based alloy comprising Sb<sub>x</sub>CoLi<sub>y </sub>in which 0.5≦x≦10, and 0.1≦y≦10.</claim-text>
</claim-text>
</claim-text>
</claim>
<claim id="CLM-00002" num="00002">
<claim-text>2. The all-solid-state electrochromic device according to <claim-ref idref="CLM-00001">claim 1</claim-ref>, wherein 3≦x≦5 and 0.1≦y≦3.</claim-text>
</claim>
</claims>
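My current approach captures only the content of the first tag and does not fully capture the content of the child elements (as in the example above):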
claims = self.xml.claim
for i, claim in enumerate(claims):
    data = {}
    data['text'] = claim.contents_of('claim_text', as_string=True, upper=False)
Despite the inconsistent structure, how can I iterate over all of the <claim-text> tags and their <claim-ref> child tags?
Answer 0 (score: 0)
I had a similar problem with XML documents. What I did was slice out the content between the tags by index:
    xml_document[xml_document.find("<claim-text>") + len("<claim-text>"):xml_document.find("</claim-text>")]
Then I used if statements to remove any extra tags left inside that content. On each iteration, I removed the already-parsed part of xml_document by index.
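
The slice-and-trim idea in this answer could be sketched roughly as follows. This is only one possible reading of it, not the answerer's actual code: the function name extract_claim_texts and the constants OPEN_TAG and CLOSE_TAG are made up for illustration, and the sketch assumes the <claim-text> tags carry no attributes, as in the excerpt from the question. It also departs from the answer in two places: the closing tag is searched for only after the opening tag (otherwise closing tags left over from outer elements would yield empty slices), and a chunk is cut short at the next nested opening tag so that every <claim-text> ends up with its own entry.

```python
import re

# Hypothetical names for this sketch; assumes <claim-text> has no attributes.
OPEN_TAG = "<claim-text>"
CLOSE_TAG = "</claim-text>"


def extract_claim_texts(xml_document):
    """Return the leading text of every <claim-text> element, in document order."""
    chunks = []
    while True:
        start = xml_document.find(OPEN_TAG)
        if start == -1:
            break  # no opening tag left
        start += len(OPEN_TAG)
        # Look for the closing tag only after the opening tag, so closing
        # tags that belong to already-parsed outer elements are skipped.
        end = xml_document.find(CLOSE_TAG, start)
        if end == -1:
            end = len(xml_document)
        # If a nested <claim-text> opens first, stop there; the nested
        # element is picked up on the next iteration.
        nested = xml_document.find(OPEN_TAG, start)
        if nested != -1 and nested < end:
            end = nested
        # Remove remaining tags such as <sub> or <claim-ref>, keeping their text.
        text = re.sub(r"<[^>]+>", "", xml_document[start:end]).strip()
        if text:
            chunks.append(text)
        # Drop the parsed part of the document by index.
        xml_document = xml_document[end:]
    return chunks


if __name__ == "__main__":
    sample = (
        "<claim-text>1. A device comprising:"
        "<claim-text>a base material; and</claim-text>"
        "</claim-text>"
    )
    for chunk in extract_claim_texts(sample):
        print(chunk)
```

One limitation to keep in mind: text that follows a nested closing tag but precedes the parent's closing tag is not collected by this sketch. The excerpt in the question has only whitespace in that position, so nothing is lost there, but other documents may differ.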