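I'm trying to build a script to read an XML file. This is my first time parsing XML, and I'm using Python with xml.etree.ElementTree. The part of the file I want to process looks like this: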
<component>
<section>
<id root="42CB916B-BB58-44A0-B8D2-89B4B27F04DF" />
<code code="34089-3" codeSystem="2.16.840.1.113883.6.1" codeSystemName="LOINC" displayName="DESCRIPTION SECTION" />
<title mediaType="text/x-hl7-title+xml">DESCRIPTION</title>
<text>
<paragraph>Renese<sup>®</sup> is designated generically as polythiazide, and chemically as 2<content styleCode="italics">H</content>-1,2,4-Benzothiadiazine-7-sulfonamide, 6-chloro-3,4-dihydro-2-methyl-3-[[(2,2,2-trifluoroethyl)thio]methyl]-, 1,1-dioxide. It is a white crystalline substance, insoluble in water but readily soluble in alkaline solution.</paragraph>
<paragraph>Inert Ingredients: dibasic calcium phosphate; lactose; magnesium stearate; polyethylene glycol; sodium lauryl sulfate; starch; vanillin. The 2 mg tablets also contain: Yellow 6; Yellow 10.</paragraph>
</text>
<effectiveTime value="20051214" />
</section>
</component>
<component>
<section>
<id root="CF5D392D-F637-417C-810A-7F0B3773264F" />
<code code="42229-5" codeSystem="2.16.840.1.113883.6.1" codeSystemName="LOINC" displayName="SPL UNCLASSIFIED SECTION" />
<title mediaType="text/x-hl7-title+xml">ACTION</title>
<text>
<paragraph>The mechanism of action results in an interference with the renal tubular mechanism of electrolyte reabsorption. At maximal therapeutic dosage all thiazides are approximately equal in their diuretic potency. The mechanism whereby thiazides function in the control of hypertension is unknown.</paragraph>
</text>
<effectiveTime value="20051214" />
</section>
</component>
The full file can be downloaded from:
This is my code:
import xml.etree.ElementTree as ElementTree
import re
with open("ABD6ECF0-DC8E-41DE-89F2-1E36ED9D6535.xml") as f:
    xmlstring = f.read()
# Remove the default namespace definition (xmlns="http://some/namespace")
xmlstring = re.sub(r'\sxmlns="[^"]+"', '', xmlstring, count=1)
tree = ElementTree.fromstring(xmlstring)
for title in tree.iter('title'):
    print(title.text)
So far I'm able to print the titles, but I also want to print the corresponding text captured in the <paragraph> tags.
I have tried this:
for title in tree.iter('title'):
    print(title.text)
    for paragraph in title.iter('paragraph'):
        print(paragraph.text)
But I get no output from the paragraphs.
If I do this instead:
for title in tree.iter('title'):
    print(title.text)
    for paragraph in tree.iter('paragraph'):
        print(paragraph.text)
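the paragraph text is printed, but (obviously) all the paragraphs are printed together for every title found in the XML structure.
I want to find a way to 1) identify each title and 2) print its corresponding paragraphs. How can I do that?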
Answer 0 (score: 1)
If you're willing to use lxml, here is a solution using XPath:
import re
from lxml.etree import fromstring
with open("ABD6ECF0-DC8E-41DE-89F2-1E36ED9D6535.xml") as f:
    xmlstring = f.read()
# Remove the default namespace definition so plain tag names match
xmlstring = re.sub(r'\sxmlns="[^"]+"', '', xmlstring, count=1)
# lxml rejects str input that carries an XML encoding declaration, hence we encode to bytes
doc = fromstring(xmlstring.encode())
for title in doc.xpath('//title'):  # for all title nodes
    title_text = title.xpath('./text()')  # get text value of the node
    # get all text values of the paragraph nodes that appear lower (//paragraph)
    # in the hierarchy than the parent (..) of <title>
    paragraphs_for_title = title.xpath('..//paragraph/text()')
    print(title_text[0] if title_text else '')
    for paragraph in paragraphs_for_title:
        print(paragraph)
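For comparison, the same pairing can probably be done with the standard library alone. The attempt in the question prints nothing because <paragraph> is not nested inside <title>; both sit under the same <section>, so the loop has to start from the section. The sketch below assumes the default namespace has already been stripped as in the question. The SAMPLE string is a trimmed-down, hypothetical stand-in for the real file, and titles_with_paragraphs is a helper name invented for this example. It also uses itertext() rather than .text, since .text stops at the first child element such as <sup> or <content>.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened sample modelled on the excerpt in the question.
SAMPLE = """<document>
  <component>
    <section>
      <title>DESCRIPTION</title>
      <text>
        <paragraph>Renese<sup>®</sup> is designated generically as polythiazide.</paragraph>
        <paragraph>Inert Ingredients: lactose; starch.</paragraph>
      </text>
    </section>
  </component>
  <component>
    <section>
      <title>ACTION</title>
      <text>
        <paragraph>The mechanism whereby thiazides function is unknown.</paragraph>
      </text>
    </section>
  </component>
</document>"""


def titles_with_paragraphs(root):
    """Return (title, [paragraph, ...]) pairs, one per <section> that has a <title>."""
    result = []
    for section in root.iter("section"):
        title = section.find("title")  # find() looks at direct children only
        if title is None:
            continue
        # itertext() joins the element's own text with the text of nested
        # elements (<sup>, <content>, ...) and the text that follows them.
        paragraphs = [
            "".join(p.itertext()).strip() for p in section.iter("paragraph")
        ]
        result.append(("".join(title.itertext()).strip(), paragraphs))
    return result


root = ET.fromstring(SAMPLE)
for title, paragraphs in titles_with_paragraphs(root):
    print(title)
    for paragraph in paragraphs:
        print("  " + paragraph)
```

One caveat: section.iter("paragraph") walks all descendants, so if the real file nests one <section> inside another, the outer section would also list the inner section's paragraphs. In that case a narrower search such as section.findall("text/paragraph") may be the better fit.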