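Hi all, I'm having trouble parsing an XML file and entering the data into SQLite. The format looks like the sample below; I need to store the characters held in the tokens, such as 111, AAA, BBB, etc.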
<DOCUMENT>
<PAGE width="544.252" height="634.961" number="1" id="p1">
<MEDIABOX x1="0" y1="0" x2="544.252" y2="634.961"/>
<BLOCK id="p1_b1">
<TEXT width="37.7" height="74.124" id="p1_t1" x="51.1" y="20.8652">
<TOKEN sid="p1_s11" id="p1_w1" font-name="Verdanae" bold="yes" italic="no">111</TOKEN>
</TEXT>
</BLOCK>
<BLOCK id="p1_b3">
<TEXT width="151.267" height="10.725" id="p1_t6" x="24.099" y="572.096">
<TOKEN sid="p1_s35" id="p1_w22" font-name="Verdanae" bold="yes" italic="yes">AAA</TOKEN>
<TOKEN sid="p1_s36" id="p1_w23" font-name="verdanae" bold="yes" italic="no">BBB</TOKEN>
<TOKEN sid="p1_s37" id="p1_w24" font-name="verdanae" bold="yes" italic="no">CCC</TOKEN>
</TEXT>
</BLOCK>
<BLOCK id="p1_b4">
<TEXT width="82.72" height="26" id="p1_t7" x="55.426" y="138.026">
<TOKEN sid="p1_s42" id="p1_w29" font-name="verdanae" bold="yes" italic="no">DDD</TOKEN>
<TOKEN sid="p1_s43" id="p1_w30" font-name="verdanae" bold="yes" italic="no">EEE</TOKEN>
</TEXT>
<TEXT width="101.74" height="26" id="p1_t8" x="55.406" y="162.026">
<TOKEN sid="p1_s45" id="p1_w31" font-name="verdanae" bold="yes" italic="no">FFF</TOKEN>
</TEXT>
<TEXT width="152.96" height="26" id="p1_t9" x="55.406" y="186.026">
<TOKEN sid="p1_s47" id="p1_w32" font-name="verdanae" bold="yes" italic="no">GGG</TOKEN>
<TOKEN sid="p1_s48" id="p1_w33" font-name="verdanae" bold="yes" italic="no">HHH</TOKEN>
</TEXT>
</BLOCK>
</PAGE>
</DOCUMENT>
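In .NET I did this with three foreach loops: 1. over "DOCUMENT/PAGE/BLOCK", 2. over "TEXT", 3. over "TOKEN", and then the data went into the DB. I don't know how to do this in Python; I'm trying to use the lxml module.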
Answer 0 (score: 1)
Do you mean this?
>>> xml = """<DOCUMENT>
... <PAGE width="544.252" height="634.961" number="1" id="p1">
... <MEDIABOX x1="0" y1="0" x2="544.252" y2="634.961"/>
...
... <BLOCK id="p1_b1">
...
... <TEXT width="37.7" height="74.124" id="p1_t1" x="51.1" y="20.8652">
... <TOKEN sid="p1_s11" id="p1_w1" font-name="Verdanae" bold="yes" italic="no">111</TOKEN>
... </TEXT>
... </BLOCK>
...
... <BLOCK id="p1_b3">
...
... <TEXT width="151.267" height="10.725" id="p1_t6" x="24.099" y="572.096">
... <TOKEN sid="p1_s35" id="p1_w22" font-name="Verdanae" bold="yes" italic="yes">AAA</TOKEN>
... <TOKEN sid="p1_s36" id="p1_w23" font-name="verdanae" bold="yes" italic="no">BBB</TOKEN>
... <TOKEN sid="p1_s37" id="p1_w24" font-name="verdanae" bold="yes" italic="no">CCC</TOKEN>
... </TEXT>
... </BLOCK>
...
... <BLOCK id="p1_b4">
...
... <TEXT width="82.72" height="26" id="p1_t7" x="55.426" y="138.026">
... <TOKEN sid="p1_s42" id="p1_w29" font-name="verdanae" bold="yes" italic="no">DDD</TOKEN>
... <TOKEN sid="p1_s43" id="p1_w30" font-name="verdanae" bold="yes" italic="no">EEE</TOKEN>
... </TEXT>
...
... <TEXT width="101.74" height="26" id="p1_t8" x="55.406" y="162.026">
... <TOKEN sid="p1_s45" id="p1_w31" font-name="verdanae" bold="yes" italic="no">FFF</TOKEN>
... </TEXT>
...
... <TEXT width="152.96" height="26" id="p1_t9" x="55.406" y="186.026">
... <TOKEN sid="p1_s47" id="p1_w32" font-name="verdanae" bold="yes" italic="no">GGG</TOKEN>
... <TOKEN sid="p1_s48" id="p1_w33" font-name="verdanae" bold="yes" italic="no">HHH</TOKEN>
... </TEXT>
... </BLOCK>
... </PAGE>
... </DOCUMENT>"""
>>> from lxml import etree
>>> parsed = etree.fromstring(xml)
>>> tokens = parsed.xpath('//TOKEN/text()')
>>> tokens
['111', 'AAA', 'BBB', 'CCC', 'DDD', 'EEE', 'FFF', 'GGG', 'HHH']
>>>
Or this?
>>> parsed = etree.fromstring(xml)
>>> for text in parsed.xpath('//PAGE/BLOCK/TEXT'):
...     print(text.xpath('./TOKEN/text()'))
...
['111']
['AAA', 'BBB', 'CCC']
['DDD', 'EEE']
['FFF']
['GGG', 'HHH']
>>>
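To cover the SQLite half of the question, here is a minimal sketch of the same three nested loops (BLOCK, then TEXT, then TOKEN) feeding rows into a database through the standard sqlite3 module. The table name tokens, its columns, and the helper load_tokens are hypothetical names chosen for illustration, not something from the question; the XML is a shortened copy of the sample, and the fallback to xml.etree.ElementTree is only there so the sketch also runs where lxml is not installed.

```python
import sqlite3

try:
    from lxml import etree  # the library used in the answer above
except ImportError:  # assumption: fall back to the stdlib parser if lxml is missing
    import xml.etree.ElementTree as etree

# Shortened copy of the XML from the question.
XML = """<DOCUMENT>
<PAGE width="544.252" height="634.961" number="1" id="p1">
<BLOCK id="p1_b1">
<TEXT width="37.7" height="74.124" id="p1_t1" x="51.1" y="20.8652">
<TOKEN sid="p1_s11" id="p1_w1" font-name="Verdanae" bold="yes" italic="no">111</TOKEN>
</TEXT>
</BLOCK>
<BLOCK id="p1_b3">
<TEXT width="151.267" height="10.725" id="p1_t6" x="24.099" y="572.096">
<TOKEN sid="p1_s35" id="p1_w22" font-name="Verdanae" bold="yes" italic="yes">AAA</TOKEN>
<TOKEN sid="p1_s36" id="p1_w23" font-name="verdanae" bold="yes" italic="no">BBB</TOKEN>
</TEXT>
</BLOCK>
</PAGE>
</DOCUMENT>"""


def load_tokens(xml_text, conn):
    """Insert one row per TOKEN, keeping the ids of its BLOCK and TEXT parents."""
    # Hypothetical schema: adjust the columns to whatever the real table needs.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tokens "
        "(block_id TEXT, text_id TEXT, token_id TEXT, value TEXT)"
    )
    root = etree.fromstring(xml_text)  # root is the DOCUMENT element
    rows = []
    for block in root.findall("PAGE/BLOCK"):      # loop 1: DOCUMENT/PAGE/BLOCK
        for text in block.findall("TEXT"):        # loop 2: TEXT
            for token in text.findall("TOKEN"):   # loop 3: TOKEN
                rows.append((block.get("id"), text.get("id"),
                             token.get("id"), token.text))
    # Parameterized insert of all collected rows in one call.
    conn.executemany("INSERT INTO tokens VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")  # pass a file path instead for a persistent database
count = load_tokens(XML, conn)
stored = conn.execute(
    "SELECT block_id, value FROM tokens ORDER BY rowid"
).fetchall()
print(count, stored)
```

Collecting the rows first and calling executemany once keeps the inserts in a single transaction; for a file on disk, etree.parse(path).getroot() could replace etree.fromstring.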