I have the following code that tries to parse an XML file and convert it to tabular form.
import xml.etree.ElementTree as ET

tree = ET.parse('smp.xml')
root = tree.getroot()
for text in root.iter('text'):
    print(text.attrib)
for text in root.iter('text'):
    print(text.text)
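Below is the output I have so far, which is far from the expected output. I am new to Python and don't know how to organize this output into a table and, in addition, add columns on the left for the page, row and column parent elements that correspond to each text/attribute.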
>>> import xml.etree.ElementTree as ET
>>> tree = ET.parse('smp.xml')
>>> root = tree.getroot()
>>>
>>> for text in root.iter('text'):
...     print(text.attrib)
...
{'width': '71.04', 'fontName': 'Arial', 'fontSize': '12.0', 'y': '83.42', 'x': '121.10', 'height': '12.00'}
{'width': '101.07', 'fontName': 'Arial', 'fontSize': '12.0', 'y': '124.82', 'x': '121.10', 'height': '12.00'}
{'width': '140.31', 'fontName': 'Arial', 'fontSize': '12.0', 'y': '207.65', 'x': '121.10', 'height': '12.00'}
{'width': '24.36', 'fontName': 'Arial', 'fontSize': '12.0', 'y': '69.62', 'x': '85.10', 'height': '12.00'}
{'width': '95.42', 'fontName': 'Arial', 'fontStyle': 'Bold', 'fontSize': '12.0', 'y': '239.45', 'x': '276.29', 'height': '12.00'}
{'width': '229.57', 'fontName': 'Arial', 'fontSize': '12.0', 'y': '266.81', 'x': '121.10', 'height': '12.00'}
{'width': '155.71', 'fontName': 'Arial', 'fontSize': '12.0', 'y': '266.81', 'x': '353.94', 'height': '12.00'}
{'width': '165.10', 'fontName': 'Arial', 'fontSize': '12.0', 'y': '294.41', 'x': '85.10', 'height': '12.00'}
{'width': '14.39', 'fontName': 'Arial', 'fontSize': '12.0', 'y': '294.41', 'x': '253.43', 'height': '12.00'}
{'width': '255.64', 'fontName': 'Arial', 'fontSize': '12.0', 'y': '294.41', 'x': '271.04', 'height': '12.00'}
{'width': '432.97', 'fontName': 'Arial', 'fontSize': '12.0', 'y': '501.43', 'x': '85.10', 'height': '12.00'}
{'width': '363.44', 'fontName': 'Arial', 'fontSize': '12.0', 'y': '69.62', 'x': '85.10', 'height': '12.00'}
{'width': '382.36', 'fontName': 'Arial', 'fontSize': '12.0', 'y': '83.42', 'x': '85.10', 'height': '12.00'}
>>> for text in root.iter('text'):
...     print(text.text)
...
achene
capsule
caryopsis
cypsela
fibrous drupe
follicle
legume
loment
nut
samara
schizocarp
silicle
utricle
This is my expected output:
╔══════╦═══════╦═════╦════════╦════════════╦══════════╦══════════╦════════╦════════╦════════╦════════╦═══════════╗
║ page ║ index ║ row ║ column ║ text ║ fontName ║ fontSize ║ x ║ y ║ width ║ height ║ fontStyle ║
╠══════╬═══════╬═════╬════════╬════════════╬══════════╬══════════╬════════╬════════╬════════╬════════╬═══════════╣
║ 0 ║ 0 ║ ║ ║ ║ ║ ║ ║ ║ ║ ║ ║
║ 1 ║ 1 ║ 0 ║ 0 ║ achene ║ Arial ║ 12 ║ 121.1 ║ 83.42 ║ 71.04 ║ 12 ║ ║
║ 1 ║ 1 ║ 1 ║ 0 ║ capsule ║ Arial ║ 12 ║ 121.1 ║ 124.82 ║ 101.07 ║ 12 ║ ║
║ 1 ║ 1 ║ 2 ║ 0 ║ caryopsis ║ Arial ║ 12 ║ 121.1 ║ 207.65 ║ 140.31 ║ 12 ║ ║
║ 2 ║ 2 ║ 0 ║ 0 ║ cypsela ║ Arial ║ 12 ║ 85.1 ║ 69.62 ║ 24.36 ║ 12 ║ ║
║ 3 ║ 3 ║ ║ ║ ║ ║ ║ ║ ║ ║ ║ ║
║ 4 ║ 4 ║ 0 ║ 0 ║ fibrous ║ Arial ║ 12 ║ 276.29 ║ 239.45 ║ 95.42 ║ 12 ║ Bold ║
║ 4 ║ 4 ║ 1 ║ 1 ║ follicle ║ Arial ║ 12 ║ 121.1 ║ 266.81 ║ 229.57 ║ 12 ║ ║
║ 4 ║ 4 ║ 1 ║ 1 ║ legume ║ Arial ║ 12 ║ 353.94 ║ 266.81 ║ 155.71 ║ 12 ║ ║
║ 4 ║ 4 ║ 2 ║ 2 ║ loment ║ Arial ║ 12 ║ 85.1 ║ 294.41 ║ 165.1 ║ 12 ║ ║
║ 4 ║ 4 ║ 2 ║ 2 ║ nut ║ Arial ║ 12 ║ 253.43 ║ 294.41 ║ 14.39 ║ 12 ║ ║
║ 4 ║ 4 ║ 2 ║ 2 ║ samara ║ Arial ║ 12 ║ 271.04 ║ 294.41 ║ 255.64 ║ 12 ║ ║
║ 4 ║ 4 ║ 3 ║ 0 ║ schizocarp ║ Arial ║ 12 ║ 85.1 ║ 501.43 ║ 432.97 ║ 12 ║ ║
║ 5 ║ 5 ║ 0 ║ 0 ║ silicle ║ Arial ║ 12 ║ 85.1 ║ 69.62 ║ 363.44 ║ 12 ║ ║
║ 5 ║ 5 ║ 1 ║ 1 ║ utricle ║ Arial ║ 12 ║ 85.1 ║ 83.42 ║ 382.36 ║ 12 ║ ║
║ 6 ║ 6 ║ ║ ║ ║ ║ ║ ║ ║ ║ ║ ║
╚══════╩═══════╩═════╩════════╩════════════╩══════════╩══════════╩════════╩════════╩════════╩════════╩═══════════╝
This is the XML file:
<document>
<page index="0"/>
<page index="1">
<row><column><text fontName="Arial" fontSize="12.0" x="121.10" y="83.42" width="71.04" height="12.00">achene</text></column></row>
<row><column><text fontName="Arial" fontSize="12.0" x="121.10" y="124.82" width="101.07" height="12.00">capsule</text></column></row>
<row><column><text fontName="Arial" fontSize="12.0" x="121.10" y="207.65" width="140.31" height="12.00">caryopsis</text></column></row>
</page>
<page index="2">
<row><column><text fontName="Arial" fontSize="12.0" x="85.10" y="69.62" width="24.36" height="12.00">cypsela</text></column></row>
</page>
<page index="3"/>
<page index="4">
<row><column><text fontName="Arial" fontSize="12.0" fontStyle="Bold" x="276.29" y="239.45" width="95.42" height="12.00">fibrous drupe</text></column></row>
<row><column><text fontName="Arial" fontSize="12.0" x="121.10" y="266.81" width="229.57" height="12.00">follicle</text></column>
<column><text fontName="Arial" fontSize="12.0" x="353.94" y="266.81" width="155.71" height="12.00">legume</text></column></row>
<row><column><text fontName="Arial" fontSize="12.0" x="85.10" y="294.41" width="165.10" height="12.00">loment – a type of indehiscent legume</text></column>
<column><text fontName="Arial" fontSize="12.0" x="253.43" y="294.41" width="14.39" height="12.00">nut</text></column>
<column><text fontName="Arial" fontSize="12.0" x="271.04" y="294.41" width="255.64" height="12.00">samara</text></column></row>
<row><column><text fontName="Arial" fontSize="12.0" x="85.10" y="501.43" width="432.97" height="12.00">schizocarp</text></column></row>
</page>
<page index="5">
<row><column><text fontName="Arial" fontSize="12.0" x="85.10" y="69.62" width="363.44" height="12.00">silicle</text></column></row>
<row><column><text fontName="Arial" fontSize="12.0" x="85.10" y="83.42" width="382.36" height="12.00">utricle</text></column></row>
</page>
<page index="6"/>
</document>
Thanks in advance for your help.
Answer 0 (score: 1)
This should get you close enough:
import pandas as pd
import xml.etree.ElementTree as ET

# xml_string is assumed to hold the XML document shown in the question
etree = ET.fromstring(xml_string)
dfcols = ['index', 'text', 'fontName', 'fontSize', 'x', 'y', 'width', 'height', 'fontStyle']
rows = []
for page in etree.iter('page'):
    for text in page.iter('text'):
        # one record per <text>; attributes missing from the element come back as None
        rows.append([page.get('index'), text.text] + [text.get(name) for name in dfcols[2:]])
# Collect the rows first and build the frame once (DataFrame.append no longer exists in pandas 2.0+)
df = pd.DataFrame(rows, columns=dfcols)
df.head()
Output:
index text fontName fontSize x y width height fontStyle
0 1 achene Arial 12.0 121.10 83.42 71.04 12.00 None
1 1 capsule Arial 12.0 121.10 124.82 101.07 12.00 None
2 1 caryopsis Arial 12.0 121.10 207.65 140.31 12.00 None
3 2 cypsela Arial 12.0 85.10 69.62 24.36 12.00 None
4 4 fibrous drupe Arial 12.0 276.29 239.45 95.42 12.00 Bold
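
The output above carries only the page index. To also get the row and column positions the question asks for, one option is to walk the nesting explicitly and number the elements with `enumerate`. The following is a minimal sketch using only the standard library, not a definitive implementation. It assumes the file always follows the document > page > row > column > text nesting shown in the question. The column numbering in the expected table is ambiguous, so counting each `<column>` within its `<row>` from 0 is an assumption here, as are the names `XML_SAMPLE`, `ATTRS`, `COLUMNS` and `flatten`, and the shortened sample data.

```python
import csv
import sys
import xml.etree.ElementTree as ET

# Shortened sample in the same shape as the question's XML (illustrative only).
XML_SAMPLE = """
<document>
<page index="0"/>
<page index="4">
<row><column><text fontName="Arial" fontSize="12.0" fontStyle="Bold" x="276.29" y="239.45" width="95.42" height="12.00">fibrous drupe</text></column></row>
<row><column><text fontName="Arial" fontSize="12.0" x="121.10" y="266.81" width="229.57" height="12.00">follicle</text></column>
<column><text fontName="Arial" fontSize="12.0" x="353.94" y="266.81" width="155.71" height="12.00">legume</text></column></row>
</page>
</document>
"""

ATTRS = ["fontName", "fontSize", "x", "y", "width", "height", "fontStyle"]
COLUMNS = ["page", "row", "column", "text"] + ATTRS


def flatten(root):
    """Return one dict per <text>, tagged with its page, row and column."""
    records = []
    for page in root.iter("page"):
        page_index = page.get("index")
        rows = page.findall("row")
        if not rows:
            # Keep empty pages as a record that holds only the page index.
            records.append({"page": page_index})
            continue
        for row_no, row in enumerate(rows):
            for col_no, column in enumerate(row.findall("column")):
                for text in column.findall("text"):
                    record = {"page": page_index, "row": row_no,
                              "column": col_no, "text": text.text}
                    # Attributes absent from the element become empty strings.
                    record.update({name: text.get(name, "") for name in ATTRS})
                    records.append(record)
    return records


records = flatten(ET.fromstring(XML_SAMPLE))

# Write the records as CSV; restval fills the blanks for empty pages.
writer = csv.DictWriter(sys.stdout, fieldnames=COLUMNS, restval="")
writer.writeheader()
writer.writerows(records)
```

If pandas is available, the same list can be turned into a table with `pd.DataFrame(records, columns=COLUMNS)`; the empty pages then appear as rows whose other cells are missing values.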