Parsing an XML file block by block and getting the values inside each block

Date: 2017-03-27 23:01:36

Tags: python xml parsing xml-parsing

I have a 10 GB XML file that contains a list of different blocks. Here is a snippet of my file:

<?xml version="1.0" encoding="UTF-8"?>

<?import java.lang.*?>
<?import javafx.scene.control.*?>
<?import javafx.scene.layout.*?>
<?import javafx.scene.layout.VBox?>

<GridPane xmlns:fx="http://javafx.com/fxml/1" xmlns="http://javafx.com/javafx/2.2" fx:controller="application.Main">
  <children>
    <GridPane>
      <children>
        <GridPane>
          <children>
            <TextArea editable="false" mouseTransparent="false" prefWidth="200.0" text="Monday" wrapText="true" GridPane.columnIndex="1" GridPane.rowIndex="0" />
            <TextArea editable="false" mouseTransparent="false" prefWidth="200.0" text="Tuesday" wrapText="true" GridPane.columnIndex="2" GridPane.rowIndex="0" />
            <TextArea editable="false" mouseTransparent="false" prefWidth="200.0" text="Wednesday" wrapText="true" GridPane.columnIndex="3" GridPane.rowIndex="0" />
            <TextArea editable="false" mouseTransparent="false" prefWidth="200.0" text="Thursday" wrapText="true" GridPane.columnIndex="4" GridPane.rowIndex="0" />
            <TextArea editable="false" mouseTransparent="false" prefWidth="200.0" text="Friday" wrapText="true" GridPane.columnIndex="5" GridPane.rowIndex="0" />
            <TextArea editable="false" mouseTransparent="false" prefWidth="200.0" text="Saturday" wrapText="true" GridPane.columnIndex="6" GridPane.rowIndex="0" />
            <TextArea editable="false" mouseTransparent="false" prefWidth="200.0" text="Sunday" wrapText="true" GridPane.columnIndex="7" GridPane.rowIndex="0" />
            <TextArea editable="false" mouseTransparent="false" prefWidth="200.0" text="Monday" wrapText="true" GridPane.columnIndex="1" GridPane.rowIndex="9" />
            <TextArea editable="false" mouseTransparent="false" prefWidth="200.0" text="Tuesday" wrapText="true" GridPane.columnIndex="2" GridPane.rowIndex="9" />
            <TextArea editable="false" mouseTransparent="false" prefWidth="200.0" text="Wednesday" wrapText="true" GridPane.columnIndex="3" GridPane.rowIndex="9" />
            <TextArea editable="false" mouseTransparent="false" prefWidth="200.0" text="Thursday" wrapText="true" GridPane.columnIndex="4" GridPane.rowIndex="9" />
            <TextArea editable="false" mouseTransparent="false" prefWidth="200.0" text="Friday" wrapText="true" GridPane.columnIndex="5" GridPane.rowIndex="9" />
            <TextArea editable="false" mouseTransparent="false" prefWidth="200.0" text="Saturday" wrapText="true" GridPane.columnIndex="6" GridPane.rowIndex="9" />
            <TextArea editable="false" mouseTransparent="false" prefWidth="200.0" text="Sunday" wrapText="true" GridPane.columnIndex="7" GridPane.rowIndex="9" />
            <Button id="prev" fx:id="prev2" mnemonicParsing="false" onAction="#ClickMinus" prefHeight="30.0" prefWidth="70.0" text="prev" GridPane.columnIndex="8" GridPane.rowIndex="0" />
            <Button fx:id="next" mnemonicParsing="false" onAction="#ClickPlus" prefHeight="29.999900000002526" prefWidth="70.00009999999747" text="next" GridPane.columnIndex="0" GridPane.rowIndex="9" />
            <Button fx:id="next2" mnemonicParsing="false" onAction="#ClickPlus" prefHeight="30.0" prefWidth="70.0" text="next" GridPane.columnIndex="8" GridPane.rowIndex="9" />
            <Button fx:id="prev" mnemonicParsing="false" onAction="#ClickMinus" prefHeight="30.0" prefWidth="70.0" text="prev" GridPane.columnIndex="0" GridPane.rowIndex="0" />
            <TextArea fx:id="week1" prefWidth="200.0" text="Week x Year x" wrapText="true" GridPane.columnIndex="0" GridPane.rowIndex="2" />
            <TextArea fx:id="week2" prefWidth="200.0" text="Week x Year x" wrapText="true" GridPane.columnIndex="0" GridPane.rowIndex="4" />
            <TextArea fx:id="week4" prefWidth="200.0" text="Week x Year x" wrapText="true" GridPane.columnIndex="0" GridPane.rowIndex="8" />
            <Label fx:id="lab11" onMouseClicked="#labClick" prefHeight="40.0" prefWidth="150.0" text="" GridPane.columnIndex="1" GridPane.rowIndex="1" />
            <Label fx:id="lab12" onMouseClicked="#labClick" prefHeight="44.0" prefWidth="139.0" text="Label" GridPane.columnIndex="2" GridPane.rowIndex="1" />
            <Label fx:id="lab13" onMouseClicked="#labClick" prefHeight="44.0" prefWidth="139.0" text="Label" GridPane.columnIndex="3" GridPane.rowIndex="1" />
            <Label fx:id="lab14" onMouseClicked="#labClick" prefHeight="44.0" prefWidth="139.0" text="Label" GridPane.columnIndex="4" GridPane.rowIndex="1" />
            <Label fx:id="lab15" onMouseClicked="#labClick" prefHeight="44.0" prefWidth="139.0" text="Label" GridPane.columnIndex="5" GridPane.rowIndex="1" />
            <Label fx:id="lab21" minHeight="13.0" onMouseClicked="#labClick" prefHeight="40.0" prefWidth="149.9998779296875" text="Label" GridPane.columnIndex="1" GridPane.rowIndex="3" />
            <Label fx:id="lab22" onMouseClicked="#labClick" prefHeight="44.0" prefWidth="139.0" text="Label" GridPane.columnIndex="2" GridPane.rowIndex="3" />
            <Label fx:id="lab23" onMouseClicked="#labClick" prefHeight="44.0" prefWidth="139.0" text="Label" GridPane.columnIndex="3" GridPane.rowIndex="3" />
            <Label fx:id="lab32" onMouseClicked="#labClick" prefHeight="44.0" prefWidth="139.0" text="Label" GridPane.columnIndex="2" GridPane.rowIndex="5" />
            <Label fx:id="lab31" onMouseClicked="#labClick" prefHeight="44.0" prefWidth="139.0" text="Label" GridPane.columnIndex="1" GridPane.rowIndex="5" />
            <Label fx:id="lab33" onMouseClicked="#labClick" prefHeight="44.0" prefWidth="139.0" text="Label" GridPane.columnIndex="3" GridPane.rowIndex="5" />
            <Label fx:id="lab34" onMouseClicked="#labClick" prefHeight="44.0" prefWidth="139.0" text="Label" GridPane.columnIndex="4" GridPane.rowIndex="5" />
            <Label fx:id="lab24" onMouseClicked="#labClick" prefHeight="44.0" prefWidth="139.0" text="Label" GridPane.columnIndex="4" GridPane.rowIndex="3" />
            <Label fx:id="lab25" onMouseClicked="#labClick" prefHeight="44.0" prefWidth="139.0" text="Label" GridPane.columnIndex="5" GridPane.rowIndex="3" />
            <Label fx:id="lab35" onMouseClicked="#labClick" prefHeight="44.0" prefWidth="139.0" text="Label" GridPane.columnIndex="5" GridPane.rowIndex="5" />
            <Label fx:id="lab41" onMouseClicked="#labClick" prefHeight="40.000099999997474" prefWidth="150.0" text="Label" GridPane.columnIndex="1" GridPane.rowIndex="7" />
            <Label fx:id="lab42" onMouseClicked="#labClick" prefHeight="44.0" prefWidth="139.0" text="Label" GridPane.columnIndex="2" GridPane.rowIndex="7" />
            <Label fx:id="lab16" onMouseClicked="#labClick" prefHeight="44.0" prefWidth="139.0" text="Label" GridPane.columnIndex="6" GridPane.rowIndex="1" />
            <Label fx:id="lab17" onMouseClicked="#labClick" prefHeight="44.0" prefWidth="139.0" text="Label" GridPane.columnIndex="7" GridPane.rowIndex="1" />
            <Label fx:id="lab26" onMouseClicked="#labClick" prefHeight="44.0" prefWidth="139.0" text="Label" GridPane.columnIndex="6" GridPane.rowIndex="3" />
            <Label fx:id="lab43" onMouseClicked="#labClick" prefHeight="44.0" prefWidth="139.0" text="Label" GridPane.columnIndex="3" GridPane.rowIndex="7" />
            <Label fx:id="lab44" onMouseClicked="#labClick" prefHeight="44.0" prefWidth="139.0" text="Label" GridPane.columnIndex="4" GridPane.rowIndex="7" />
            <Label fx:id="lab45" onMouseClicked="#labClick" prefHeight="44.0" prefWidth="139.0" text="Label" GridPane.columnIndex="5" GridPane.rowIndex="7" />
            <Label fx:id="lab36" onMouseClicked="#labClick" prefHeight="44.0" prefWidth="139.0" text="Label" GridPane.columnIndex="6" GridPane.rowIndex="5" />
            <Label fx:id="lab46" onMouseClicked="#labClick" prefHeight="39.9998779296875" prefWidth="150.0" text="Label" GridPane.columnIndex="6" GridPane.rowIndex="7" />
            <Label fx:id="lab47" onMouseClicked="#labClick" prefHeight="44.0" prefWidth="139.0" text="Label" GridPane.columnIndex="7" GridPane.rowIndex="7" />
            <Label fx:id="lab37" onMouseClicked="#labClick" prefHeight="40.000099999997474" prefWidth="139.0" text="Label" GridPane.columnIndex="7" GridPane.rowIndex="5" />
            <Label fx:id="lab27" onMouseClicked="#labClick" prefHeight="44.0" prefWidth="139.0" text="Label" GridPane.columnIndex="7" GridPane.rowIndex="3" />
            <TextArea fx:id="week3" prefHeight="100.00009999999747" prefWidth="70.0" text="Week x Year x" wrapText="true" GridPane.columnIndex="0" GridPane.rowIndex="6" />
            <Button fx:id="start" mnemonicParsing="false" onAction="#ClickStart" prefHeight="30.0" prefWidth="70.0" text="Start" GridPane.columnIndex="0" GridPane.rowIndex="1" />
            <VBox fx:id="vb11" prefHeight="200.0" prefWidth="100.0" GridPane.columnIndex="1" GridPane.rowIndex="2" />
            <VBox fx:id="vb12" prefHeight="200.0" prefWidth="100.0" GridPane.columnIndex="2" GridPane.rowIndex="2" />
            <VBox id="vb12" fx:id="vb13" prefHeight="200.0" prefWidth="100.0" GridPane.columnIndex="3" GridPane.rowIndex="2" />
            <VBox id="vb12" fx:id="vb21" prefHeight="200.0" prefWidth="100.0" GridPane.columnIndex="1" GridPane.rowIndex="4" />
            <VBox id="vb12" fx:id="vb22" prefHeight="200.0" prefWidth="100.0" GridPane.columnIndex="2" GridPane.rowIndex="4" />
            <VBox id="vb12" fx:id="vb23" prefHeight="200.0" prefWidth="100.0" GridPane.columnIndex="3" GridPane.rowIndex="4" />
            <VBox id="vb12" fx:id="vb25" prefHeight="200.0" prefWidth="100.0" GridPane.columnIndex="5" GridPane.rowIndex="4" />
            <VBox id="vb12" fx:id="vb31" prefHeight="200.0" prefWidth="100.0" GridPane.columnIndex="1" GridPane.rowIndex="6" />
            <VBox id="vb12" fx:id="vb32" prefHeight="200.0" prefWidth="100.0" GridPane.columnIndex="2" GridPane.rowIndex="6" />
            <VBox id="vb12" fx:id="vb14" prefHeight="200.0" prefWidth="100.0" GridPane.columnIndex="4" GridPane.rowIndex="2" />
            <VBox id="vb12" fx:id="vb15" prefHeight="200.0" prefWidth="100.0" GridPane.columnIndex="5" GridPane.rowIndex="2" />
            <VBox id="vb12" fx:id="vb35" prefHeight="200.0" prefWidth="100.0" GridPane.columnIndex="5" GridPane.rowIndex="6" />
            <VBox id="vb12" fx:id="vb33" prefHeight="200.0" prefWidth="100.0" GridPane.columnIndex="3" GridPane.rowIndex="6" />
            <VBox id="vb12" fx:id="vb41" prefHeight="200.0" prefWidth="100.0" GridPane.columnIndex="1" GridPane.rowIndex="8" />
            <VBox id="vb12" fx:id="vb42" prefHeight="200.0" prefWidth="100.0" GridPane.columnIndex="2" GridPane.rowIndex="8" />
            <VBox id="vb12" fx:id="vb43" prefHeight="200.0" prefWidth="100.0" GridPane.columnIndex="3" GridPane.rowIndex="8" />
            <VBox id="vb12" fx:id="vb44" prefHeight="200.0" prefWidth="100.0" GridPane.columnIndex="4" GridPane.rowIndex="8" />
            <VBox id="vb12" fx:id="vb45" prefHeight="200.0" prefWidth="100.0" GridPane.columnIndex="5" GridPane.rowIndex="8" />
            <VBox id="vb12" fx:id="vb24" prefHeight="200.0" prefWidth="100.0" GridPane.columnIndex="4" GridPane.rowIndex="4" />
            <VBox id="vb12" fx:id="vb26" prefHeight="200.0" prefWidth="100.0" GridPane.columnIndex="6" GridPane.rowIndex="4" />
            <VBox id="vb12" fx:id="vb36" prefHeight="200.0" prefWidth="100.0" GridPane.columnIndex="6" GridPane.rowIndex="6" />
            <VBox id="vb12" fx:id="vb16" prefHeight="200.0" prefWidth="100.0" GridPane.columnIndex="6" GridPane.rowIndex="2" />
            <VBox id="vb12" fx:id="vb17" prefHeight="200.0" prefWidth="100.0" GridPane.columnIndex="7" GridPane.rowIndex="2" />
            <VBox id="vb12" fx:id="vb27" prefHeight="200.0" prefWidth="100.0" GridPane.columnIndex="7" GridPane.rowIndex="4" />
            <VBox id="vb12" fx:id="vb37" prefHeight="200.0" prefWidth="100.0" GridPane.columnIndex="7" GridPane.rowIndex="6" />
            <VBox id="vb12" fx:id="vb34" prefHeight="200.0" prefWidth="100.0" GridPane.columnIndex="4" GridPane.rowIndex="6" />
            <VBox id="vb12" fx:id="vb46" prefHeight="200.0" prefWidth="100.0" GridPane.columnIndex="6" GridPane.rowIndex="8" />
            <VBox id="vb12" fx:id="vb47" prefHeight="200.0" prefWidth="100.0" GridPane.columnIndex="7" GridPane.rowIndex="8" />
          </children>
          <columnConstraints>
            <ColumnConstraints hgrow="SOMETIMES" minWidth="10.0" prefWidth="70.0" />
            <ColumnConstraints hgrow="SOMETIMES" minWidth="10.0" prefWidth="150.0" />
            <ColumnConstraints hgrow="SOMETIMES" minWidth="10.0" prefWidth="150.0" />
            <ColumnConstraints hgrow="SOMETIMES" minWidth="10.0" prefWidth="150.0" />
            <ColumnConstraints hgrow="SOMETIMES" minWidth="10.0" prefWidth="150.0" />
            <ColumnConstraints hgrow="SOMETIMES" minWidth="10.0" prefWidth="150.0" />
            <ColumnConstraints hgrow="SOMETIMES" minWidth="10.0" prefWidth="150.0" />
            <ColumnConstraints hgrow="SOMETIMES" minWidth="10.0" prefWidth="150.0" />
            <ColumnConstraints hgrow="SOMETIMES" minWidth="10.0" prefWidth="70.0" />
          </columnConstraints>
          <rowConstraints>
            <RowConstraints maxHeight="40.0" minHeight="10.0" prefHeight="40.0" vgrow="SOMETIMES" />
            <RowConstraints maxHeight="40.0" minHeight="10.0" prefHeight="40.0" vgrow="SOMETIMES" />
            <RowConstraints maxHeight="100.0" minHeight="10.0" prefHeight="100.0" vgrow="SOMETIMES" />
            <RowConstraints maxHeight="40.0" minHeight="10.0" prefHeight="40.0" vgrow="SOMETIMES" />
            <RowConstraints maxHeight="100.0" minHeight="10.0" prefHeight="100.0" vgrow="SOMETIMES" />
            <RowConstraints maxHeight="40.0" minHeight="10.0" prefHeight="40.0" vgrow="SOMETIMES" />
            <RowConstraints maxHeight="100.0" minHeight="10.0" prefHeight="100.0" vgrow="SOMETIMES" />
            <RowConstraints maxHeight="40.0" minHeight="10.0" prefHeight="40.0" vgrow="SOMETIMES" />
            <RowConstraints maxHeight="100.0" minHeight="10.0" prefHeight="100.0" vgrow="SOMETIMES" />
            <RowConstraints maxHeight="40.0" minHeight="10.0" prefHeight="40.0" vgrow="SOMETIMES" />
          </rowConstraints>
        </GridPane>
      </children>
      <columnConstraints>
        <ColumnConstraints hgrow="SOMETIMES" minWidth="10.0" />
      </columnConstraints>
      <rowConstraints>
        <RowConstraints minHeight="10.0" vgrow="SOMETIMES" />
      </rowConstraints>
    </GridPane>
  </children>
  <columnConstraints>
    <ColumnConstraints hgrow="SOMETIMES" minWidth="10.0" />
  </columnConstraints>
  <rowConstraints>
    <RowConstraints minHeight="10.0" vgrow="SOMETIMES" />
  </rowConstraints>
</GridPane>

So my goal is to parse my file sequentially using iterparse from cElementTree, but I want to get one whole block at a time. For example, I would like to get the entire image block and then parse the values inside it. Given the following blocks:

<image>
  <ref>www.test.com</ref>
  <label/>
  <number>0</number>
  <ID>ID0</ID>
  <name>test1</name>
  <comment>
    <line number="0">This is a comment</line>
    <line number="1">This is also another comment</line>
  </comment>
  <creationDate>2017-02-13T15:46:16-04:00</creationDate>
</image>
<result>
  <ref>www.test1.com</ref>
  <label/>
  <number>001</number>
  <ID>RE1</ID>
  <name>test2</name>
  <comment>
    <line number="0">This is a comment2</line>
  </comment>
  <creationDate>2017-01-13T15:46:16-04:00</creationDate>
</result>
<image>
  <ref>www.test3.com</ref>
  <label/>
  <number>1</number>
  <ID>ID1</ID>
  <value>10030</value>
  <name>test3</name>
  <comment>
    <line number="0">This is a comment3</line>
  </comment>
  <creationDate>2017-04-13T15:46:16-04:00</creationDate>
</image>

I need to get the first image block and then print the values inside it: www.test.com, 0, ID0, test1, This is a comment, and 2017-02-13T15:46:16-04:00

So I used the following code, but it seems to just read the XML file line by line, and it cannot print the values in each line or element:

*<image>... </image>*

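Could you help me solve this problem? I am new to XML parsing. I would also like to convert each parsed block into a Python dictionary. Is that possible?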

1 Answer:

Answer 0: (score: 0)

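It is not reading the XML file "line by line". By default, iterparse returns an end event at the end of each element. That is, if your input file looks like this: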

<data>
  <widgets location="earth">
    <widget name="gizmo"/>
    <widget name="gadget"/>
    <widget name="thingamajig"/>
  </widgets>
</data>

The sequence of values returned by a simple call to iterparse would be:

end <Element widget at 0x7f31e3132488>
end <Element widget at 0x7f31e3123f38>
end <Element widget at 0x7f31e3123ef0>
end <Element widgets at 0x7f31e31327a0>
end <Element data at 0x7f31e31324d0>
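The sequence above can be reproduced with a loop like the following. This is a minimal sketch that assumes the standard library's xml.etree.ElementTree rather than lxml, feeds the widget document from an in-memory buffer instead of a file, and prints tag names rather than element addresses; the helper name end_events is made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# The widget document from the answer, held in memory for the demo.
XML = b"""<data>
  <widgets location="earth">
    <widget name="gizmo"/>
    <widget name="gadget"/>
    <widget name="thingamajig"/>
  </widgets>
</data>"""


def end_events(source):
    """Collect (event, tag) pairs from a default iterparse call.

    With no `events` argument, iterparse reports only 'end' events,
    one per element, in the order the closing tags are encountered.
    """
    return [(event, element.tag) for event, element in ET.iterparse(source)]


events = end_events(io.BytesIO(XML))
for event, tag in events:
    print(event, tag)
```

Children are reported before their parents because an element only "ends" once everything inside it has been read.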

If you want, you can also receive start events at the beginning of each element, like this:

for event, element in etree.iterparse(fd, events=('start', 'end')):
    print(event, element)

The output would be:

start <Element data at 0x7fccf78cc518>
start <Element widgets at 0x7fccf78cc7e8>
start <Element widget at 0x7fccf78cc4d0>
end <Element widget at 0x7fccf78cc4d0>
start <Element widget at 0x7fccf78bdf80>
end <Element widget at 0x7fccf78bdf80>
start <Element widget at 0x7fccf78bdf38>
end <Element widget at 0x7fccf78bdf38>
end <Element widgets at 0x7fccf78cc7e8>
end <Element data at 0x7fccf78cc518>

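If I wanted to build a list of widgets for each location, I might respond to the start event of a widgets element by initializing a list, and then append each new widget to that list until I reach the end element, like this: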

from lxml import etree

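# Open in binary mode so the parser receives bytes
with open('data2.xml', 'rb') as fd:
    widgets = {}
    loc = None

    for event, element in etree.iterparse(fd, events=('start', 'end')):
        if event == 'start' and element.tag == 'widgets':
            # Attributes are already available at the start event
            loc = element.get('location')
            widgets[loc] = []
        elif event == 'end' and element.tag == 'widget':
            # The widget is complete here, so record its name
            widgets[loc].append(element.get('name'))

    print(widgets)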

The output of which is:

{'earth': ['gizmo', 'gadget', 'thingamajig']}

I hope this gives you an idea of how to handle each block of interest in your input file.
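Applied to the image blocks from the question, the same idea might look like the sketch below. It is only an illustration under several assumptions: it uses the standard library's xml.etree.ElementTree, it wraps the question's blocks in a hypothetical root element (the question does not show the real one), and the helper names block_to_dict and iter_blocks are invented here. Each finished image element is turned into a dictionary, and the root is cleared afterwards so that processed blocks do not pile up in memory.

```python
import io
import xml.etree.ElementTree as ET

# Sample blocks from the question, wrapped in an assumed <root> element.
SAMPLE = b"""<root>
<image>
  <ref>www.test.com</ref>
  <label/>
  <number>0</number>
  <ID>ID0</ID>
  <name>test1</name>
  <comment>
    <line number="0">This is a comment</line>
    <line number="1">This is also another comment</line>
  </comment>
  <creationDate>2017-02-13T15:46:16-04:00</creationDate>
</image>
<result>
  <ref>www.test1.com</ref>
  <number>001</number>
  <ID>RE1</ID>
</result>
<image>
  <ref>www.test3.com</ref>
  <label/>
  <number>1</number>
  <ID>ID1</ID>
  <value>10030</value>
  <name>test3</name>
  <comment>
    <line number="0">This is a comment3</line>
  </comment>
  <creationDate>2017-04-13T15:46:16-04:00</creationDate>
</image>
</root>"""


def block_to_dict(block):
    """Map each child tag to its text; <comment> becomes a list of line texts."""
    record = {}
    for child in block:
        if child.tag == 'comment':
            record['comment'] = [line.text for line in child.findall('line')]
        else:
            record[child.tag] = child.text  # None for empty elements such as <label/>
    return record


def iter_blocks(source, tag='image'):
    """Yield one dict per completed <tag> element, freeing memory as we go."""
    context = ET.iterparse(source, events=('start', 'end'))
    _, root = next(context)  # the first event is the start of the root element
    for event, element in context:
        if event == 'end' and element.tag == tag:
            yield block_to_dict(element)
            root.clear()  # discard blocks that have already been handled


blocks = list(iter_blocks(io.BytesIO(SAMPLE)))
print(blocks[0])
```

For a file of around 10 GB you would probably pass a file opened in binary mode and loop over iter_blocks directly rather than collecting everything with list(), so that only one block is held in memory at a time.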