ElementTree's iter() is skipping random elements

Time: 2015-08-06 06:08:35

Tags: python xml parsing text elementtree

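I am trying to parse an XML file using the iterparse() and iter() functions of Python's ElementTree. Here is a link to the file on Google Drive: https://drive.google.com/file/d/0B_S2Z7quow3TMl9yUk51ZzZ5UW8/view?usp=sharing

The XML file is a compilation of court case data. It is broken up into a series of elements tagged "n-document", each of which contains child elements holding the data for one particular court case. I am trying to extract all of the docket descriptions. A simplified version of the code follows: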

import numpy as np
import pandas as pd
import xml.etree.ElementTree as etree
import re
import csv

for event, elem in etree.iterparse("***fileName***", events=("start", "end")):
    if event == "start":
        if elem.tag == "docket.entry":
            for element in elem.iter():
                print(element.tag)
                if element.text is not None:
                    print(element.text)
                if element.tail is not None:
                    print(element.tail)
                    print("from tail")
    elem.clear()

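The problem is that in the first case (1613 HARVARD LIMITED PARTNERSHIP v. DISTRICT OF COLUMBIA et al), the docket description numbered 25 (entries are numbered in descending order) is missing the text and tail of the element tagged "gateway.image.link". Specifically, this is the output I get. I cancelled the build after about a second and scrolled up to the top of the console: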

docket.entry
number.block
number
28
image.block
image.gateway.link
gateway.image.link
date
07/19/2007
docket.description
ORDER GRANTING DEFENDANTS' MOTION TO DISMISS AND DENYING PLAINTIFF'S MOTION FOR LEAVE TO FILE A SECOND AMENDED COMPLAINT. SIGNED BY JUDGE RICHARD W. ROBERTS ON 7/19/07. (LCRWR1, ) (ENTERED: 07/19/2007)
docket.entry
number.block
number
27
image.block
image.gateway.link
gateway.image.link
date
07/19/2007
docket.description
MEMORANDUM OPINION. SIGNED BY JUDGE RICHARD W. ROBERTS ON 7/19/07. (LCRWR1) MODIFIED ON 7/19/2007 (LCRWR1, ). (ENTERED: 07/19/2007)
docket.entry
number.block
number
26
image.block
image.gateway.link
gateway.image.link
date
03/31/2007
docket.description
MEMORANDUM ORDER GRANTING DEFENDANTS' MOTION
image.gateway.link
21
gateway.image.link
21
 TO STAY DISCOVERY PENDING RESOLUTION OF DEFENDANTS' DISPOSITIVE MOTION FILED BY PATRICK J. CANAVAN, PAUL E. WATERS. SIGNED BY JUDGE RICHARD W. ROBERTS ON 3/31/07. (LCRWR1) ADDITIONAL ATTACHMENT(S) ADDED ON 4/3/2007 (LCRWR1, ). (ENTERED: 04/02/2007)
from tail
docket.entry
number.block
number
25
image.block
image.gateway.link
gateway.image.link
date
11/15/2005
docket.description
RESPONSE TO DEFENDANTS' NOTICE OF COURT RULING IN RELATED CASE FILED BY 1613 HARVARD LIMITED PARTNERSHIP. (ATTACHMENTS: #
image.gateway.link
docket.entry
number.block
number
24
image.block
image.gateway.link
gateway.image.link
date
11/14/2005
docket.description
NOTIFICATION OF SUPPLEMENTAL AUTHORITY BY DISTRICT OF COLUMBIA, PATRICK J. CANAVAN, PAUL E. WATERS (ATTACHMENTS: #
image.gateway.link
1
gateway.image.link
1
)(MULLEN, MARTHA) (ENTERED: 11/14/2005)
from tail

Under entry number 25 (second from the bottom of the output shown above), it says:

25
image.block
image.gateway.link
gateway.image.link
date
11/15/2005
docket.description
RESPONSE TO DEFENDANTS' NOTICE OF COURT RULING IN RELATED CASE FILED BY 1613 HARVARD LIMITED PARTNERSHIP. (ATTACHMENTS: #
image.gateway.link

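The problem is that if you look at the XML file itself, you will see that there is also an element tagged "gateway.image.link" immediately after "image.gateway.link", with both text and tail content, but for some reason iter() does not pick it up. The strange thing is that most of the other docket descriptions also have an element tagged "image.gateway.link" immediately followed by one tagged "gateway.image.link", and as you can see from entry number 24 (and the rest of them), iter() recognizes those but not this one. Here is an excerpt of the XML from the Google Drive file linked above: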

<?xml version="1.0" encoding="UTF-8" ?><n-extract-response>
<docket.entries.block><label>Entry #:</label><label>Date:</label><label>Description:</label><docket.entry><number.block><number>28</number><image.block><image.gateway.link casenumber="1:05cv00726" court="DCDCT-DW" image.ID="godls|0450912204;court=DCDCT-DW;casenumber=1:05cv00726" item.type="main" platform="ecf"></image.gateway.link><gateway.image.link ID="A1-280450912204" casenumber="1:05cv00726" court="DCDCT-DW" item.type="main" key="godls|0450912204;court=DCDCT-DW;casenumber=1:05cv00726" tlr-class="gateway-image-link" ttype="ecf"></gateway.image.link></image.block></number.block><date>07/19/2007</date><docket.description>ORDER GRANTING DEFENDANTS&apos; MOTION TO DISMISS AND DENYING PLAINTIFF&apos;S MOTION FOR LEAVE TO FILE A SECOND AMENDED COMPLAINT. SIGNED BY JUDGE RICHARD W. ROBERTS ON 7/19/07. (LCRWR1, ) (ENTERED: 07/19/2007)</docket.description></docket.entry><docket.entry><number.block><number>27</number><image.block><image.gateway.link casenumber="1:05cv00726" court="DCDCT-DW" image.ID="godls|04501909813;court=DCDCT-DW;casenumber=1:05cv00726" item.type="main" platform="ecf"></image.gateway.link><gateway.image.link ID="A2-2704501909813" casenumber="1:05cv00726" court="DCDCT-DW" item.type="main" key="godls|04501909813;court=DCDCT-DW;casenumber=1:05cv00726" tlr-class="gateway-image-link" ttype="ecf"></gateway.image.link></image.block></number.block><date>07/19/2007</date><docket.description>MEMORANDUM OPINION. SIGNED BY JUDGE RICHARD W. ROBERTS ON 7/19/07. (LCRWR1) MODIFIED ON 7/19/2007 (LCRWR1, ). 
(ENTERED: 07/19/2007)</docket.description></docket.entry><docket.entry><number.block><number>26</number><image.block><image.gateway.link casenumber="1:05cv00726" court="DCDCT-DW" image.ID="godls|04501672579;court=DCDCT-DW;casenumber=1:05cv00726" item.type="main" platform="ecf"></image.gateway.link><gateway.image.link ID="A4-2604501672579" casenumber="1:05cv00726" court="DCDCT-DW" item.type="main" key="godls|04501672579;court=DCDCT-DW;casenumber=1:05cv00726" tlr-class="gateway-image-link" ttype="ecf"></gateway.image.link></image.block></number.block><date>03/31/2007</date><docket.description>MEMORANDUM ORDER GRANTING DEFENDANTS&apos; MOTION<image.gateway.link casenumber="1:05CV00726" court="DCDCT-DW" image.id="godls|0450561212;court=DCDCT-DW;casenumber=1:05CV00726" item.type="ATTACHMENT" platform="ECF">21</image.gateway.link><gateway.image.link ID="B3-21-0450561212" casenumber="1:05CV00726" court="DCDCT-DW" item.type="ATTACHMENT" key="godls|0450561212;court=DCDCT-DW;casenumber=1:05CV00726" tlr-class="gateway-image-link" ttype="ECF">21</gateway.image.link> TO STAY DISCOVERY PENDING RESOLUTION OF DEFENDANTS&apos; DISPOSITIVE MOTION FILED BY PATRICK J. CANAVAN, PAUL E. WATERS. SIGNED BY JUDGE RICHARD W. ROBERTS ON 3/31/07. (LCRWR1) ADDITIONAL ATTACHMENT(S) ADDED ON 4/3/2007 (LCRWR1, ). 
(ENTERED: 04/02/2007)</docket.description></docket.entry><docket.entry><number.block><number>25</number><image.block><image.gateway.link casenumber="1:05cv00726" court="DCDCT-DW" image.ID="godls|04501577842;court=DCDCT-DW;casenumber=1:05cv00726" item.type="main" platform="ecf"></image.gateway.link><gateway.image.link ID="A6-2504501577842" casenumber="1:05cv00726" court="DCDCT-DW" item.type="main" key="godls|04501577842;court=DCDCT-DW;casenumber=1:05cv00726" tlr-class="gateway-image-link" ttype="ecf"></gateway.image.link></image.block></number.block><date>11/15/2005</date><docket.description>RESPONSE TO DEFENDANTS&apos; NOTICE OF COURT RULING IN RELATED CASE FILED BY 1613 HARVARD LIMITED PARTNERSHIP. (ATTACHMENTS: #<image.gateway.link casenumber="1:05CV00726" court="DCDCT-DW" image.id="godls|04511581037;court=DCDCT-DW;casenumber=1:05CV00726" item.type="ATTACHMENT" platform="ECF">1</image.gateway.link><gateway.image.link ID="B5-1-04511581037" casenumber="1:05CV00726" court="DCDCT-DW" item.type="ATTACHMENT" key="godls|04511581037;court=DCDCT-DW;casenumber=1:05CV00726" tlr-class="gateway-image-link" ttype="ECF">1</gateway.image.link> EXHIBIT 1 - NOTICE OF APPEAL)(WISE, RICHARD) (ENTERED: 11/15/2005)</docket.description></docket.entry><docket.entry><number.block><number>24</number><image.block><image.gateway.link casenumber="1:05cv00726" court="DCDCT-DW" image.ID="godls|04501579104;court=DCDCT-DW;casenumber=1:05cv00726" item.type="main" platform="ecf"></image.gateway.link><gateway.image.link ID="A8-2404501579104" casenumber="1:05cv00726" court="DCDCT-DW" item.type="main" key="godls|04501579104;court=DCDCT-DW;casenumber=1:05cv00726" tlr-class="gateway-image-link" ttype="ecf"></gateway.image.link></image.block></number.block><date>11/14/2005</date><docket.description>NOTIFICATION OF SUPPLEMENTAL AUTHORITY BY DISTRICT OF COLUMBIA, PATRICK J. CANAVAN, PAUL E. 
WATERS (ATTACHMENTS: #<image.gateway.link casenumber="1:05CV00726" court="DCDCT-DW" image.id="godls|04511577643;court=DCDCT-DW;casenumber=1:05CV00726" item.type="ATTACHMENT" platform="ECF">1</image.gateway.link><gateway.image.link ID="B7-1-04511577643" casenumber="1:05CV00726" court="DCDCT-DW" item.type="ATTACHMENT" key="godls|04511577643;court=DCDCT-DW;casenumber=1:05CV00726" tlr-class="gateway-image-link" ttype="ECF">1</gateway.image.link>)(MULLEN, MARTHA) (ENTERED: 11/14/2005)</docket.description></docket.entry></docket.entries.block>
</n-extract-response>

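When I run my Python script on the excerpt pasted above, it picks up the missing element. But when I run the script on the entire XML file, it does not, as shown earlier. Obviously the excerpt is missing a lot of the elements above and below it, but I don't see how that would affect iter(), since I have not split up any "docket.entry" element or its children, and that is what the for loop in my code should go through each time (I think).

The problem is not limited to entry number 25. Some of the other extracted docket descriptions are missing a child element here and there, but I cannot discern any pattern. I cannot even tell which difference between entry 25 and entry 24 causes the problem. Can anyone help?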

3 answers:

Answer 0 (score: 1)

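You are trying to process an element's children in the start event, but because of the way iterparse works, there is no guarantee that they have been read yet.

The documentation has a note about this:

  Note: iterparse() only guarantees that it has seen the ">" character of a starting tag when it emits a "start" event, so the attributes are defined, but the contents of the text and tail attributes are undefined at that point. The same applies to the element children; they may or may not be present.

  If you need a fully populated element, look for "end" events instead.

If you want to process an element's children, you need to do it on the end event; otherwise there is no guarantee about which parts of the element's content are available.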
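A minimal sketch of the "end"-event approach described above. The SAMPLE data is a hypothetical, trimmed-down stand-in for the real file, and iter_entries is an illustrative helper name, not something from the question:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-in for the real docket file.
SAMPLE = (
    b"<docket.entries.block>"
    b"<docket.entry><number.block><number>25</number></number.block>"
    b"<docket.description>RESPONSE (ATTACHMENTS: #"
    b"<image.gateway.link>1</image.gateway.link>"
    b"<gateway.image.link>1</gateway.image.link>"
    b" EXHIBIT 1)</docket.description>"
    b"</docket.entry>"
    b"</docket.entries.block>"
)


def iter_entries(source):
    """Yield a list of (tag, text, tail) tuples per finished docket.entry."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "docket.entry":
            continue
        # On "end" the whole subtree of this element has been parsed.
        rows = [(e.tag, e.text, e.tail) for e in elem.iter()]
        elem.clear()  # release memory only after the entry was read
        yield rows


entries = list(iter_entries(io.BytesIO(SAMPLE)))
for row in entries[0]:
    print(row)
```

Note that elem.clear() is called only for a completed docket.entry, after its contents have been copied out. Clearing on every event, as in the question, can also discard data that has not been examined yet.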

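You can read more about why nothing is guaranteed here:

  Note: The tree builder and the event generator are not necessarily synchronized; the latter usually lags behind a bit. This means that when you get a "start" event for an element, the builder may already have filled that element with content. You cannot rely on this, though: a "start" event should only be used to inspect the attributes, not the element contents. For more details, see this message.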
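The effect can be reproduced deterministically with XMLPullParser (available in Python 3.4 and later), which lets the caller decide where the input is cut. This is only a sketch: the chunk boundary is chosen by hand here, whereas iterparse reads the file incrementally and the boundaries fall wherever its internal reads happen to end, which would explain why a short excerpt parses fine while the full file loses children at seemingly random entries.

```python
import xml.etree.ElementTree as ET

parser = ET.XMLPullParser(events=("start", "end"))

# First chunk: the input is cut right after the opening tag.
parser.feed("<docket.entry>\n")
first_events = list(parser.read_events())
start_event, entry = first_events[0]
children_at_start = len(entry)  # nothing below the tag was parsed yet

# Second chunk: the rest of the element arrives.
parser.feed("<number>25</number></docket.entry>")
later_events = [(event, elem.tag) for event, elem in parser.read_events()]
children_at_end = len(entry)
number_text = entry.find("number").text
parser.close()

print(start_event, children_at_start)
print(later_events)
print(children_at_end, number_text)
```

The same Element object is empty when its "start" event is handled and populated once its "end" event has been seen.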

Answer 1 (score: 0)

getchildren() is deprecated since version 2.7: use list(elem) or iterate over the element instead.
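A short illustration of the two replacements, using a made-up docket.description fragment (the data is an assumption for the example, not taken from the real file):

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment shaped like a docket description.
description = ET.fromstring(
    "<docket.description>(ATTACHMENTS: #"
    "<image.gateway.link>1</image.gateway.link>"
    "<gateway.image.link>1</gateway.image.link>)"
    "</docket.description>"
)

# Option 1: materialize the direct children as a list.
children = list(description)

# Option 2: iterate over the element itself.
child_tags = [child.tag for child in description]

print(len(children), child_tags)
```

Both forms visit only direct children; elem.iter() is the call to use when the whole subtree is wanted.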

Answer 2 (score: -1)

Maybe you could parse the XML file according to its logical structure, so that you control exactly how each element is handled. E.g.:

import xml.etree.ElementTree as ET

# Load the whole document into memory, then walk it entry by entry.
tree = ET.parse(r'<xml file name>')
root = tree.getroot()
docket_entries = root.findall('.//docket.entry')
for entry in docket_entries:
    number = entry.find('.//number')
    print(number.text)
    description = entry.find('docket.description')
    print(description.text)
    # Iterate over the element directly; getchildren() is deprecated.
    for child in description:
        print(child)
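Since description.text only holds the text before the first child element, one possible way to recover a full docket description is itertext(), which walks the text and tail of every nested element in document order. The sketch below assumes a hypothetical, shortened sample rather than the real file:

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened sample modeled on entry 24 from the question.
SAMPLE = (
    "<docket.entries.block><docket.entry>"
    "<number.block><number>24</number></number.block>"
    "<docket.description>NOTIFICATION (ATTACHMENTS: #"
    "<image.gateway.link>1</image.gateway.link>"
    "<gateway.image.link>1</gateway.image.link>"
    ")(MULLEN, MARTHA)</docket.description>"
    "</docket.entry></docket.entries.block>"
)

root = ET.fromstring(SAMPLE)
descriptions = {}
for entry in root.iter("docket.entry"):
    number = entry.findtext(".//number")
    description = entry.find("docket.description")
    # Concatenate text and tail of the element and all its descendants.
    descriptions[number] = "".join(description.itertext())

print(descriptions)
```

Because both link elements carry the same attachment number as text, the number appears twice in the joined string; skipping one of the two tags while joining would avoid that.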