Improving a text extraction routine for XML

Time: 2010-09-29 04:24:36

Tags: python xml xml-parsing

I have an XML file that contains a number of <TEXT> </TEXT> tags enclosing text.

<TEXT>

<!-- PJG STAG 4703 -->

<!-- PJG ITAG l=94 g=1 f=1 -->

<!-- PJG /ITAG -->

<!-- PJG ITAG l=69 g=1 f=1 -->

<!-- PJG /ITAG -->

<!-- PJG ITAG l=50 g=1 f=1 -->


<USDEPT>DEPARTMENT OF AGRICULTURE</USDEPT>

<!-- PJG /ITAG -->

<!-- PJG ITAG l=18 g=1 f=1 -->

<USBUREAU>Packers and Stockyards Administration</USBUREAU>
<!-- PJG 0012 frnewline -->

<!-- PJG /ITAG -->

<!-- PJG ITAG l=55 g=1 f=1 -->
Amendment to Certification of Central Filing System_Oklahoma
<!-- PJG 0012 frnewline -->

<!-- PJG 0012 frnewline -->

<!-- PJG /ITAG -->

<!-- PJG ITAG l=11 g=1 f=1 -->
The Statewide central filing system of Oklahoma has been previously certified, pursuant to section 1324 of the Food
Security Act of 1985, on the basis of information submitted by Hannah D. Atkins, Secretary of State, for farm products
produced in that State (52 FR 49056, December 29, 1987).
<!-- PJG 0012 frnewline -->

<!-- PJG 0012 frnewline -->
The certification is hereby amended on the basis of information submitted by John Kennedy, Secretary of State, for
additional farm products produced in that State as follows: Cattle semen, cattle embryos, milo.
<!-- PJG 0012 frnewline -->

<!-- PJG 0012 frnewline -->
This is issued pursuant to authority delegated by the Secretary of Agriculture.
<!-- PJG /ITAG -->

<!-- PJG QTAG 04 -->
<!-- PJG /QTAG -->

<!-- PJG 0012 frnewline -->

<!-- PJG 0012 frnewline -->

<!-- PJG ITAG l=21 g=1 f=1 -->

<!-- PJG /ITAG -->

<!-- PJG ITAG l=21 g=1 f=4 -->
Authority:
<!-- PJG /ITAG -->

<!-- PJG ITAG l=21 g=1 f=1 -->
 Sec. 1324(c)(2), Pub. L. 99-198, 99 Stat. 1535, 7 U.S.C. 1631(c)(2); 7 CFR 2.18(e)(3), 2.56(a)(3), 55 FR 22795.
<!-- PJG /ITAG -->

<!-- PJG QTAG 02 -->
<!-- PJG /QTAG -->

<!-- PJG 0012 frnewline -->

<!-- PJG 0012 frnewline -->

<!-- PJG ITAG l=21 g=1 f=1 -->
Dated: January 21, 1994
<!-- PJG 0012 frnewline -->

<!-- PJG 0012 frnewline -->

<!-- PJG 0012 frnewline -->

<!-- PJG /ITAG -->

<SIGNER>
<!-- PJG ITAG l=06 g=1 f=1 -->
Calvin W. Watkins, Acting Administrator,
<!-- PJG 0012 frnewline -->

<!-- PJG /ITAG -->
</SIGNER>
<SIGNJOB>
<!-- PJG ITAG l=04 g=1 f=1 -->
Packers and Stockyards Administration.
<!-- PJG 0012 frnewline -->

<!-- PJG 0012 frnewline -->

<!-- PJG /ITAG -->
</SIGNJOB>
<FRFILING>
<!-- PJG ITAG l=40 g=1 f=1 -->
[FR Doc. 94-1847 Filed 1-27-94; 8:45 am]
<!-- PJG 0012 frnewline -->

<!-- PJG /ITAG -->
</FRFILING>
<BILLING>
<!-- PJG ITAG l=68 g=1 f=1 -->
BILLING CODE 3410-KD-P
<!-- PJG /ITAG -->
</BILLING>

<!-- PJG 0012 frnewline -->

<!-- PJG 0012 frnewline -->

<!-- PJG /STAG -->
</TEXT>

My task is to extract the text from each TEXT node. This is what I am doing:

def getTextFromXML():
    global Text, xmlDoc
    TextNodes = xmlDoc.getElementsByTagName("TEXT")
    docstr = ''
    #Text = [TextFromNode(textNode) for textNode in TextNodes]
    for textNode in TextNodes:
        for cNode in textNode.childNodes:
            if cNode.nodeType == Node.TEXT_NODE:
                docstr+=cNode.data
            else:
                for ccNode in cNode.childNodes:
                    if ccNode.nodeType == Node.TEXT_NODE:
                        docstr+=ccNode.data                
        Text.append(docstr)

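The problem is that it takes a lot of time. I suspect my function is not efficient. Can anyone suggest how I can improve it?

Edit: the file I am processing contains 6000+ <TEXT> elements.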

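2 answers:

Answer 0 (score: 1)

lxml is easier to work with than the XML libraries included in the Python standard library. It is a binding for the C library libxml2, so I assume it is also faster.

I would do something like this (using your variable names):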

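from lxml import etree

# Open in binary mode so the parser can detect the encoding itself
with open('some-file.xml', 'rb') as f:
    xmlDoc = etree.parse(f)
root = xmlDoc.getroot()

Text = []
# 'TEXT' matches only direct children of the root element
for textNode in root.xpath('TEXT'):
    # Text directly inside <TEXT> plus text of its immediate child elements;
    # comment nodes are not text nodes, so they are skipped
    docstr = '\n'.join(text.strip() for text in textNode.xpath('*/text() | text()') if text.strip())
    Text.append(docstr)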
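
Independently of the parser, note that the loop in the question sets docstr = '' only once, before the outer loop, so every string appended to Text also contains the text of all earlier TEXT elements, and the repeated += grows that string further on each pass. With 6000+ elements this alone can account for much of the run time. The following is a minimal sketch of the same minidom approach with the accumulator reset per element and the pieces joined once; the function name and the two-element sample document are made up for illustration and are not from the question.

```python
from xml.dom import minidom
from xml.dom.minidom import Node


def get_text_from_xml(xml_doc):
    """Return one string per <TEXT> element (children and grandchildren only)."""
    texts = []
    for text_node in xml_doc.getElementsByTagName("TEXT"):
        parts = []  # fresh list for every <TEXT> element
        for child in text_node.childNodes:
            if child.nodeType == Node.TEXT_NODE:
                parts.append(child.data)
            else:
                # Comment nodes have no children, so this loop does nothing for them
                for grandchild in child.childNodes:
                    if grandchild.nodeType == Node.TEXT_NODE:
                        parts.append(grandchild.data)
        texts.append("".join(parts))  # join once instead of repeated +=
    return texts


# Hypothetical sample, much smaller than the real file
sample = "<DOCS><TEXT>a<!-- c --><X>b</X></TEXT><TEXT>c</TEXT></DOCS>"
texts = get_text_from_xml(minidom.parseString(sample))
print(texts)
```

For this sample the result holds one entry per TEXT element, and the second entry no longer repeats the text of the first.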

Answer 1 (score: 0)

If you use lxml (or xml.etree in Python 2.7), you can call the .itertext() method on an element, for example:

s = ''.join(elem.itertext())
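
As a rough illustration of the .itertext() approach with the standard library only, the sketch below parses a shortened, made-up excerpt in the style of the question's sample and collects one string per TEXT element. It assumes the default xml.etree.ElementTree parser, which drops comments while building the tree; the whitespace cleanup at the end is an added convenience, not part of the answer.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical excerpt in the style of the question's sample
sample = """<DOCS>
<TEXT>
<!-- PJG ITAG l=50 g=1 f=1 -->
<USDEPT>DEPARTMENT OF AGRICULTURE</USDEPT>
<!-- PJG /ITAG -->
Dated: January 21, 1994
</TEXT>
</DOCS>"""

root = ET.fromstring(sample)

texts = []
for elem in root.iter("TEXT"):
    # itertext() walks the whole subtree, at any nesting depth
    raw = "".join(elem.itertext())
    # Drop blank lines and surrounding whitespace left over from the markup
    lines = [line.strip() for line in raw.splitlines() if line.strip()]
    texts.append("\n".join(lines))

print(texts)
```

Unlike the two-level loop in the question, this picks up text at any depth below TEXT, which may or may not be what you want.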

With lxml you can probably also use the string() XPath function (likely faster, since all the work is done by libxml2 itself rather than in Python):

s = elem.xpath('string()')