Python ElementTree iterparse: filtering nodes and their children

Time: 2015-01-31 15:02:15

Tags: python iterparse celementtree

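I am trying to use ElementTree's iterparse function to filter nodes based on their text and write them to a new file. I am using iterparse because the input file is large (100+ MB).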

input.xml:

<xmllist>
        <page id="1">
        <title>movie title 1</title>
        <text>this is a movie in theatres</text>
        </page>
        <page id="2">
        <title>movie title 2</title>
        <text>this is a horror film</text>
        </page>
        <page id="3">
        <title></title>
        <text>actor in film</text>
        </page>
        <page id="4">
        <title>some other topic</title>
        <text>nothing related</text>
        </page>
</xmllist>

Expected output (all pages whose text contains "film" or "movie"):

<xmllist>
        <page id="1">
        <title>movie title 1</title>
        <text>this is a movie in theatres</text>
        </page>
        <page id="2">
        <title>movie title 2</title>
        <text>this is a horror film</text>
        </page>
        <page id="3">
        <title></title>
        <text>actor in film</text>
        </page>
</xmllist>

Current code:

import xml.etree.cElementTree as etree
from xml.etree.cElementTree import dump

output_file=open('/tmp/outfile.xml','w')

for event, elem in iter(etree.iterparse("/tmp/test.xml", events=('start','end'))):
    if event == "end" and elem.tag == "page": #need to add condition to search for strings
        output_file.write(elem)
        elem.clear()

How can I add a regular expression to filter on the page's text attribute?

1 answer:

Answer 0 (score: 2)

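You are looking at a child element, not an attribute, so the simplest approach is to analyze the title as it "goes by" in the iteration and remember the result until you reach the end of the enclosing page: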

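import re

good_page = False
for event, elem in etree.iterparse("/tmp/test.xml", events=('start','end')):
    if event == 'end':
        if elem.tag == 'title':
            # elem.text is None for an empty <title>, so fall back to ''
            good_page = re.search(r'film|movie', elem.text or '')
        elif elem.tag == 'page':
            if good_page:
                # write() needs a string, not an Element (encoding='unicode' assumes Python 3)
                output_file.write(etree.tostring(elem, encoding='unicode'))
            good_page = False
            elem.clear()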

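re.search returns None when it finds no match, and if treats None as false, so pages without a title, and pages whose title does not match the regular expression you want, are not written.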
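Note that the loop only inspects <title>, so page 3 in the sample input (empty title, matching text) would be skipped even though the expected output includes it. A minimal sketch of a variant that checks both <title> and <text> follows, assuming Python 3 and the plain xml.etree.ElementTree module. The names filter_pages and PATTERN are illustrative and do not come from the question or the answer.

```python
import io
import re
import xml.etree.ElementTree as ET

PATTERN = re.compile(r"film|movie")  # same alternation as in the answer


def filter_pages(source, out):
    """Copy <page> elements whose <title> or <text> matches PATTERN to `out`.

    `source` is a filename or a binary file object; `out` is a text stream.
    """
    out.write("<xmllist>\n")
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "page":
            # findtext gives None for a missing child and "" for an empty one
            fields = (elem.findtext("title"), elem.findtext("text"))
            if any(PATTERN.search(field or "") for field in fields):
                elem.tail = "\n"  # put each written page on its own line
                out.write(ET.tostring(elem, encoding="unicode"))
            root.clear()  # drop pages already handled so memory stays bounded
    out.write("</xmllist>\n")
```

It could be called as filter_pages("/tmp/test.xml", output_file) with output_file opened in text mode. Clearing the root after each page, rather than only the page element, also discards the empty page shells that would otherwise pile up under the root on a 100+ MB input.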