Python: parsing XML with a regular expression

Time: 2015-02-26 16:49:08

Tags: python regex xml python-2.7 xml-parsing

Goal: match the strings listed in a file against the strings in an XML file, and replace each matching element with a comment.

cat File.txt

RHO_BID_RT
RHO_ASK_RT

XML file contents

<field name="RHO_BID_RT" type="float" id="0x01D3" sequence="1"/>
<field name="RHO_ASK_RT" type="float" id="0x01D4" sequence="1"/>

Expected result in the XML contents

 <!-- Removed RHO_BID_RT-->
 <!-- Removed RHO_ASK_RT-->

CODE

import re

word_file = 'File.txt'
xml_file  = '../file.xml'

with open(word_file) as words:
    regex = r'<[^>]+ *field name="({})"[^>]+>'.format(
        '|'.join(word.rstrip() for word in words)
    )

with open(xml_file) as xml:
    for line in xml:
        line = re.sub(regex, r'<!!-- REMOVED \1 -->', line.rstrip())
        print(line)
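
With the sample XML above, the pattern in this attempt cannot match: `[^>]+` requires at least one character between `<` and `field`, and `<!!--` does not open an XML comment. The following is a minimal sketch of how the regex approach might be adjusted, not a general XML solution; the function name `comment_out` is hypothetical, and it assumes each `<field .../>` tag sits on a single line with `name` as its first attribute.

```python
import re


def comment_out(lines, words):
    """Replace each <field .../> tag whose name is in `words` with a comment."""
    # re.escape guards against regex metacharacters in the words;
    # empty entries are skipped so the alternation never matches an empty name.
    names = '|'.join(re.escape(w) for w in words if w)
    # \s* (instead of [^>]+) lets the tag start directly with "<field".
    pattern = re.compile(r'<\s*field\s+name="({})"[^>]*>'.format(names))
    return [pattern.sub(r'<!-- Removed \1-->', line) for line in lines]


sample = [
    '<field name="RHO_BID_RT" type="float" id="0x01D3" sequence="1"/>',
    '<field name="OTHER" type="float"/>',
]
for line in comment_out(sample, ['RHO_BID_RT', 'RHO_ASK_RT']):
    print(line)
```

A line-based regex like this breaks as soon as a tag spans several lines or the attribute order changes.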

1 answer:

Answer 0 (score: 1)

Use an XML parser, for example lxml.

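The idea is to read the list of words and construct an XPath expression that checks whether the name attribute equals one of those words. Then replace each matching element with a comment by calling replace():
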
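from lxml import etree
from lxml.etree import Comment


# Read the names to remove, skipping blank lines.
with open('words.txt') as f:
    words = [line.strip() for line in f if line.strip()]

# Builds e.g. //field[@name = "A" or @name = "B"]
xpath = '//field[{}]'.format(" or ".join(['@name = "%s"' % word for word in words]))

tree = etree.parse('input.xml')

for element in tree.xpath(xpath):
    comment = Comment('REMOVED')
    comment.tail = element.tail  # keep the whitespace that followed the element
    # Replace via the element's own parent so nested <field> elements work too.
    element.getparent().replace(element, comment)

print(etree.tostring(tree, encoding='unicode'))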

For the following contents of input.xml:

<fields>
    <field name="RHO_BID_RT" type="float" id="0x01D3" sequence="1"/>
    <field name="RHO_ASK_RT" type="float" id="0x01D4" sequence="1"/>
</fields>

words.txt

RHO_BID_RT
RHO_ASK_RT

it prints:

<fields>
    <!--REMOVED-->
    <!--REMOVED-->
</fields>
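
To make the XPath construction in the first snippet concrete, here is a small standalone sketch that only builds the expression and prints it. The helper name `build_xpath` is hypothetical and is not part of lxml.

```python
def build_xpath(words):
    """Build an XPath selecting <field> elements whose name is one of `words`."""
    # One attribute test per word, joined with XPath's "or" operator.
    conditions = ' or '.join('@name = "{}"'.format(word) for word in words)
    return '//field[{}]'.format(conditions)


print(build_xpath(['RHO_BID_RT', 'RHO_ASK_RT']))
# //field[@name = "RHO_BID_RT" or @name = "RHO_ASK_RT"]
```

Note that an empty word list would produce the invalid expression `//field[]`, and a word containing a double quote would break the quoting, so this string-building approach assumes a non-empty list of plain identifiers.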

Alternatively, build a set of words and check the value of the name attribute in a loop:

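from lxml import etree
from lxml.etree import Comment


# A set gives constant-time membership tests.
with open('words.txt') as f:
    words = set(line.strip() for line in f)

tree = etree.parse('input.xml')

for element in tree.xpath('//field[@name]'):
    if element.attrib['name'] in words:
        comment = Comment('REMOVED')
        comment.tail = element.tail  # keep the whitespace that followed the element
        # Replace via the element's own parent so nested <field> elements work too.
        element.getparent().replace(element, comment)

print(etree.tostring(tree, encoding='unicode'))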
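
If lxml is not available, the same set-based idea can be sketched with the standard library's `xml.etree.ElementTree`. This is an illustration under assumptions rather than a drop-in replacement: it assumes Python 3, the function name `comment_out_fields` is hypothetical, and the comment text includes the field name to match the expected result stated in the question. ElementTree has no `getparent()`, so the sketch walks every element and inspects its direct children.

```python
import xml.etree.ElementTree as ET


def comment_out_fields(xml_text, words):
    """Replace every <field> whose name is in `words` with an XML comment."""
    root = ET.fromstring(xml_text)
    # Snapshot the elements first so the tree is not mutated while iterating it.
    for parent in list(root.iter()):
        for index, child in enumerate(list(parent)):
            if child.tag == 'field' and child.get('name') in words:
                comment = ET.Comment(' Removed {}'.format(child.get('name')))
                comment.tail = child.tail  # keep indentation and newlines
                parent[index] = comment
    return ET.tostring(root, encoding='unicode')


xml_text = '''<fields>
    <field name="RHO_BID_RT" type="float" id="0x01D3" sequence="1"/>
    <field name="RHO_ASK_RT" type="float" id="0x01D4" sequence="1"/>
</fields>'''
print(comment_out_fields(xml_text, {'RHO_BID_RT', 'RHO_ASK_RT'}))
```

The default ElementTree parser drops comments that were already present in the input, so lxml remains the better fit when existing comments must survive.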