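I know this question has been asked before, and I know regex is not the right tool for handling XML. Nevertheless, I am using it in a Python utility, via re.compile(...).subn(...), to replace some XML content without parsing the XML itself, because that XML content is embedded in files written in a proprietary/legacy language. XML tools are therefore not an option, and regex is the last resort before I consider writing a dedicated algorithm.
I need to replace something (an attribute value) contained in an element. For example, from: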
<Tag>
bla bla
<SomethingElse AnAttribute="YEAH"/>
bla bla
</Tag>
to:
<Tag>
bla bla
<SomethingElse AnAttribute="AH,NO!!"/>
bla bla
</Tag>
To perform the match, I tried a simple pattern (using the non-greedy operator):
<Tag>(.*?)AnAttribute="(.*?)"(.*?)</Tag>
and the same non-greedy operator combined with a negative lookahead:
<Tag>(?!Tag)(.*?)CurrencyCode="(.*?)"(?!Tag)(.*?)</Tag>
Both work for simple cases (the second one handles "false positives"), but I still cannot handle the following (very simple!) case:
<Tag></Tag>
bla bla
<SomethingElse AnAttribute="YEAH"/>
bla bla
<Tag></Tag>
because in this case AnAttribute is actually found (and it should not be, since it is not inside the element)!
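A minimal reproduction of the false positive, as a sketch rather than my actual utility. It assumes Python 3 and that the pattern is compiled with re.DOTALL so that `.` can span newlines (the flag is an assumption; without it the pattern could not match multi-line content at all). The names `pattern`, `text` and `captured_value` are illustrative only.

```python
import re

# Simple non-greedy pattern from above; re.DOTALL is assumed so '.' matches newlines.
pattern = re.compile(r'<Tag>(.*?)AnAttribute="(.*?)"(.*?)</Tag>', re.DOTALL)

# The problematic input: the attribute sits between two empty <Tag> elements.
text = '''<Tag></Tag>
bla bla
<SomethingElse AnAttribute="YEAH"/>
bla bla
<Tag></Tag>'''

match = pattern.search(text)

# Non-greedy only means "as short as possible from this starting point":
# the match still starts at the first <Tag>, skips over its </Tag>,
# and ends at the last </Tag>, so the outside attribute is captured.
captured_value = match.group(2)
print(captured_value)
```

The first group ends up swallowing the closing `</Tag>` of the first element, which is why a single pattern spanning both the tag and the attribute is hard to get right.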
Answer 0 (score: 0)
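I think you need to solve the problem in two steps:
1. Extract the content of the external tag.
2. Replace the attribute values within that extracted content only.
This can be done nicely using re.sub with a closure passed as the replacement argument.
The following uses partial to pass extra parameters to the closure and builds the required regexes dynamically: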
import re
from functools import partial

text = u'''
[...]
'''

# The key is the external tag to extract
# The value a list of attributes whose content has to be replaced
sub_dict = {"RoomRatesWithoutServices": ['CurrencyCode1', 'CurrencyCode2'],
            "AnotherTag": ['AnotherAttr']}
replacement = '_REPLACED_'

def closure(attr, replacement, m):
    # Step 2: replace the value of any listed attribute inside the matched block.
    # Python's re needs a fixed-width look-behind, so the attribute names
    # listed for one tag must all have the same length.
    attr_pattern = '(?<=(?:%s)=")[^"]+(?=")' % attr
    return re.sub(attr_pattern, replacement, m.group())

for ext_tag, attr_list in sub_dict.items():
    attr = r"|".join(attr_list)
    # Step 1: match each <ext_tag>...</ext_tag> block; (?s) lets '.' span newlines.
    tag_pattern = r"(?s)<%s>.*?</%s>" % (ext_tag, ext_tag)
    text = re.sub(tag_pattern, partial(closure, attr, replacement), text)

print(text)
The output is as follows:
'<RoomRatesWithoutServices>&
</RoomRatesWithoutServices>&
<TotalBeforeTaxPayHotel AmountAfterTax="560.00" CurrencyCode="EUR"/>&
<TotalBeforeTaxPayHotel AmountAfterTax="560.00" CurrencyCode="EUR"/>&
<RoomRatesWithoutServices>&
</RoomRatesWithoutServices>&'
'<RoomRatesWithoutServices>&
<TotalBeforeTaxPayHotel AmountAfterTax="560.00" CurrencyCode1="_REPLACED_"/>&
<TotalBeforeTaxPayHotel AmountAfterTax="560.00" CurrencyCode2="_REPLACED_"/>&
<RoomRatesWithoutServices>&
</RoomRatesWithoutServices>&'
'<RoomRatesWithoutServices>&
</RoomRatesWithoutServices>&
<TotalBeforeTaxPayHotel AmountAfterTax="560.00" CurrencyCode="EUR"/>&
<TotalBeforeTaxPayHotel AmountAfterTax="560.00" CurrencyCode="EUR"/>&
<RoomRatesWithoutServices>&
</RoomRatesWithoutServices>&'
'<AnotherTag>&
<TotalBeforeTaxPayHotel AmountAfterTax="560.00" AnotherAttr="_REPLACED_"/>&
<TotalBeforeTaxPayHotel AmountAfterTax="560.00" AnotherAttr="_REPLACED_"/>&
<AnotherTag>&
</AnotherTag>&'
Try the online DEMO
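Since the input text above is elided, here is a self-contained sketch of the same two-step idea applied to the `<Tag>` / `AnAttribute` example from the question. It assumes Python 3. The helper names `replace_attrs` and `replace_in_tag` are hypothetical, and it swaps the look-behind for a capture group so that attribute names of different lengths can be mixed; that is a variation on the code above, not the original approach.

```python
import re
from functools import partial

def replace_attrs(attr_names, replacement, match):
    # Step 2: inside the matched element only, rewrite the listed attributes.
    names = "|".join(re.escape(name) for name in attr_names)
    attr_pattern = r'\b(%s)="[^"]*"' % names
    # A function replacement avoids backslash interpretation in `replacement`.
    return re.sub(attr_pattern,
                  lambda m: '%s="%s"' % (m.group(1), replacement),
                  match.group())

def replace_in_tag(text, tag, attr_names, replacement):
    # Step 1: find each <tag>...</tag> block; (?s) lets '.' match newlines.
    tag_pattern = r"(?s)<%s>.*?</%s>" % (re.escape(tag), re.escape(tag))
    return re.sub(tag_pattern,
                  partial(replace_attrs, attr_names, replacement),
                  text)

# Attribute inside the element: should be replaced.
inside = ('<Tag>\nbla bla\n<SomethingElse AnAttribute="YEAH"/>\n'
          'bla bla\n</Tag>')
# Attribute between two empty elements: should be left alone.
outside = ('<Tag></Tag>\nbla bla\n<SomethingElse AnAttribute="YEAH"/>\n'
           'bla bla\n<Tag></Tag>')

new_inside = replace_in_tag(inside, "Tag", ["AnAttribute"], "AH,NO!!")
new_outside = replace_in_tag(outside, "Tag", ["AnAttribute"], "AH,NO!!")
print(new_inside)
print(new_outside)
```

Because the outer pattern stops at the first closing tag, the empty `<Tag></Tag>` pairs are matched on their own and the attribute between them is never handed to the inner substitution. The sketch does not handle elements of the same name nested inside each other, nor opening tags that carry attributes.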