Replace all HTML tag attributes using regex

Asked: 2017-02-24 11:46:04

Tags: python html regex python-2.7

I am trying to figure out how to add an attribute id=ID_<number> to every tag in an HTML snippet and remove all of the other attributes.

For example:

<div class="...">...</div>

becomes:

<div id="DIV_1">...</div>

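DIV is the upper-cased tag name and _1 is the tag's position in the ordering. So if this <div> is the second tag, it gets the id DIV_2. The ordering is depth-first (DFS), so if <div id="DIV_2">..</div> has children, as in <div id="DIV_2"><ul class=".." style="..">...</ul></div>, the ul tag gets the id UL_3.

I tried to find all of the tags first, then strip their attributes and add the ids one by one.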

re.findall(r'<([a-z][a-z0-9]*)\b[^>]*>',snippet)

finds all of the tags. My idea is:

for i, tag in enumerate(tags, 1):
    remove_all_attributes_from_tag
    get the name of the tag and set the attribute id="{}_{}".format(tag_name.upper(), i)

But I can't figure out how to proceed.

Snippet:

<div id="wtab" class="pd_cont" style="display: table;"><div class="pd_colmn"><h4>Display</h4><span>5.20-inch</span></div><div class="pd_colmn"><h4>Processor</h4><span>2GHz octa-core</span></div><div class="pd_colmn"><h4>Front Camera</h4><span>8-megapixel</span></div><div class="pd_colmn"><h4>Resolution</h4><span>1080x1920 pixels</span></div><div class="pd_colmn"><h4>RAM</h4><span>3GB</span></div><div class="pd_colmn"><h4>OS</h4><span>Android 6.0</span></div><div class="pd_colmn"><h4>Storage</h4><span>32GB</span></div><div class="pd_colmn"><h4>Rear Camera</h4><span>16-megapixel</span></div><div class="pd_colmn"><h4>Battery Capacity</h4><span>2650mAh</span></div></div>

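1 Answer:

Answer 0 (score: 1)

First, replace the attributes of every tag with the id structure and a unique placeholder. Second, replace the placeholders one by one in a loop.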

Code

import re

html_orig = '<div id="wtab" class="pd_cont" style="display: table;"><div class="pd_colmn"><h4>Display</h4><span>5.20-inch</span></div><div class="pd_colmn"><h4>Processor</h4><span>2GHz octa-core</span></div><div class="pd_colmn"><h4>Front Camera</h4><span>8-megapixel</span></div><div class="pd_colmn"><h4>Resolution</h4><span>1080x1920 pixels</span></div><div class="pd_colmn"><h4>RAM</h4><span>3GB</span></div><div class="pd_colmn"><h4>OS</h4><span>Android 6.0</span></div><div class="pd_colmn"><h4>Storage</h4><span>32GB</span></div><div class="pd_colmn"><h4>Rear Camera</h4><span>16-megapixel</span></div><div class="pd_colmn"><h4>Battery Capacity</h4><span>2650mAh</span></div></div>'

# Step 1: drop every opening tag's attributes and insert an id made of the
# upper-cased tag name plus the placeholder "!id!".
html_edit = re.sub(r'<(\w+)[^>]*>',
                   lambda m: '<{0} id="{1}_!id!">'.format(m.group(1), m.group(1).upper()),
                   html_orig)

# Step 2: replace the placeholders one at a time with a running counter.
i = 1
while True:
    sub = re.subn('!id!', str(i), html_edit, count=1)
    if sub[1] == 0:  # nothing left to replace
        break
    html_edit = sub[0]
    i += 1

re.subn() returns a tuple (new_string, number_of_substitutions); the loop stops once the count is 0.

Result

'<div id="DIV_1"><div id="DIV_2"><h4 id="H4_3">Display</h4><span id="SPAN_4">5.20-inch</span></div><div id="DIV_5"><h4 id="H4_6">Processor</h4><span id="SPAN_7">2GHz octa-core</span></div><div id="DIV_8"><h4 id="H4_9">Front Camera</h4><span id="SPAN_10">8-megapixel</span></div><div id="DIV_11"><h4 id="H4_12">Resolution</h4><span id="SPAN_13">1080x1920 pixels</span></div><div id="DIV_14"><h4 id="H4_15">RAM</h4><span id="SPAN_16">3GB</span></div><div id="DIV_17"><h4 id="H4_18">OS</h4><span id="SPAN_19">Android 6.0</span></div><div id="DIV_20"><h4 id="H4_21">Storage</h4><span id="SPAN_22">32GB</span></div><div id="DIV_23"><h4 id="H4_24">Rear Camera</h4><span id="SPAN_25">16-megapixel</span></div><div id="DIV_26"><h4 id="H4_27">Battery Capacity</h4><span id="SPAN_28">2650mAh</span></div></div>'
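
The placeholder loop above rescans the string once per tag. As an alternative, here is a minimal single-pass sketch: re.sub() accepts a function as the replacement, so the id can be built from the matched tag name and a running counter in one go. The names OPEN_TAG and renumber_tags are my own, not from the question, and like any regex approach this assumes simple markup (no ">" inside attribute values, no comments or scripts). For arbitrary HTML a real parser would be the safer choice.

```python
import itertools
import re

# Matches an opening tag: group 1 is the tag name, [^>]* swallows whatever
# attributes follow. Closing tags such as "</div>" do not match because "/"
# is not a word character.
OPEN_TAG = re.compile(r'<(\w+)[^>]*>')


def renumber_tags(html):
    """Replace every opening tag's attributes with id="<TAGNAME>_<n>"."""
    counter = itertools.count(1)  # numbering restarts at 1 on every call

    def repl(match):
        name = match.group(1)
        return '<{0} id="{1}_{2}">'.format(name, name.upper(), next(counter))

    return OPEN_TAG.sub(repl, html)


snippet = '<div class="a"><ul class="b" style="c"><li>x</li></ul></div><p>y</p>'
print(renumber_tags(snippet))
# <div id="DIV_1"><ul id="UL_2"><li id="LI_3">x</li></ul></div><p id="P_4">y</p>
```

Opening tags appear in the source text in the same order a depth-first traversal visits them, so numbering the matches from left to right gives the DFS ordering asked for in the question.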