I am trying to figure out how to add the attribute id=ID_<number> to every tag in an HTML snippet and remove all other attributes.
For example:
<div class="...">...</div>
becomes:
<div id="DIV_1">...</div>
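DIV is the upper-case tag name and _1 is the tag's position in the ordering. So if this <div> were the second tag, it would get the ID DIV_2. The ordering is depth-first (DFS), so if <div id="DIV_2">..</div> has a child, as in <div id="DIV_2"><ul class=".." style="..">...</ul></div>, the ul tag gets the id UL_3.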
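For a snippet like this one, the depth-first (pre-order) numbering described above is the same as the order in which opening tags appear in the source. The following is a minimal sketch of the numbering alone, assuming the standard-library html.parser module is acceptable; the names TagLabeler and label_tags are hypothetical and not part of the original post.

```python
from html.parser import HTMLParser


class TagLabeler(HTMLParser):
    """Collect TAG_<n> labels in the order opening tags are encountered."""

    def __init__(self):
        super().__init__()
        self.labels = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case; the number is simply
        # the count of opening tags seen so far, starting at 1.
        self.labels.append("{}_{}".format(tag.upper(), len(self.labels) + 1))


def label_tags(snippet):
    """Return the list of labels for every opening tag in the snippet."""
    parser = TagLabeler()
    parser.feed(snippet)
    parser.close()
    return parser.labels


print(label_tags('<div><p>a</p></div><div><ul><li>x</li></ul></div>'))
# ['DIV_1', 'P_2', 'DIV_3', 'UL_4', 'LI_5']
```

The nested ul and li are numbered before any later sibling would be, which is what the DFS requirement asks for.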
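I tried finding all the tags, then removing their attributes and adding the IDs one by one.
re.findall(r'<([a-z][a-z0-9]*)\b[^>]*>', snippet)
finds all the tags. My idea is:
for i, tag in enumerate(tags, 1):
    remove_all_attributes_from_tag
    get the name of the tag and set the attribute "{}_{}".format(tag_name.upper(), i)
But I can't figure out how to continue.
Snippet: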
<div id="wtab" class="pd_cont" style="display: table;"><div class="pd_colmn"><h4>Display</h4><span>5.20-inch</span></div><div class="pd_colmn"><h4>Processor</h4><span>2GHz octa-core</span></div><div class="pd_colmn"><h4>Front Camera</h4><span>8-megapixel</span></div><div class="pd_colmn"><h4>Resolution</h4><span>1080x1920 pixels</span></div><div class="pd_colmn"><h4>RAM</h4><span>3GB</span></div><div class="pd_colmn"><h4>OS</h4><span>Android 6.0</span></div><div class="pd_colmn"><h4>Storage</h4><span>32GB</span></div><div class="pd_colmn"><h4>Rear Camera</h4><span>16-megapixel</span></div><div class="pd_colmn"><h4>Battery Capacity</h4><span>2650mAh</span></div></div>
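Answer 0 (score: 1)
First, replace all attributes of every tag with the id structure and a unique placeholder. In a second step, replace the placeholders one by one in a loop.
Code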
import re
html_orig = '<div id="wtab" class="pd_cont" style="display: table;"><div class="pd_colmn"><h4>Display</h4><span>5.20-inch</span></div><div class="pd_colmn"><h4>Processor</h4><span>2GHz octa-core</span></div><div class="pd_colmn"><h4>Front Camera</h4><span>8-megapixel</span></div><div class="pd_colmn"><h4>Resolution</h4><span>1080x1920 pixels</span></div><div class="pd_colmn"><h4>RAM</h4><span>3GB</span></div><div class="pd_colmn"><h4>OS</h4><span>Android 6.0</span></div><div class="pd_colmn"><h4>Storage</h4><span>32GB</span></div><div class="pd_colmn"><h4>Rear Camera</h4><span>16-megapixel</span></div><div class="pd_colmn"><h4>Battery Capacity</h4><span>2650mAh</span></div></div>'
# Step 1: drop every attribute and insert an id containing the !id! placeholder.
html_edit = re.sub(r'(<[\w\d]+)(\s?[\w\d\s=;"_:]*)(>)',
                   r'\g<1> id="DIV_!id!"\g<3>', html_orig)
# Step 2: replace the placeholders one at a time with a running counter.
i = 1
while True:
    sub = re.subn('!id!', str(i), html_edit, count=1)
    if sub[1] == 0:  # no placeholder left
        break
    html_edit = sub[0]
    i += 1
re.subn() returns a tuple of the new string and the number of substitutions made, which provides the break condition for the loop. Note that the replacement string hard-codes the DIV_ prefix, so every tag gets a DIV_<n> id regardless of its name.
Result
'<div id="DIV_1"><div id="DIV_2"><h4 id="DIV_3">Display</h4><span id="DIV_4">5.20-inch</span></div><div id="DIV_5"><h4 id="DIV_6">Processor</h4><span id="DIV_7">2GHz octa-core</span></div><div id="DIV_8"><h4 id="DIV_9">Front Camera</h4><span id="DIV_10">8-megapixel</span></div><div id="DIV_11"><h4 id="DIV_12">Resolution</h4><span id="DIV_13">1080x1920 pixels</span></div><div id="DIV_14"><h4 id="DIV_15">RAM</h4><span id="DIV_16">3GB</span></div><div id="DIV_17"><h4 id="DIV_18">OS</h4><span id="DIV_19">Android 6.0</span></div><div id="DIV_20"><h4 id="DIV_21">Storage</h4><span id="DIV_22">32GB</span></div><div id="DIV_23"><h4 id="DIV_24">Rear Camera</h4><span id="DIV_25">16-megapixel</span></div><div id="DIV_26"><h4 id="DIV_27">Battery Capacity</h4><span id="DIV_28">2650mAh</span></div></div>'
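The pattern above writes the same DIV_ prefix for every tag, whereas the question asks for each tag's own upper-case name. One possible variant, sketched here rather than taken from the answer, passes a function to re.sub so that the tag name and a running counter are combined in a single pass. It reuses the opening-tag pattern from the question; the names OPEN_TAG and renumber_ids are hypothetical, and the sketch assumes no attribute value contains a ">" character.

```python
import itertools
import re

# Opening-tag pattern from the question: group 1 is the tag name, and
# [^>]* swallows any attributes up to the closing ">".
OPEN_TAG = re.compile(r'<([a-z][a-z0-9]*)\b[^>]*>', re.IGNORECASE)


def renumber_ids(snippet):
    """Strip all attributes and give each opening tag the id NAME_<n>."""
    counter = itertools.count(1)

    def replace(match):
        name = match.group(1)
        # Rebuild the tag with only the new id attribute.
        return '<{} id="{}_{}">'.format(name, name.upper(), next(counter))

    return OPEN_TAG.sub(replace, snippet)


sample = ('<div id="wtab" class="pd_cont" style="display: table;">'
          '<div class="pd_colmn"><h4>Display</h4><span>5.20-inch</span></div>'
          '</div>')
print(renumber_ids(sample))
# <div id="DIV_1"><div id="DIV_2"><h4 id="H4_3">Display</h4><span id="SPAN_4">5.20-inch</span></div></div>
```

Closing tags are left untouched because the pattern requires a letter directly after "<". A self-closing tag such as <br/> would lose its slash, and tag-like text inside comments or scripts would also be rewritten, so for arbitrary HTML a real parser is likely the safer choice.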