Demoting HTML headings with Python ElementTree

Time: 2016-03-14 14:16:27

Tags: python html xml elementtree

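Is there a recursive way to lower the level of every heading in an HTML tree using Python ElementTree? In the example below, h1 would become h2, and the other headings would shift down in the same way.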

#! /usr/bin/env python
import html5lib
import xml.etree.ElementTree as ET

headings = '''<h1>Title</h1>
<h2>Sub Title</h2>
<h3>Sub sub title 1</h3>
<h3>Sub sub title 2</h3>
<h4>Sub sub sub title</h4>
<h3>Sub sub title</h3>
'''
tree = html5lib.parse(headings, namespaceHTMLElements=False)

2 answers:

Answer 0 (score: 2)

Here is a working example, although it uses the excellent BeautifulSoup library rather than ElementTree:

import re
from bs4 import BeautifulSoup

headings = '''<h1>Title</h1>
<h2>Sub Title</h2>
<h3>Sub sub title 1</h3>
<h3>Sub sub title 2</h3>
<h4>Sub sub sub title</h4>
<h3>Sub sub title</h3>
'''

soup = BeautifulSoup(headings, "html.parser")
pattern = re.compile(r"^h(\d)$")
for tag in soup.find_all(pattern):
    tag.name = "h%d" % (int(pattern.match(tag.name).group(1)) + 1)

print(soup)

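We find every element whose tag name matches the pattern ^h(\d)$ (an h followed by a single digit; ^ anchors the start of the string and $ the end). We then extract the digit, add one to it, and update the tag name.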

Prints:

<h2>Title</h2>
<h3>Sub Title</h3>
<h4>Sub sub title 1</h4>
<h4>Sub sub title 2</h4>
<h5>Sub sub sub title</h5>
<h4>Sub sub title</h4>
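
The renaming step in the loop above can be pulled out into a small helper, which makes it easier to see what the pattern accepts. Note that ^h(\d)$ also matches names such as h0 or h9, and that an h6 would be renamed to h7, which HTML does not define. The following is a minimal sketch using only the standard re module; the function name demote_heading_name, the narrowed pattern, and the decision to cap at h6 are assumptions, not part of the original answer.

```python
import re

# Same idea as the answer's pattern, narrowed to the six valid heading levels.
HEADING = re.compile(r"^h([1-6])$")

def demote_heading_name(name, max_level=6):
    """Return the tag name one heading level lower, or the name unchanged.

    Hypothetical helper: non-heading names pass through untouched, and
    the result is capped at max_level because HTML has no h7.
    """
    match = HEADING.match(name)
    if match is None:
        return name
    level = min(int(match.group(1)) + 1, max_level)
    return "h%d" % level

for name in ("h1", "h6", "header"):
    print(name, "->", demote_heading_name(name))
```

With BeautifulSoup, the loop body would then become tag.name = demote_heading_name(tag.name).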

Answer 1 (score: 1)

Assigning element.tag = newtag does the job. All that is needed is to add one to the heading level.

#! /usr/bin/env python
import html5lib
import xml.etree.ElementTree as ET

headings = '''<h1>Title</h1>
<h2>Sub Title</h2>
<h3>Sub sub title 1</h3>
<h3>Sub sub title 2</h3>
<h4>Sub sub sub title</h4>
<h3>Sub sub title</h3>
<p>paragraph</p>
'''

tree = html5lib.parse(headings, namespaceHTMLElements=False)
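# Collect every heading element; iter() walks the whole tree recursively.
heading_tags = {"h1", "h2", "h3", "h4", "h5", "h6"}
heading_elements = [el for el in tree.iter() if el.tag in heading_tags]

for h in heading_elements:
    # Keep the leading "h" and add one to the level digit.
    # Note: an h6 would become h7, which is not a valid HTML element.
    h.tag = h.tag[0] + str(int(h.tag[-1]) + 1)

# Serialize the whole tree (Python 3); tostring() takes an element, not a list.
print(ET.tostring(tree, encoding="unicode"))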
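
Since the question asks about ElementTree specifically, the same idea can be sketched with the standard library alone. This assumes the input is well-formed XML; real-world HTML would still need a lenient parser such as html5lib, as in the answer above. The function name demote_headings, its shift and max_level parameters, and the wrapping div are illustrative assumptions. Element.iter() already visits every descendant, so no explicit recursion is required.

```python
import xml.etree.ElementTree as ET

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

def demote_headings(root, shift=1, max_level=6):
    """Rename every h1-h6 under root in place, lowering it by `shift` levels.

    Hypothetical helper: levels are capped at max_level because HTML has no h7.
    """
    # Materialize the list first so tags are not changed mid-iteration.
    for el in list(root.iter()):
        if el.tag in HEADING_TAGS:
            level = min(int(el.tag[1]) + shift, max_level)
            el.tag = "h%d" % level
    return root

# Well-formed fragment wrapped in a single root element so fromstring() accepts it.
fragment = "<div><h1>Title</h1><h2>Sub Title</h2><p>paragraph</p><h6>Deep</h6></div>"
root = demote_headings(ET.fromstring(fragment))
print(ET.tostring(root, encoding="unicode"))
```

Capping at h6 is one possible policy; depending on the document, it may be preferable to turn an overflowing heading into a styled paragraph instead.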