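Is there a recursive way to lower the level of every heading in an HTML tree using Python's ElementTree? In the example below, h1 should become h2, and the other headings should shift down in the same way.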
#! /usr/bin/env python
import html5lib
import xml.etree.ElementTree as ET
headings = '''<h1>Title</h1>
<h2>Sub Title</h2>
<h3>Sub sub title 1</h3>
<h3>Sub sub title 2</h3>
<h4>Sub sub sub title</h4>
<h3>Sub sub title</h3>
'''
tree = html5lib.parse(headings, namespaceHTMLElements=False)
Answer 0 (score: 2)
Here is a working example, although it uses the excellent BeautifulSoup library instead of ElementTree:
import re
from bs4 import BeautifulSoup
headings = '''<h1>Title</h1>
<h2>Sub Title</h2>
<h3>Sub sub title 1</h3>
<h3>Sub sub title 2</h3>
<h4>Sub sub sub title</h4>
<h3>Sub sub title</h3>
'''
soup = BeautifulSoup(headings, "html.parser")
pattern = re.compile(r"^h(\d)$")
for tag in soup.find_all(pattern):
    tag.name = "h%d" % (int(pattern.match(tag.name).group(1)) + 1)
print(soup)
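We look for all elements whose tag name matches the pattern ^h(\d)$ (h followed by a single digit; ^ marks the start of the string and $ the end). We then extract the digit, increase it by one, and update the tag name.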
Prints:
<h2>Title</h2>
<h3>Sub Title</h3>
<h4>Sub sub title 1</h4>
<h4>Sub sub title 2</h4>
<h5>Sub sub sub title</h5>
<h4>Sub sub title</h4>
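The tag-name rewrite described above can be tried on its own, without BeautifulSoup. This is only a minimal sketch: the names HEADING_PATTERN and demoted_name are made up for illustration and are not part of the answer or of any library.

```python
import re

# Same pattern as in the answer: "h" followed by exactly one digit
HEADING_PATTERN = re.compile(r"^h(\d)$")

def demoted_name(tag_name):
    """Return the tag name one heading level lower, or the name unchanged."""
    match = HEADING_PATTERN.match(tag_name)
    if match is None:
        # Not a heading ("p", "hr", "html", ...), so leave it alone
        return tag_name
    return "h%d" % (int(match.group(1)) + 1)

for name in ["h1", "h3", "p", "hr", "html"]:
    print(name, "->", demoted_name(name))
```

Note that the pattern also matches h6, which would become h7. That is not a valid HTML element, so a cap may be needed if the document goes that deep.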
Answer 1 (score: 1)
Assigning element.tag = newtag does the job. All that is needed is to add one to the heading level.
#! /usr/bin/env python
import html5lib
import xml.etree.ElementTree as ET
headings = '''<h1>Title</h1>
<h2>Sub Title</h2>
<h3>Sub sub title 1</h3>
<h3>Sub sub title 2</h3>
<h4>Sub sub sub title</h4>
<h3>Sub sub title</h3>
<p>paragraph</p>
'''
tree = html5lib.parse(headings, namespaceHTMLElements=False)
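# Walk the whole tree and keep only the heading elements
headings = [el for el in tree.iter() if el.tag in ["h1", "h2", "h3", "h4", "h5", "h6"]]
for h in headings:
    # "h1" -> "h2": keep the letter and add one to the level digit
    newtag = h.tag[0] + str(int(h.tag[-1]) + 1)
    h.tag = newtag
print(ET.tostring(tree, encoding="unicode"))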
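Since the question asked for a recursive approach, here is a rough sketch of one that uses only the standard library. It makes a few assumptions: the sample markup is well-formed and wrapped in a body element so that ET.fromstring can parse it, and demote_headings, HEADING_TAGS and max_level are illustrative names rather than anything from ElementTree itself. It also caps the level at h6, which the code above does not do.

```python
import xml.etree.ElementTree as ET

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

def demote_headings(element, max_level=6):
    """Recursively lower every heading at or below element by one level, in place."""
    if element.tag in HEADING_TAGS:
        level = int(element.tag[1])
        # Cap at max_level so that h6 does not turn into the invalid h7
        element.tag = "h%d" % min(level + 1, max_level)
    for child in element:
        demote_headings(child, max_level)

# Well-formed sample, wrapped in a root element so that ET.fromstring accepts it
sample = "<body><h1>Title</h1><div><h2>Sub Title</h2><h6>Deep</h6></div><p>text</p></body>"
root = ET.fromstring(sample)
demote_headings(root)
result = ET.tostring(root, encoding="unicode")
print(result)
```

The same function should also work on the tree returned by html5lib.parse(..., namespaceHTMLElements=False), since that call returns an ElementTree element with plain tag names by default.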