I'm trying to write a small function that wraps the implicit sections of an HTML document in section tags. I'm trying to do this with lxml.etree.
My input is:
<html>
<head></head>
<body>
<h1>title</h1>
<p>some text</p>
<h1>title</h1>
<p>some text</p>
</body>
</html>
I'd like to end up with:
<html>
<head></head>
<body>
<section>
<h1>title</h1>
<p>some text</p>
</section>
<section>
<h1>title</h1>
<p>some text</p>
</section>
</body>
</html>
This is what I have so far:
def outline(tree):
    pattern = re.compile('^h(\d)')
    section = None
    for child in tree.iterchildren():
        tag = child.tag
        if tag is lxml.etree.Comment:
            continue
        match = pattern.match(tag.lower())
        # If a header tag is found
        if match:
            depth = int(match.group(1))
            if section is not None:
                child.addprevious(section)
            section = lxml.etree.Element('section')
            section.append(child)
        else:
            if section is not None:
                section.append(child)
            else:
                pass
        if child is not None:
            outline(child)
I call it like this:
outline(tree.find('body'))
But currently it doesn't work for sub-headings, e.g.:
<section>
<h1>ONE</h1>
<section>
<h3>TOO Deep</h3>
</section>
<section>
<h2>Level 2</h2>
</section>
</section>
<section>
<h1>TWO</h1>
</section>
Thanks.
Answer 0 (score: 1)
When transforming XML, XSLT is the best approach; see the lxml and XSLT docs.
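As a rough illustration of that direction, here is a minimal sketch using lxml.etree.XSLT. It assumes XSLT 1.0 (which is what lxml supports) and only handles the flat case from the question, where every section starts at an h1 directly under body; nested sub-headings would need more work. The key name by-h1 and the variable names are made up for this example.

```python
import lxml.etree as etree

# Stylesheet: copy everything as-is, but regroup the children of <body>
# so that each <h1> and the siblings following it end up in one <section>.
STYLESHEET = """
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Index every non-h1 child of body by the id of its nearest preceding h1. -->
  <xsl:key name="by-h1" match="body/*[not(self::h1)]"
           use="generate-id(preceding-sibling::h1[1])"/>

  <!-- Identity template for everything outside body. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="body">
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <!-- Elements that come before the first h1 stay unwrapped. -->
      <xsl:copy-of select="*[not(self::h1) and not(preceding-sibling::h1)]"/>
      <xsl:for-each select="h1">
        <section>
          <xsl:copy-of select=". | key('by-h1', generate-id())"/>
        </section>
      </xsl:for-each>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
"""

SOURCE = """<html>
<head></head>
<body>
<h1>title</h1>
<p>some text</p>
<h1>title</h1>
<p>some text</p>
</body>
</html>"""

transform = etree.XSLT(etree.XML(STYLESHEET))
result = transform(etree.fromstring(SOURCE))
body = result.getroot().find('body')
print(etree.tostring(body, pretty_print=True).decode())
```

The grouping relies on xsl:key plus generate-id(), a common XSLT 1.0 idiom for collecting the siblings that belong to a given heading.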
This is only a pointer in the direction you asked about; let me know if you need further help writing the XSLT.
Answer 1 (score: 0)
For the record, this is the code I ended up with:
import re

import lxml.etree


def outline(tree, level=0):
    pattern = re.compile(r'^h(\d)')
    last_depth = None
    sections = []  # pairs of [header, <section />]
    # Iterate over a snapshot, since children are moved into sections below.
    for child in list(tree.iterchildren()):
        tag = child.tag
        if tag is lxml.etree.Comment:
            continue
        match = pattern.match(tag.lower())
        #print("%s%s" % (level * ' ', child))
        if match:
            depth = int(match.group(1))
            # Test for None first: comparing an int with None raises TypeError on Python 3.
            if last_depth is None or depth <= last_depth:
                #print("%ssection %d" % (level * ' ', depth))
                last_depth = depth
                sections.append([child, lxml.etree.Element('section')])
                continue
        if sections:
            sections[-1][1].append(child)
    # Recurse into each section for deeper headers, then swap it in for its header.
    for section in sections:
        outline(section[1], level=((level + 1) * 4))
        section[0].addprevious(section[1])
        section[1].insert(0, section[0])
It works well for me.
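For comparison, the same nesting can probably also be built in a single pass by keeping a stack of the sections that are currently open. This is only a sketch of that alternative, not the code above: the function name wrap_sections and the sample markup are made up for the example, and it has only been reasoned through for simple inputs like the sub-heading case in the question.

```python
import re

import lxml.etree as etree

HEADING = re.compile(r'^h([1-6])$')


def wrap_sections(parent):
    """Wrap each heading and the siblings after it in nested <section> elements."""
    # Stack of (depth, section) pairs for the sections that are still open.
    stack = []
    for child in list(parent):  # snapshot: children are moved while looping
        tag = child.tag
        # Comments and processing instructions have a non-string tag.
        match = HEADING.match(tag.lower()) if isinstance(tag, str) else None
        if match:
            depth = int(match.group(1))
            # A heading closes every open section at the same or a deeper level.
            while stack and stack[-1][0] >= depth:
                stack.pop()
            section = etree.Element('section')
            if stack:
                # Nest inside the innermost section that is still open.
                stack[-1][1].append(section)
            else:
                # Top level: put the section where the heading currently sits.
                child.addprevious(section)
            section.append(child)
            stack.append((depth, section))
        elif stack:
            # Non-heading content belongs to the innermost open section.
            stack[-1][1].append(child)


body = etree.fromstring(
    '<body><h1>ONE</h1><h3>TOO Deep</h3><h2>Level 2</h2>'
    '<h1>TWO</h1><p>text</p></body>'
)
wrap_sections(body)
print(etree.tostring(body, pretty_print=True).decode())
```

Unlike the recursive version, this one never revisits a section after building it; the trade-off is that the stack has to be unwound by hand whenever a shallower heading appears.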