Question

I'm trying to write a small function that wraps the implicit sections of an HTML document in section tags, using lxml.etree.

My input is:

<html>
    <head></head>
    <body>
        <h1>title</h1>
        <p>some text</p>
        <h1>title</h1>
        <p>some text</p>
    </body>
</html>

I'd like to end up with:

<html>
    <head></head>
    <body>
        <section>
            <h1>title</h1>
            <p>some text</p>
        </section>
        <section>
            <h1>title</h1>
            <p>some text</p>
        </section>
    </body>
</html>

Here's what I have so far:

import re
import lxml.etree

def outline(tree):
    pattern = re.compile(r'^h(\d)')
    section = None

    for child in tree.iterchildren():
        tag = child.tag

        if tag is lxml.etree.Comment:
            continue

        match = pattern.match(tag.lower())

        # If a header tag is found
        if match:
            depth = int(match.group(1))

            if section is not None:
                child.addprevious(section)

            section = lxml.etree.Element('section')
            section.append(child)

        else:
            if section is not None:
                section.append(child)
            else:
                pass

        if child is not None:
            outline(child)

I call it with:

 outline(tree.find('body'))

But at the moment it doesn't work for sub-headings, for example:

<section>
    <h1>ONE</h1>
    <section>
        <h3>TOO Deep</h3>
    </section>
    <section>
        <h2>Level 2</h2>
    </section>
</section>
<section>
    <h1>TWO</h1>
</section>

Thanks.

Answer 1

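For transforming XML, XSLT is the best approach; see the lxml and XSLT docs.

This is only a pointer in the requested direction. Let me know if you need further help writing the XSLT.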
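To illustrate the XSLT route, here is a minimal sketch and not a complete solution. It assumes the flat input from the question (only `h1` headings directly under `body`) and does not handle nested sub-headings. The key name `by-heading` and the variable names are made up for this example; `lxml.etree.XSLT` is the class lxml provides for applying stylesheets.

```python
import lxml.etree as etree

# XSLT 1.0 stylesheet: group every non-h1 child of <body> under the
# nearest preceding h1 sibling, then emit one <section> per h1.
STYLESHEET = """
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:key name="by-heading" match="body/*[not(self::h1)]"
           use="generate-id(preceding-sibling::h1[1])"/>

  <!-- Identity template: copy everything that is not handled below -->
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>

  <xsl:template match="body">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <!-- Content that appears before the first heading stays unwrapped -->
      <xsl:copy-of select="*[not(self::h1)][not(preceding-sibling::h1)]"/>
      <xsl:for-each select="h1">
        <section>
          <xsl:copy-of select="."/>
          <xsl:copy-of select="key('by-heading', generate-id())"/>
        </section>
      </xsl:for-each>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
"""

transform = etree.XSLT(etree.XML(STYLESHEET))

doc = etree.fromstring(
    "<html><head></head><body>"
    "<h1>title</h1><p>some text</p>"
    "<h1>title</h1><p>some text</p>"
    "</body></html>"
)

result = transform(doc)
print(etree.tostring(result, pretty_print=True).decode())
```

Handling `h2` to `h6` nesting in XSLT 1.0 needs one grouping pass per heading level, which is why a procedural approach may end up being simpler for arbitrary depth.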

Answer 2

Here's the code I ended up with, for the record:

import re
import lxml.etree

def outline(tree, level=0):
    pattern = re.compile(r'^h(\d)')
    last_depth = None
    sections = [] # [header, <section />]

    for child in tree.iterchildren():
        tag = child.tag

        if tag is lxml.etree.Comment:
            continue

        match = pattern.match(tag.lower())
        #print("%s%s" % (level * ' ', child))

        if match:
            depth = int(match.group(1))

            # Check for None first; comparing an int with None raises TypeError on Python 3
            if last_depth is None or depth <= last_depth:
                #print("%ssection %d" % (level * ' ', depth))
                last_depth = depth

                sections.append([child, lxml.etree.Element('section')])
                continue

        if sections:
            sections[-1][1].append(child)

    for section in sections:
        outline(section[1], level=((level + 1) * 4))
        section[0].addprevious(section[1])
        section[1].insert(0, section[0])

It works fine for me.
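For comparison, the same nesting can also be built in a single pass with a stack of open sections instead of recursion. This is an alternative sketch, not the code from the answer above; the function name `outline_with_stack` and the `HEADING` pattern are made up for this example, and it assumes headings are direct children of the element passed in.

```python
import re
import lxml.etree as etree

HEADING = re.compile(r'^h([1-6])$')

def outline_with_stack(parent):
    """Wrap each heading and the content after it in <section>, nested by depth."""
    stack = []  # (depth, section) pairs for the sections that are still open

    # Iterate over a snapshot, because children are moved while looping
    for child in list(parent):
        tag = child.tag
        # Comments and processing instructions have a non-string tag
        match = HEADING.match(tag.lower()) if isinstance(tag, str) else None

        if match:
            depth = int(match.group(1))
            # Close every open section at the same or a deeper level
            while stack and stack[-1][0] >= depth:
                stack.pop()

            section = etree.Element('section')
            if stack:
                # Sub-heading: nest inside the closest shallower section
                stack[-1][1].append(section)
            else:
                # Top-level heading: the section takes its place in the parent
                child.addprevious(section)
            section.append(child)
            stack.append((depth, section))
        elif stack:
            # Non-heading content belongs to the innermost open section
            stack[-1][1].append(child)

body = etree.fromstring(
    "<body><h1>ONE</h1><h3>TOO Deep</h3><h2>Level 2</h2><h1>TWO</h1></body>"
)
outline_with_stack(body)
print(etree.tostring(body, pretty_print=True).decode())
```

With the sub-heading example from the question, both `h3` and `h2` end up as sibling sections inside the first `h1` section, because the `h3` section is closed as soon as the shallower `h2` appears.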
