Turning an HTML document into a document outline tree

Time: 2018-07-25 09:10:58

Tags: php python html dom tree

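In PHP or Python, how would you turn an HTML document from the wild into a tree structure in which each node's attributes are a heading of the document and the paragraphs below it, and each node's parent is the node for the next higher-level heading?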

For example, given the hierarchy h1 > h2 > h3 > h4 > h5 > h6 > ul and the following document:

<h1>This is heading 1</h1>
<h2>This is heading 2</h2>
<h3>This is heading 3</h3>
<h4>This is heading 4</h4>
    <p>This is a paragraph.</p>
    <p>This is another paragraph.</p>
<h5>This is heading 5</h5>
<h6>This is heading 6</h6>

<h3>Another heading 3</h3>
<h2>Another heading 2</h2>
<ul>
  <li>Coffee</li>
  <li>Tea</li>
  <li>Milk</li>
</ul>
<p>One more paragraph.</p>

The output tree would look something like this:

<root>
    <node>
        <element>This is heading 1</element>
        <body></body>
        <node>
            <element>This is heading 2</element>
            <body></body>
            <node>
                <element>This is heading 3</element>
                <body></body>
                <node>
                    <element>This is heading 4</element>
                    <body><![CDATA[
                        <p>This is a paragraph.</p>
                        <p>This is another paragraph.</p>
                       ]]>
                    </body>
                    <node>
                        <element>This is heading 5</element>
                        <body></body>
                        <node>
                            <element>This is heading 6</element>
                            <body></body>
                        </node>
                    </node>
                </node>
            </node>
            <node>
                <element>Another heading 3</element>
                <body></body>
            </node>
        </node>
        <node>
            <element>Another heading 2</element>
            <body></body>
            <node>
                <element></element>
                <body><![CDATA[
                  <li>Coffee</li>
                  <li>Tea</li>
                  <li>Milk</li>
                 ]]>
                </body>
                <node>
                    <element></element>
                    <body><![CDATA[
                        <p>One more paragraph.</p>
                     ]]>
                    </body>
                </node>
            </node>
        </node>
    </node>
</root>

The output does not have to be XML; it can be an object (PHP or Python) that is navigated with methods such as next(), previous(), and children().
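To make the object form concrete, here is a rough Python sketch of one possible approach, using only the standard library's `html.parser`. It is not a complete solution for arbitrary real-world markup. The names `Node`, `OutlineBuilder`, `build_outline`, `RANK`, and `VOID` are made up for illustration. It assumes the input is a fragment like the sample above: headings and lists are top-level siblings (no `<html>`, `<body>`, or `<div>` wrappers) and tags are balanced. Body markup is rebuilt from parser events, so character entities come out decoded and whitespace-only text is dropped.

```python
from html.parser import HTMLParser

# Tags that open an outline node, ranked as in h1 > ... > h6 > ul.
RANK = {"h1": 1, "h2": 2, "h3": 3, "h4": 4, "h5": 5, "h6": 6, "ul": 7}
# Void elements never get an end tag, so they must not change the depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class Node:
    """One outline entry: heading text, body markup, and child nodes."""

    def __init__(self, rank=0, parent=None):
        self.title = ""
        self.body = ""
        self.rank = rank
        self.parent = parent
        self._children = []
        if parent is not None:
            parent._children.append(self)

    def children(self):
        return list(self._children)

    def _sibling(self, offset):
        if self.parent is None:
            return None
        siblings = self.parent._children
        i = siblings.index(self) + offset
        return siblings[i] if 0 <= i < len(siblings) else None

    def next(self):
        return self._sibling(1)

    def previous(self):
        return self._sibling(-1)


class OutlineBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node()
        self.current = self.root   # node that receives body content
        self.depth = 0             # number of currently open elements
        self.in_heading = False    # True while inside a top-level h1..h6
        self.block_closed = False  # True right after a top-level </ul>

    def _append(self, text):
        # Loose content after a closed <ul> starts an anonymous child node,
        # mirroring the sample output in the question.
        if self.block_closed:
            self.current = Node(self.current.rank + 1, self.current)
            self.block_closed = False
        self.current.body += text

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            if not self.in_heading:
                self._append(self.get_starttag_text())
            return
        if self.depth == 0 and tag in RANK:
            # Climb to the nearest ancestor that outranks the new node.
            parent = self.current
            while parent.rank >= RANK[tag]:
                parent = parent.parent
            self.current = Node(RANK[tag], parent)
            self.in_heading = tag != "ul"
            self.block_closed = False
        elif not self.in_heading:
            self._append(self.get_starttag_text())
        self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        self.depth -= 1
        if self.depth == 0 and tag in RANK:
            self.block_closed = tag == "ul"
            self.in_heading = False
        elif not self.in_heading:
            self._append(f"</{tag}>")

    def handle_data(self, data):
        if self.in_heading:
            self.current.title += data
        elif data.strip():
            self._append(data)


def build_outline(html):
    builder = OutlineBuilder()
    builder.feed(html)
    builder.close()  # flush any buffered trailing text
    return builder.root
```

Calling `build_outline(document)` returns an untitled root node; `root.children()` holds the h1 nodes, and each node exposes `title`, `body`, `parent`, `children()`, `next()`, and `previous()`. Markup with wrapper elements or unclosed tags would need a more forgiving parser in front of this.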

0 answers:

No answers