Question

I want to add a depth to each node, and for that I came up with the following recursive function:

import lxml.html

def add_depth(node, depth = 0):
    node.depth = depth
    print(node.tag, node.depth)
    for n in node.iterchildren(): 
        add_depth(n , depth + 1)

html = """<html>
            <body>
              <div>
                <a></a>
                <h1></h1>
              </div>
            </body>
          </html>"""

tree = lxml.html.fromstring(html)

add_depth(tree)

for x in tree.iter():
    print(x)
    if not hasattr(x, 'depth'):
        print('this should not happen', x)

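I figured this was one of the cheapest ways to add depth: it runs once, gives every element a depth, and visits each element only once.

The problem is that it doesn't seem to stick... it's as if the depth does not stay on the elements. Could it be that iterating an lxml tree generates objects on the spot, so the added depth does not stick?

What is happening here, and what is the cheapest way to give all elements a depth?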

Breakthrough

Using the following:

def add_depth(node, depth = 0, maxd = None):
    node.depth = depth
    if maxd is None:
        maxd = []
    maxd.append((node, node.depth)) 
    for n in node.iterchildren(): 
        add_depth(n , depth + 1, maxd)
    return maxd

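suddenly it DOES work. This code builds one huge list of all elements with their depth next to them (so I can sort on it). Even when iterating over the original tree, this time the elements do have a depth. But this is not efficient at all, and I don't understand why it works.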

@Maximoo

tree.depth = 0
for x in tree.iter(): 
    if x.getparent() is not None:
        x.depth = x.getparent().depth + 1

AttributeError: 'HtmlElement' object has no attribute 'depth'

Answer 1

There are a few issues here.

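The first is that you are trying to make your recursive function update the original tree as a side effect. I don't think that is possible, at least not with plain Python attributes.
The second is that you don't need Python attributes; use XML attributes instead, which you access through x.attrib.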
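
If writing to the tree is the concern in the first point, one option is to leave the tree alone and keep the depths in a side table. This is only a sketch: the name depths is made up for illustration, and the explanation rests on my understanding that lxml element objects are Python proxies that may be recreated once nothing references them, which would explain why plain Python attributes disappear.

```python
import lxml.html

html = "<html><body><div><a></a><h1></h1></div></body></html>"
tree = lxml.html.fromstring(html)

# Side table: element -> depth (hypothetical name, not part of lxml).
# Holding the elements as dict keys also keeps their Python objects
# alive for as long as the dict exists.
depths = {}
for el in tree.iter():  # document order, so a parent is seen before its children
    parent = el.getparent()
    # The root has no parent (None is not in the dict), so it gets depth 0.
    depths[el] = depths[parent] + 1 if parent in depths else 0

for el, d in depths.items():
    print(el.tag, d)
```

This still visits each element once, and nothing extra shows up when the tree is serialized.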

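One possible piece of code is the following (it is a bit awkward because I keep converting between int and string, since XML attributes cannot be integers). It does not use recursion, but I think that would be overkill anyway: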

tree.attrib['depth'] = '0'
for x in tree.iter():
    if 'depth' not in x.attrib:
        x.attrib['depth'] = str(int(x.getparent().attrib['depth']) + 1)


print(lxml.html.tostring(tree).decode())

<html depth="0">
            <body depth="1">
              <div depth="2">
                <a depth="3"></a>
                <h1 depth="3"></h1>
              </div>
            </body>
          </html>
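
For comparison, the recursive function from the question can probably be adapted the same way, by writing an XML attribute instead of a Python attribute. A minimal sketch, assuming the tree contains only element nodes (no comments or processing instructions) and that a "depth" attribute in the serialized output is acceptable:

```python
import lxml.html

def add_depth(node, depth=0):
    # XML attribute values must be strings, so convert the int on the way in.
    node.set("depth", str(depth))
    for child in node.iterchildren():
        add_depth(child, depth + 1)

html = "<html><body><div><a></a><h1></h1></div></body></html>"
tree = lxml.html.fromstring(html)
add_depth(tree)

# Read the values back from a fresh iteration, converting to int again.
result = [(el.tag, int(el.get("depth"))) for el in tree.iter()]
print(result)
```

Because the value is stored in the tree itself rather than on the Python object, it should still be there on a later iteration; the cost is the str/int conversion on every read and write.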
