In the DOM, how do I find the rightmost element to the left of a given element that matches a criterion, using lxml or XPath?

Asked: 2012-05-14 16:11:35

Tags: html dom xpath lxml tree-search

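I am working on a function that determines whether the content of a given HTML element in an lxml ElementTree is the leading content of a line in the rendered HTML page. To do this, I am trying to find the rightmost block-level element to the left of el, and then determine whether there is any content between the two.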

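I think this could be done with a traversal in the reverse of DFS order, walking backwards from el. But I have also been looking for a simpler way to do it with lxml or XPath. So far I have found ways to locate elements matching some criteria among a given element's ancestors or siblings, but nothing that works across the entire tree to the right (or left) of a particular node.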

Does anyone know a simpler way to do this kind of search with lxml or XPath?
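
One direction worth trying is the XPath `preceding::` axis, which covers every node that ends before the context node starts, across the whole tree and not just among siblings. The following is a minimal sketch, assuming lxml is installed; the helper names `nearest_left_block` and `text_between` are made up for illustration, and the tag set is copied from the function further down. It does not handle block-level ancestors (the `ancestor::` axis would be needed for those) or a nearer block nested inside another preceding block.

```python
from lxml import etree

# Assumption: the same block-level tag set used later in this question.
BLOCK_LEVEL = {'blockquote', 'br', 'dd', 'div', 'dl', 'dt', 'h1', 'h2', 'h3',
               'h4', 'h5', 'h6', 'hr', 'li', 'ol', 'p', 'pre', 'td', 'ul'}

# XPath 1.0 predicate that is true for any block-level element.
IS_BLOCK = ' or '.join('self::%s' % tag for tag in sorted(BLOCK_LEVEL))

# Counts the non-whitespace text nodes that come before the context node.
TEXT_BEFORE = 'count(preceding::text()[normalize-space()])'
# Counts the non-whitespace text nodes inside the context node.
TEXT_INSIDE = 'count(descendant::text()[normalize-space()])'


def nearest_left_block(el):
    """Return the closest block-level element that ends before `el` starts, or None."""
    # `preceding::` is a reverse axis, so [1] selects the nearest match.
    hits = el.xpath('preceding::*[%s][1]' % IS_BLOCK)
    return hits[0] if hits else None


def text_between(block, el):
    """True if non-whitespace text lies between the end of `block` and the start of `el`."""
    up_to_block_end = block.xpath(TEXT_BEFORE) + block.xpath(TEXT_INSIDE)
    return el.xpath(TEXT_BEFORE) > up_to_block_end


doc = etree.fromstring(
    '<body>root<span><a><b><h1>heading</h1></b></a>'
    '<span id="tricky">tricky</span><span id="later">later</span></span></body>')
tricky = doc.xpath('//*[@id="tricky"]')[0]
later = doc.xpath('//*[@id="later"]')[0]

print(nearest_left_block(tricky).tag)
print(text_between(nearest_left_block(tricky), tricky))
print(text_between(nearest_left_block(later), later))
```

Here the h1 is found for both spans even though it is neither a sibling nor an ancestor of either one; only the second span has text between itself and the h1.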

Example

<html>
<body class="first">
root
<!-- A span that does not have its own content, but does have several levels of children-->
<span>
  <a>
    <b>
      <h1 class="first">
        A block level that is the descendant of several non block levels
      </h1>
    </b>
  </a>
  <span class="first" id="tricky">
    A non-block level that has no block levels among its ancestors, but a block level element among its left cousins
  </span>
  <span>
    A non-block level that has no block levels among its ancestors, and content between itself and its nearest left-cousin block level
  </span>
</span>
<div class="first">
a block level
</div>
<div>
<span class="first">first content in a non block level in a block level</span>
<span>following content in a non block level in a block level</span>
</div>
<div>
  <i>  </i><b class="first">a non block level that contains the first content within a block level, but follows an empty non-block level</b>
</div>
</body>
</html>

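In the example above I have added the class "first" to every element whose content, when rendered, appears as the leading content of a line. Of particular interest is the element with id "tricky", because it renders as the first content of a line even though neither its ancestors nor its siblings are block-level elements. "tricky" ends up on a new line because a descendant (the h1) of one of its siblings is block level, and no other content follows the h1.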
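
The rule described above can also be sketched as a single forward pass over the tree, tracking whether any non-whitespace text has appeared since the last block-level boundary. This is only an illustration of the rule, not the approach taken below: the function name `leading_elements` is hypothetical, the tag set is copied from the follow-up code, and the demo uses the standard library's `xml.etree.ElementTree` (which shares the `.tag`/`.text`/`.tail` model with lxml) on a condensed version of the example.

```python
import xml.etree.ElementTree as ET

# Assumption: the same block-level tag set used in the follow-up code.
BLOCK_LEVEL = {'blockquote', 'br', 'dd', 'div', 'dl', 'dt', 'h1', 'h2', 'h3',
               'h4', 'h5', 'h6', 'hr', 'li', 'ol', 'p', 'pre', 'td', 'ul'}


def leading_elements(root):
    """Return the elements whose own text is the first content of a rendered line."""
    leading = []
    line_has_content = False  # non-whitespace text seen since the last line break?

    def visit(node):
        nonlocal line_has_content
        if not isinstance(node.tag, str):
            return  # lxml comments and processing instructions carry no content
        is_block = node.tag in BLOCK_LEVEL
        if is_block:
            line_has_content = False  # opening a block starts a new line
        if node.text and node.text.strip():
            if not line_has_content:
                leading.append(node)
            line_has_content = True
        for child in node:
            visit(child)
            # Text after the child's closing tag counts as content on the line.
            if child.tail and child.tail.strip():
                line_has_content = True
        if is_block:
            line_has_content = False  # closing a block ends the line

    visit(root)
    return leading


doc = ET.fromstring(
    '<body id="body">root'
    '<span><a><b><h1 id="h1">heading</h1></b></a>'
    '<span id="tricky">after the heading</span>'
    '<span id="later">same line as tricky</span></span>'
    '<div id="div">a block level</div>'
    '<div><i>  </i><b id="b">first content</b></div>'
    '</body>')
print([el.get('id') for el in leading_elements(doc)])
# ['body', 'h1', 'tricky', 'div', 'b']
```

A forward pass classifies every element in one walk, which may be convenient when the check is needed for many elements of the same page; for a single element, a backward search from that element touches fewer nodes.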

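Follow-up: At this point I have written a function in Python that performs a kind of backward traversal. It is a bit complicated, but it seems to work: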

block_level = {'blockquote','br','dd','div','dl','dt','h1','h2','h3','h4','h5','h6','hr','li','ol','p','pre','td','ul'}

# Returns true if the content of the provided element is the leading content of a line
# This function runs on HTML elements before any translation occurs
# Here 'content' refers to non-whitespace characters
def is_first_in_line_html(self, el):
    # This element contains no content, so it can't be the leading content of a line.
    if el.text is None or el.text.strip() == '': return False

    # This element has content and is a block level, so its content is the leading content of a line.
    if el.tag in block_level: return True

    # This element has content, is not a block level, and is the body element. Definitely leading content of a line.
    if el.tag == 'body': return True

    # Final case - is there content between the present element and the nearest block level element to the left of the present
    # element.    

    def traverse_children(element, bound_text):
        children = element.iterchildren(reversed=True)
        for child in children:
            if child.tail is not None: bound_text = child.tail + bound_text
            if bound_text.strip() != '': return False
            if child.tag in block_level: return bound_text.strip() == ''
            rst_children = traverse_children(child, bound_text)
            if rst_children is not None: return rst_children
            if child.text is not None: bound_text = child.text + bound_text
            if bound_text.strip() != '': return False
        return None

    def traverse_left_sibs_and_ancestors(element, bound_text):
        left_sibs = element.itersiblings(preceding=True)
        for sib in left_sibs:
            if sib.tail is not None: bound_text = sib.tail + bound_text
            if bound_text.strip() != '': return False
            if sib.tag in block_level: return bound_text.strip() == ''
            rst_children = traverse_children(sib, bound_text)
            if rst_children is not None: return rst_children
            if sib.text is not None: bound_text = sib.text + bound_text
            if bound_text.strip() != '': return False
        parent = element.getparent()
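        # Text to the left of `element` inside its parent is parent.text;
        # parent.tail comes after the parent's closing tag, i.e. to the right.
        if parent.text is not None: bound_text = parent.text + bound_text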
        if parent.tag == 'body': return bound_text.strip() == ''
        if parent.tag in block_level: return bound_text.strip() == ''
        return traverse_left_sibs_and_ancestors(parent, bound_text)

    return traverse_left_sibs_and_ancestors(el, '')
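
The function above leans on the ElementTree text model, in which an element's `.text` is the text before its first child and its `.tail` is the text after its own closing tag. The small sketch below illustrates that model; it uses the standard library's `xml.etree.ElementTree`, which shares these attributes with lxml, and `text_left_of` is a hypothetical helper written only for this illustration.

```python
import xml.etree.ElementTree as ET

p = ET.fromstring('<p>one<b>two</b>three<i>four</i>five</p>')
b, i = p  # the two child elements, in document order

# Text before the first child belongs to the parent's .text ...
print(repr(p.text))                  # 'one'
# ... while text after a child's closing tag is that child's .tail.
print(repr(b.text), repr(b.tail))    # 'two' 'three'
print(repr(i.text), repr(i.tail))    # 'four' 'five'


def text_left_of(parent, child):
    """Collect the text inside `parent` that precedes `child`, in document order."""
    parts = [parent.text or '']
    for sib in parent:
        if sib is child:
            break
        parts.append(''.join(sib.itertext()))  # text inside the sibling
        parts.append(sib.tail or '')           # text after the sibling
    return ''.join(parts)


print(repr(text_left_of(p, i)))      # 'onetwothree'
```

Walking backwards therefore means reading each preceding sibling's `.tail` first, then its contents, and finally the parent's `.text`.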

0 answers:

No answers yet.