Python: get the HTML element/node/tag at an exact position

Date: 2014-08-18 19:00:46

Tags: python html python-3.x

I have a long HTML document, and I know the exact position of some text in it. For example:

<html>
  <body>
    <div>
      <a>
        <b>
          I know the exact position of this text
        </b>
        <i>
          Another text
        </i>
      </a>
    </div>
  </body>
</html>

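I know that the sentence "I know the exact position of this text" starts at character number 'x' and ends at character number 'y'. But I need to get the whole tag/node/element that holds this value, and possibly several of its ancestors.

How can I handle this easily?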

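// Edit

To be clear: the only thing I have is an integer value that describes where the sentence starts.

For example: 2048.

I can't assume anything about the document's structure. Starting from that position, I have to walk up through the enclosing nodes, ancestor by ancestor.

Even the sentence pointed to by the position (2048) is not necessarily unique.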

2 answers:

Answer 0 (score: 1)

Assuming <b> is unique in this instance, you can use XPath with xml.etree.ElementTree.

import xml.etree.ElementTree as ET  # module name is case-sensitive
tree = ET.parse('xmlfile')
root = tree.getroot()
# './/*[b]' matches every element that has a direct <b> child
myEle = root.findall(".//*[b]")

myEle is now a list holding a reference to the parent of 'b', in this case 'a'.

If you only want the b element itself, you can do:
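As a quick, self-contained check of the query above, the sketch below parses the sample document from the question out of a string instead of a file. The `SAMPLE` and `parents_of_b` names are only for illustration, and the approach assumes the input is well-formed XML, since `xml.etree.ElementTree` is an XML parser and rejects typical malformed HTML.

```python
import xml.etree.ElementTree as ET

# Sample document from the question, kept in a string for illustration.
SAMPLE = """<html>
  <body>
    <div>
      <a>
        <b>
          I know the exact position of this text
        </b>
        <i>
          Another text
        </i>
      </a>
    </div>
  </body>
</html>"""

root = ET.fromstring(SAMPLE)

# findall() always returns a list, even when only one element matches.
parents_of_b = root.findall(".//*[b]")
print([el.tag for el in parents_of_b])  # ['a']
```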

myEle = root.findall(".//b")

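If you want the children of a, you can do a couple of different things:

myEle = root.findall(".//a/*")  # direct children of a
myEle = root.findall('.//*[a]//*')[1:]  # descendants of a's parent, skipping a itself (the first match here)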

For more information about XPath, look here: XPath

Answer 1 (score: 0)
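The question also asks for the ancestors of the matched element. `xml.etree.ElementTree` elements do not keep a reference to their parent, so one common workaround is to build a child-to-parent map once and walk it upwards. The following is a minimal sketch of that idea; the `parent_of` map and the `ancestors` helper are hypothetical names, not part of the library.

```python
import xml.etree.ElementTree as ET

# Compact version of the question's sample, for illustration only.
SAMPLE = (
    "<html><body><div><a>"
    "<b>I know the exact position of this text</b>"
    "<i>Another text</i>"
    "</a></div></body></html>"
)
root = ET.fromstring(SAMPLE)

# Map every child element to its parent (the root has no entry).
parent_of = {child: parent for parent in root.iter() for child in parent}


def ancestors(element):
    """Return the ancestors of element, nearest first (hypothetical helper)."""
    chain = []
    while element in parent_of:
        element = parent_of[element]
        chain.append(element)
    return chain


b = root.find(".//b")
print([el.tag for el in ancestors(b)])  # ['a', 'div', 'body', 'html']
```

The map has to be rebuilt if the tree is modified afterwards.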

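You can read the content of the whole HTML document as a string. Then you can build a modified string with a marker (an HTML anchor element with a unique id) inserted at the known position, and parse that string with xml.etree.ElementTree as if the marker had been in the original document. Then you can use XPath to find the parent element of the marker and remove the auxiliary marker. The result has the same structure as if the original doc had been parsed. But now you know the element that holds the text!

Warning: you must know whether the position is a byte position or an abstract character position. (Think of multi-byte encodings, or encodings that use variable-length sequences for some characters. Also consider line endings: one byte or two.)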
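To make the warning concrete, here is a small sketch of the two conversions it hints at. Both helpers (`byte_to_char_offset` and `crlf_to_lf_offset`) are hypothetical names introduced for illustration, and the sample strings are made up. The sketch assumes the byte offset falls on a character boundary; otherwise decoding the prefix raises `UnicodeDecodeError`.

```python
def byte_to_char_offset(raw: bytes, byte_pos: int, encoding: str = "utf-8") -> int:
    """Translate a byte offset into raw to a character offset in the decoded text."""
    return len(raw[:byte_pos].decode(encoding))


def crlf_to_lf_offset(text_with_crlf: str, pos: int) -> int:
    """Translate an offset counted with CRLF line endings to one counted with LF only."""
    return pos - text_with_crlf[:pos].count("\r\n")


# Multi-byte characters: the byte offset runs ahead of the character offset.
text = "<p>naïve café</p><b>target</b>"
raw = text.encode("utf-8")
byte_pos = raw.index(b"target")
char_pos = byte_to_char_offset(raw, byte_pos)
print(byte_pos, char_pos, text[char_pos:char_pos + 6])

# Line endings: every CRLF before the position shifts the offset by one.
crlf_text = "<r>\r\n  <b>\r\n    hello\r\n  </b>\r\n</r>"
lf_text = crlf_text.replace("\r\n", "\n")
lf_pos = crlf_to_lf_offset(crlf_text, crlf_text.index("hello"))
print(lf_text[lf_pos:lf_pos + 5])
```

Note that opening a file in text mode with the default `newline` setting already converts `\r\n` to `\n`, so an offset measured on the raw file with Windows line endings needs the second conversion before it is used on the string returned by `read()`.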

Try the following, with the sample from the question stored in doc.html (the file name the script opens) using Windows line endings:

#!python3

import xml.etree.ElementTree as ET

fname = 'doc.html'
pos = 64

with open(fname, encoding='utf-8') as f:
    content = f.read()

# The position_id will be used in XPath, the position_anchor
# uses the variable only for readability. The position anchor
# has the form of an HTML element to be found easily using 
# the XPath expression.
position_id = 'my_unique_position_{}'.format(pos)
position_anchor = '<a id="{}" />'.format(position_id)

# The modified content has one extra anchor as the position marker.
modified_content = content[:pos] + position_anchor + content[pos:]

root = ET.fromstring(modified_content)
ET.dump(root)
print('----------------')

# Now some examples for getting the info around the point.
# '.' = from here; '//' = wherever; 'a[@id=...]' = anchor (a) element
# with the attribute id with the value. 
# We will not use it later -- only for demonstration.
anchor_element = root.find('.//a[@id="{}"]'.format(position_id))
ET.dump(anchor_element)
print('----------------')

# The text at the original position -- the text became the tail 
# of the element.
print(repr(anchor_element.tail))
print('================')

# Now, from scratch, get the nearest parent from the position.
parent = root.find('.//a[@id="{}"]/..'.format(position_id))
ET.dump(parent)
print('----------------')

# ... and the anchor element (again) as the nearest child
# with the attributes.
anchor = parent.find('./a[@id="{}"]'.format(position_id))
ET.dump(anchor)
print('----------------')

# If the marker split the text, part of the text belongs to 
# the parent, part is the tail of the anchor marker.
print(repr(parent.text))
print(repr(anchor.tail))
print('----------------')

# Modify the parent to remove the anchor element (to get
# the original structure without the marker). Do not forget
# that the text became part of the marker element as its tail.
# This assumes the marker is the first child of its parent.
parent.remove(anchor)
parent.text = (parent.text or '') + (anchor.tail or '')
ET.dump(parent)
print('----------------')

# The structure of the whole document now does not contain 
# the added anchor marker, and you get the reference
# to the nearest parent.
ET.dump(root)
print('----------------')

It prints the following:

c:\_Python\Dejwi\so25370255>a.py
<html>
  <body>
    <div>
      <a>
        <b>
          I know<a id="my_unique_position_64" /> the exact position of this text

        </b>
        <i>
          Another text
        </i>
      </a>
    </div>
  </body>
</html>
----------------
<a id="my_unique_position_64" /> the exact position of this text

----------------
' the exact position of this text\n        '
================
<b>
          I know<a id="my_unique_position_64" /> the exact position of this text

        </b>

----------------
<a id="my_unique_position_64" /> the exact position of this text

----------------
'\n          I know'
' the exact position of this text\n        '
----------------
<b>
          I know the exact position of this text
        </b>

----------------
<html>
  <body>
    <div>
      <a>
        <b>
          I know the exact position of this text
        </b>
        <i>
          Another text
        </i>
      </a>
    </div>
  </body>
</html>
----------------
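The walkthrough above can be condensed into one function. This is only a sketch under a few assumptions: the document is well-formed XML/XHTML, `pos` is a character offset into the string, and it points into text content rather than into the middle of a tag or attribute (otherwise `ET.fromstring` raises `ParseError`). The name `element_at` is hypothetical. Unlike the script above, it also handles a marker that lands after a sibling element, where the split text belongs to the previous sibling's tail rather than to the parent's text.

```python
import xml.etree.ElementTree as ET


def element_at(content: str, pos: int):
    """Return (root, element): the parsed tree and the nearest element
    enclosing character offset pos (hypothetical helper)."""
    marker_id = "my_unique_position_{}".format(pos)
    marker = '<a id="{}" />'.format(marker_id)
    root = ET.fromstring(content[:pos] + marker + content[pos:])

    # Nearest parent of the marker, then the marker itself.
    parent = root.find('.//a[@id="{}"]/..'.format(marker_id))
    anchor = parent.find('./a[@id="{}"]'.format(marker_id))

    # Give the split-off text back to where it came from.
    children = list(parent)
    index = children.index(anchor)
    tail = anchor.tail or ""
    if index == 0:
        parent.text = (parent.text or "") + tail
    else:
        previous = children[index - 1]
        previous.tail = (previous.tail or "") + tail
    parent.remove(anchor)
    return root, parent


SAMPLE = """<html>
  <body>
    <div>
      <a>
        <b>
          I know the exact position of this text
        </b>
        <i>
          Another text
        </i>
      </a>
    </div>
  </body>
</html>"""

root, element = element_at(SAMPLE, SAMPLE.index(" the exact"))
print(element.tag)           # b
print(element.text.strip())  # I know the exact position of this text
```

If the ancestors are needed as well, a child-to-parent map built from `root.iter()` can be walked upwards from the returned element.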