Question

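I am processing HTML strings with Python. Given a piece of text inside an html string (the start and end offsets of the text are known), I want to find its parent tag.

For example, consider the following html string: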

<html><body><span id="1234">The Dormouse's story</span><body></head>

With the input offsets (33, 43), i.e. the string 'Dormouse's', the parent tag is <span id="1234">.

Answer 1

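Here you go. Since you have the offsets (I think you may need to adjust them, because I had to use (28, 48)):

Create a substring from the offsets.
Split the full html string with split(), using the offset substring as the delimiter.
Take the first substring produced by the split and split it on >.

The second-to-last substring in that list is your parent tag (the last one is an empty string, because split() returns an empty string when the delimiter sits at the end of the string being split):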

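 html_string = '<html><body><span id="1234">The Dormouse\'s story</span><body></head>'
 # The text whose parent tag we want, taken from the known offsets
 offset_string = html_string[28:48]
 # Everything before the first occurrence of that text
 tags_together = html_string.split(offset_string)[0]
 # Splitting on '>' leaves a trailing '' because tags_together ends with '>'
 list_of_tags = tags_together.split('>')
 parent_tag = list_of_tags[-2]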

Note that the '>' will be missing, so add it back if you need it:

 parent_tag = parent_tag + ">"

I put html_string in single quotes because it already contains double quotes.
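One caveat with the steps above: split() cuts at the first occurrence of the text, which may not be the occurrence at your offset. A minimal sketch of a variant that slices at the start offset instead is shown here. The function name tag_before_offset is made up for illustration, and like the split() version it returns the nearest tag before the text, which is the parent only when the text directly follows its opening tag.

```python
def tag_before_offset(html, start):
    """Return the last complete tag that ends before `start`, or None."""
    prefix = html[:start]                 # everything before the text
    close_idx = prefix.rfind('>')         # end of the nearest preceding tag
    open_idx = prefix.rfind('<', 0, close_idx + 1)  # start of that same tag
    if close_idx == -1 or open_idx == -1:
        return None                       # no tag before the offset
    return prefix[open_idx:close_idx + 1]


html_string = '<html><body><span id="1234">The Dormouse\'s story</span><body></head>'
print(tag_before_offset(html_string, 28))

# The same text appears twice here; offset 18 points at the second copy.
repeated = '<p>story</p><span>story</span>'
print(tag_before_offset(repeated, 18))
```

Because the slice uses the offset rather than the text itself, the second call looks at the span, whereas splitting on 'story' would stop at the first paragraph.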

This is crude and a bit brute-force, but it should get the job done. I am sure there is a Python library out there that can do this kind of task for you. You just need to look hard enough!
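As one possible candidate for such a library, the standard library's html.parser can track which elements are open while it walks the string. The sketch below is an assumption-laden illustration, not a tested solution for arbitrary HTML: the names ParentTagFinder and find_parent_tag are made up, it assumes the html string has no newlines (so the column from getpos() equals the string offset), and it only looks at plain text, not at character entities.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ParentTagFinder(HTMLParser):
    """Records the start tag that is open when the text at `target` is seen."""

    def __init__(self, target):
        # convert_charrefs=False keeps len(data) equal to the raw text length.
        super().__init__(convert_charrefs=False)
        self.target = target   # character offset into a single-line string
        self.stack = []        # (tag name, raw start-tag text) of open elements
        self.parent = None

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append((tag, self.get_starttag_text()))

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, if there is one.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # getpos() is (line, column); column == offset only without newlines.
        _, col = self.getpos()
        in_range = col <= self.target < col + len(data)
        if self.parent is None and in_range and self.stack:
            self.parent = self.stack[-1][1]


def find_parent_tag(html, offset):
    finder = ParentTagFinder(offset)
    finder.feed(html)
    finder.close()
    return finder.parent


html_string = '<html><body><span id="1234">The Dormouse\'s story</span><body></head>'
print(find_parent_tag(html_string, 32))
```

Keeping a stack of open elements means text that follows a closed sibling, such as the word tail in '<p><b>x</b> tail</p>', is attributed to <p> rather than to the nearest preceding tag.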

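I suggest opening a Python shell and printing each variable right after you create it, so you can see what split() does. The Python documentation for str.split() covers the details.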
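For instance, a short session along these lines prints each intermediate value, using the same string and the (28, 48) offsets from the answer. The variable name before_text is only illustrative.

```python
html_string = '<html><body><span id="1234">The Dormouse\'s story</span><body></head>'

offset_string = html_string[28:48]
print(offset_string)           # The Dormouse's story

before_text = html_string.split(offset_string)[0]
print(before_text)             # <html><body><span id="1234">

# The trailing '' appears because before_text ends with the delimiter '>'.
print(before_text.split('>'))  # ['<html', '<body', '<span id="1234"', '']
```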

Now that I think about it, a regular expression combined with the known offsets could also get the tag...
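A rough sketch of that regex idea, under the assumption that the parent is the nearest opening tag before the start offset (which fails if a sibling element opens and closes in between). The pattern and the name last_start_tag_before are illustrative, and a regular expression like this does not handle every form of valid HTML, for example a '>' inside an attribute value.

```python
import re

# An opening tag: '<', a letter, then anything up to the next '>'.
# Closing tags such as </span> do not match because '/' is not a letter.
START_TAG = re.compile(r'<[A-Za-z][^<>]*>')


def last_start_tag_before(html, start):
    """Return the last opening tag found before `start`, or None."""
    matches = START_TAG.findall(html[:start])
    return matches[-1] if matches else None


html_string = '<html><body><span id="1234">The Dormouse\'s story</span><body></head>'
print(last_start_tag_before(html_string, 28))
```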
