Question

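When parsing a long, complex HTML document with BeautifulSoup, it is sometimes useful to get the exact position in the original string where an element was matched. I can't simply search the string, because there may be multiple matching elements and I would lose bs4's ability to parse the DOM. Given this minimal working example: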

import bs4

html = "<div><b>Hello</b>  <i>World</i></div>"
soup = bs4.BeautifulSoup(html,'lxml')

# Returns 22
print(html.find("World"))

# How to get this to return 22?
print(soup.find("i", text="World"))

How can I get the element extracted by bs4 to return 22?

Answer 1

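As I understand it, your problem is that "World" may appear many times, but you want the position of one specific occurrence (which you somehow know how to identify).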

You can use this workaround. I bet there are more elegant solutions, but this should do it:

Given this HTML:

import bs4

html = """<div><b>Hello</b>  <i>World</i></div>
          <div><b>Hello</b>  <i>Foo World</i></div>
          <div><b>Hello</b>  <i>Bar World</i></div>"""

soup = bs4.BeautifulSoup(html,'html.parser')

If we want the position where Foo World occurs, we can:

1. Get the tag.
2. Insert a unique string that we know does not occur anywhere else in the HTML.
3. Get the position of the string we inserted.

import bs4

html = """<div><b>Hello</b>  <i>World</i></div>
          <div><b>Hello</b>  <i>Foo World</i></div>
          <div><b>Hello</b>  <i>Bar World</i></div>"""

soup = bs4.BeautifulSoup(html,'html.parser')

# 1. Get the tag
desired_tag = soup.find("i", text="Foo World")
# 2. Insert a unique marker string as the tag's first child
desired_tag.insert(0, "some_unique_string")

print(str(soup))
"""
Will show (html.parser keeps the original whitespace):
<div><b>Hello</b>  <i>World</i></div>
          <div><b>Hello</b>  <i>some_unique_stringFoo World</i></div>
          <div><b>Hello</b>  <i>Bar World</i></div>
"""

# 3. Find the marker in the serialized document
print(str(soup).find("some_unique_string"))
"""
70
"""
