Question

Does BeautifulSoup provide a way to get the string index of a tag, or of its text, within the HTML string it came from?

For example:

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'lxml')

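Is there a way to know the string index within html_doc at which the paragraph (soup.p) starts? Or the index at which its text (The Dormouse's story) starts?

Edit: the expected index for soup.p is 63, i.e. html_doc.index('''<p class="title">'''). The expected index for its text is 83. I am not using str.index() because the index it returns may not correspond to the tag in question.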
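To illustrate the concern about str.index(), here is a small sketch using only plain string methods on the html_doc from the question. The variable names (text, first, second, tag_start) are just for illustration.

```python
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
</body>
</html>
"""

text = "The Dormouse's story"

# str.index() returns the first match, which sits inside <title>, not inside <p>.
first = html_doc.index(text)
# Searching again from just past the first match finds the one inside <p><b>.
second = html_doc.index(text, first + 1)
# The opening <p> tag itself.
tag_start = html_doc.index('<p class="title">')

print(first, second, tag_start)  # 20 83 63
```

The same text appears twice, so a plain search cannot tell which occurrence belongs to soup.p without extra bookkeeping.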

Answer 1

You can do it like this.

print(soup.find("p").text)

The output is:

The Dormouse's story

You can change the contents of html_doc to verify the logic.

Change html_doc like this.

html_doc = """
<html><head><title>The EEEE's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
</body>
</html>
"""

The code produces the same output as above.

Answer 2

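It looks like you are doing some web scraping. I suggest you look into XPath: search for an XPath library for the language you are writing in.

With an XPath selector, you can find the text node like this: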

("//text()[contains(.,"The Dormouse's story")]")

From there, if you need the paragraph element, just select the text node's parent.

Answer 3

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="title"><b>The Dormouse's story</b></p>
</body>
</html>
"""
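def findall(patt, s):
    '''Yields all the positions of the pattern patt in the string s.'''
    i = s.find(patt)
    while i != -1:
        yield i
        # Resume one character later so overlapping matches are found too.
        i = s.find(patt, i+1)

soup = BeautifulSoup(html_doc, 'html.parser')
# Serialize the parsed tree; the indices below refer to this string,
# which is not guaranteed to be identical to html_doc.
x = str(soup)
# Serialized form of the first <p class="title"> tag.
y = str(soup.find("p", {'class':'title'}))
# Print every position in x where that serialized tag occurs.
print([(i, x[i:i+len(y)]) for i in findall(y, x)])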
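The positions above are offsets into str(soup), not necessarily into the original html_doc. If offsets into the original string are what you need, one possible sketch is to use the standard library's html.parser.HTMLParser directly (this is not a BeautifulSoup API) and convert its getpos() line and column into an absolute index. OffsetParser and absolute_offset are hypothetical names made up for this example, and it assumes lines are separated by "\n". As far as I know, recent BeautifulSoup versions also expose Tag.sourceline and Tag.sourcepos with the 'html.parser' builder, but those give a line and column rather than an absolute index.

```python
from html.parser import HTMLParser

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
</body>
</html>
"""


class OffsetParser(HTMLParser):
    """Hypothetical helper: records absolute offsets of start tags and text."""

    def __init__(self, source):
        super().__init__()
        # Absolute index of the first character of every line ("\n" only).
        self.line_starts = [0] + [i + 1 for i, ch in enumerate(source) if ch == "\n"]
        self.tags = []   # (tag name, index of its '<')
        self.texts = []  # (text, index of its first character)
        self.feed(source)
        self.close()

    def absolute_offset(self):
        # getpos() returns a 1-based line number and a 0-based column.
        line, col = self.getpos()
        return self.line_starts[line - 1] + col

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, self.absolute_offset()))

    def handle_data(self, data):
        # Skip whitespace-only text between tags.
        if data.strip():
            self.texts.append((data, self.absolute_offset()))


parser = OffsetParser(html_doc)
print(parser.tags)
print(parser.texts)
```

For the html_doc from the question, this should report the <p> start tag at 63 and the text inside <b> at 83, which matches the expected indices in the question.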

Finding the string index of a tag in BeautifulSoup
