Question

I have the following HTML code:

from bs4 import BeautifulSoup
soup = BeautifulSoup("<div class='mydiv'><p style='xyz'>123</p><p>456</p><p style='xyz'>789</p><p>abc</p></div>", 'lxml')

Now, I have the text "456".

I would like to find the text of all tags that have the same tag name as the tag containing the text "456".

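That is, in the HTML, a <p> tag contains 456, so we should find abc (it sits in a plain <p>), but not 123 or 789, because they sit in <p style='xyz'> tags.

Note that the <p> above could be some other tag.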

Searching the <div> should be avoided.

The final result is soup.find('p').

It is a bit complicated.

How can we solve this?

Thanks.

Answer 1

This script prints all tags that share both the tag name and the tag attributes with the tag containing the string "456":

from bs4 import BeautifulSoup

txt = '''
    <div class='mydiv'>
        <p style='xyz'>123</p>
        <p>456</p>
        <p style='xyz'>789</p>
        <p>abc</p>
    </div>'''

text_to_find = '456'
soup = BeautifulSoup(txt, 'html.parser')

# Find the first tag whose first child is exactly the search text
tmp = soup.find(lambda t: t.contents and t.contents[0] == text_to_find)
if tmp:
    # Print every tag with the same name and the same attributes as that tag
    for tag in soup.find_all(lambda t: t.name == tmp.name and t.attrs == tmp.attrs):
        print(tag)

Prints:

<p>456</p>
<p>abc</p>

For the input "123":

<p style="xyz">123</p>
<p style="xyz">789</p>
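
The same idea can be wrapped in a small helper so that it works for any search text. This is only a minimal sketch: the function name `tags_like` is hypothetical and chosen for illustration, and the built-in `html.parser` is assumed. It is one possible packaging of the approach above, not part of the original answer.

```python
from bs4 import BeautifulSoup

HTML = (
    "<div class='mydiv'><p style='xyz'>123</p><p>456</p>"
    "<p style='xyz'>789</p><p>abc</p></div>"
)


def tags_like(soup, text_to_find):
    """Return tags sharing name and attributes with the tag holding text_to_find.

    `tags_like` is a hypothetical helper name used only for this illustration.
    """
    # Locate the first tag whose first child equals the search text.
    anchor = soup.find(lambda t: t.contents and t.contents[0] == text_to_find)
    if anchor is None:
        return []
    # Collect every tag with the same name and an identical attribute dict.
    return soup.find_all(
        lambda t: t.name == anchor.name and t.attrs == anchor.attrs
    )


soup = BeautifulSoup(HTML, "html.parser")
print([tag.get_text() for tag in tags_like(soup, "456")])
print([tag.get_text() for tag in tags_like(soup, "123")])
```

With the sample HTML, the first call should give the texts 456 and abc, and the second 123 and 789. A search text that does not occur returns an empty list instead of raising.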

Answer 2

There are actually several ways to do this. Here are two examples that find what you are looking for:

from bs4 import BeautifulSoup

soup = BeautifulSoup("<div class='mydiv'><p style='xyz'>123</p><p>456</p><p style='xyz'>789</p><p>abc</p></div>", 'lxml')

# Approach 1: collect every tag, then keep those whose text equals the string
found = [x for x in soup.find_all() if x.text == "456"]

for p in found:
    print(p)

# Approach 2: let find_all match on the string directly
found = soup.find_all(string="456")

for p in found:
    print(p)

<p>456</p>

456

Note, however, that the second approach returns NavigableString objects, not Tag objects!
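
When a NavigableString is what comes back, the enclosing tag can be reached through its `.parent` attribute, which also yields the tag name the question asks about. A small sketch, assuming the same sample HTML and the built-in `html.parser` instead of `lxml`:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<div class='mydiv'><p style='xyz'>123</p><p>456</p>"
    "<p style='xyz'>789</p><p>abc</p></div>",
    "html.parser",
)

# find(string=...) returns the matching NavigableString, or None if absent.
node = soup.find(string="456")
if node is not None:
    parent = node.parent  # the Tag that directly contains the string
    print(parent.name)    # tag name of the enclosing element
    print(parent)         # the enclosing element itself
```

The tag name obtained this way can then be passed to `find_all` to collect the other tags with that name.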

Answer 3

Try:

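from bs4 import BeautifulSoup

soup = BeautifulSoup("<div class='mydiv'><p style='xyz'>123</p><p>456</p><p style='xyz'>789</p><p>abc</p></div>", 'html5lib')

# Remove every tag that has a style attribute from the tree
tags = soup.find_all()
for tag in tags:
    if tag.get('style'):
        tag.extract()

# Print the remaining text, one string per line
for tag in soup.select('html body'):
    print(tag.get_text('\n'))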

Prints:

456
abc
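
Extracting tags modifies the tree in place. If the document needs to stay intact, a CSS selector can filter on the absence of the attribute instead. This is a hedged sketch, assuming the tag name `p` is already known and using the built-in `html.parser` rather than `html5lib`:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<div class='mydiv'><p style='xyz'>123</p><p>456</p>"
    "<p style='xyz'>789</p><p>abc</p></div>",
    "html.parser",
)

# Select <p> tags that carry no style attribute; the tree is left untouched.
plain = soup.select("p:not([style])")
texts = [tag.get_text() for tag in plain]
print("\n".join(texts))
```

Unlike the extract-based version, all four <p> tags remain in `soup` afterwards, so later queries against the same document still see them.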

How to find the tag name for a given text in BeautifulSoup
