Question

I have the following HTML (line breaks are marked with \n):

...
<tr>
  <td class="pos">\n
      "Some text:"\n
      <br>\n
      <strong>some value</strong>\n
  </td>
</tr>
<tr>
  <td class="pos">\n
      "Fixed text:"\n
      <br>\n
      <strong>text I am looking for</strong>\n
  </td>
</tr>
<tr>
  <td class="pos">\n
      "Some other text:"\n
      <br>\n
      <strong>some other value</strong>\n
  </td>
</tr>
...

How do I find "text I am looking for"? The code below returns the first value it finds, so I need some way to filter by "Fixed text:".

result = soup.find('td', {'class' :'pos'}).find('strong').text

UPD. If I use the following code:

title = soup.find('td', text = re.compile(ur'Fixed text:(.*)', re.DOTALL), attrs = {'class': 'pos'})
self.response.out.write(str(title.string).decode('utf8'))

then it returns only "Fixed text:".

Answer 1

You can pass a regular expression to the text parameter of findAll, like so:

import BeautifulSoup
import re

columns = soup.findAll('td', text = re.compile('your regex here'), attrs = {'class' : 'pos'})

Answer 2

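This post led me to my answer, even though the answer itself is missing from this post. I felt I should give back.

The challenge here is the inconsistent behavior of BeautifulSoup.find when searching with and without text.

Note: if you have BeautifulSoup installed, you can test this locally via: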

curl https://gist.githubusercontent.com/RichardBronosky/4060082/raw/test.py | python

Code: https://gist.github.com/4060082

# Taken from https://gist.github.com/4060082
from BeautifulSoup import BeautifulSoup
from urllib2 import urlopen
from pprint import pprint
import re

soup = BeautifulSoup(urlopen('https://gist.githubusercontent.com/RichardBronosky/4060082/raw/test.html').read())
# I'm going to assume that Peter knew that re.compile is meant to cache a computation result for a performance benefit. However, I'm going to do that explicitly here to be very clear.
pattern = re.compile('Fixed text')

# Peter's suggestion here returns a list of what appear to be strings
columns = soup.findAll('td', text=pattern, attrs={'class' : 'pos'})
# ...but it is actually a BeautifulSoup.NavigableString
print type(columns[0])
#>> <class 'BeautifulSoup.NavigableString'>

# you can reach the tag using one of the convenience attributes seen here
pprint(columns[0].__dict__)
#>> {'next': <br />,
#>>  'nextSibling': <br />,
#>>  'parent': <td class="pos">\n
#>>       "Fixed text:"\n
#>>       <br />\n
#>>       <strong>text I am looking for</strong>\n
#>>   </td>,
#>>  'previous': <td class="pos">\n
#>>       "Fixed text:"\n
#>>       <br />\n
#>>       <strong>text I am looking for</strong>\n
#>>   </td>,
#>>  'previousSibling': None}

# I feel that 'parent' is safer to use than 'previous' based on http://www.crummy.com/software/BeautifulSoup/bs4/doc/#method-names
# So, if you want to find the 'text' in the 'strong' element...
pprint([t.parent.find('strong').text for t in soup.findAll('td', text=pattern, attrs={'class' : 'pos'})])
#>> [u'text I am looking for']

# Here is what we have learned:
print soup.find('strong')
#>> <strong>some value</strong>
print soup.find('strong', text='some value')
#>> u'some value'
print soup.find('strong', text='some value').parent
#>> <strong>some value</strong>
print soup.find('strong', text='some value') == soup.find('strong')
#>> False
print soup.find('strong', text='some value') == soup.find('strong').text
#>> True
print soup.find('strong', text='some value').parent == soup.find('strong')
#>> True

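Though it is certainly too late to help the OP, I hope they will accept this as the answer, since it addresses all the quandaries around finding by text.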
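The gist above targets BeautifulSoup 3 on Python 2. As a rough sketch of the same "find the string, then climb to its parent" idea on bs4 and Python 3, something like the following should work. The helper name strong_text_after and the inline HTML are made up for illustration, and html.parser is used only to avoid an extra dependency:

```python
import re
from bs4 import BeautifulSoup

# Trimmed stand-in for the markup in the question (hypothetical sample data).
HTML = """
<table>
<tr><td class="pos">Some text:<br><strong>some value</strong></td></tr>
<tr><td class="pos">Fixed text:<br><strong>text I am looking for</strong></td></tr>
<tr><td class="pos">Some other text:<br><strong>some other value</strong></td></tr>
</table>
"""

def strong_text_after(soup, label):
    """Return the <strong> text of every td.pos that contains `label`."""
    pattern = re.compile(re.escape(label))
    results = []
    # With only `string=` given, find_all returns NavigableString objects.
    for text_node in soup.find_all(string=pattern):
        # Climb from the string to the enclosing <td class="pos">.
        cell = text_node.find_parent("td", class_="pos")
        if cell is None:
            continue
        strong = cell.find("strong")
        if strong is not None:
            results.append(strong.get_text(strip=True))
    return results

soup = BeautifulSoup(HTML, "html.parser")
values = strong_text_after(soup, "Fixed text:")
print(values)
```

The string argument needs Beautiful Soup 4.4.0 or later; on older bs4 releases the same keyword is spelled text.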

Answer 3

With bs4 4.7.1+ you can use the :contains pseudo-class to select the td that contains your search string:

from bs4 import BeautifulSoup
html = '''
<tr>
  <td class="pos">\n
      "Some text:"\n
      <br>\n
      <strong>some value</strong>\n
  </td>
</tr>
<tr>
  <td class="pos">\n
      "Fixed text:"\n
      <br>\n
      <strong>text I am looking for</strong>\n
  </td>
</tr>
<tr>
  <td class="pos">\n
      "Some other text:"\n
      <br>\n
      <strong>some other value</strong>\n
  </td>
</tr>'''
soup = BeautifulSoup(html, 'lxml')
print(soup.select_one('td:contains("Fixed text:")'))
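In more recent releases of Soup Sieve (the selector engine behind select and select_one), :contains was renamed to :-soup-contains and the old spelling is deprecated. A minimal sketch under the assumption that Soup Sieve 2.1 or later is installed; the inline HTML is a trimmed stand-in for the markup above, and html.parser replaces lxml only to avoid the extra dependency:

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the question's markup (hypothetical sample data).
HTML = """
<table>
<tr><td class="pos">Some text:<br><strong>some value</strong></td></tr>
<tr><td class="pos">Fixed text:<br><strong>text I am looking for</strong></td></tr>
<tr><td class="pos">Some other text:<br><strong>some other value</strong></td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Select the <strong> that is a direct child of the td.pos containing the label.
strong = soup.select_one('td.pos:-soup-contains("Fixed text:") > strong')
found = strong.get_text(strip=True) if strong is not None else None
print(found)
```

Selecting the strong element directly, rather than the whole td, saves a second lookup when only the value is needed.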

Answer 4

A solution for finding anchor tags that contain a particular keyword:

from bs4 import BeautifulSoup
from urllib.request import urlopen,Request
from urllib.parse import urljoin,urlparse

# Assumes `soup` is an existing BeautifulSoup object and `keyword` is the search string.
rawLinks=soup.findAll('a',href=True)
for link in rawLinks:
    innercontent=link.text
    if keyword.lower() in innercontent.lower():
        print(link)
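The loop above relies on soup and keyword being defined elsewhere. A self-contained sketch of the same case-insensitive filter, with made-up sample HTML and a hypothetical helper name links_with_keyword, might look like this:

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; the last anchor has no href and is skipped.
HTML = """
<p>
  <a href="/docs">Read the Docs</a>
  <a href="/blog">Blog</a>
  <a name="top">Docs anchor without href</a>
</p>
"""

def links_with_keyword(soup, keyword):
    """Return <a href=...> tags whose visible text contains `keyword`, ignoring case."""
    needle = keyword.lower()
    return [a for a in soup.find_all("a", href=True)
            if needle in a.get_text().lower()]

soup = BeautifulSoup(HTML, "html.parser")
hrefs = [a["href"] for a in links_with_keyword(soup, "docs")]
print(hrefs)
```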

Answer 5

result = soup.find('strong', text='text I am looking for').text

Answer 6

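Since Beautiful Soup 4.4.0, a parameter called string does the work that text did in earlier versions.

string is for finding strings, and you can combine it with arguments that find tags: Beautiful Soup will find all tags whose .string matches your value for string. This code finds the tags whose .string is "Elsie":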

soup.find_all("td", string="Elsie")

For more information about string, see the section https://www.crummy.com/software/BeautifulSoup/bs4/doc/#the-string-argument
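One caveat when applying this to the markup in the question: a tag's .string is None when the tag has more than one child, so combining "td" with string= finds nothing there, while the same argument on "strong" works. A small sketch of that behavior, using a trimmed, made-up copy of the question's HTML and html.parser:

```python
import re
from bs4 import BeautifulSoup

# Trimmed stand-in for the question's markup (hypothetical sample data).
HTML = """
<table>
<tr><td class="pos">Some text:<br><strong>some value</strong></td></tr>
<tr><td class="pos">Fixed text:<br><strong>text I am looking for</strong></td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Each td holds a string, a <br> and a <strong>, so td.string is None
# and a tag-plus-string search on "td" cannot match.
td_matches = soup.find_all("td", string=re.compile("Fixed text"))

# <strong> has a single string child, so matching on its .string works.
strong = soup.find("strong", string="text I am looking for")

# Searching for the string alone returns the text node; its parent is the td.
label = soup.find(string=re.compile("Fixed text"))
value = label.parent.find("strong").get_text()

print(len(td_matches), strong, value)
```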

Answer 7

You can solve this with some simple gazpacho parsing:

from gazpacho import Soup

soup = Soup(html)
tds = soup.find("td", {"class": "pos"})
tds[1].find("strong").text

Which will output:

text I am looking for

Answer 8

You can use Beautiful Soup's CSS selector method.

from bs4 import BeautifulSoup
from bs4.element import Tag
from typing import List

# This will work as of BeautifulSoup 4.9.1.
result: List[Tag] = BeautifulSoup(html_string, 'lxml').select(
    'tr td strong:contains("text I am looking for")'
    )
print(result)

Output:

[<strong>text I am looking for</strong>]


How to find a tag with specific text using Beautiful Soup?

8 answers: