Question

I am trying to identify tags in an HTML document based on part of an attribute value.

For example, if I have a BeautifulSoup object:

import requests
from bs4 import BeautifulSoup

r = requests.get("http:/My_Page")

soup = BeautifulSoup(r.text, "html.parser")

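I want the tr tags that have an id attribute whose value has the following format: "news_4343_23255_xxx". I am interested in any tr tag as long as "news" is the first 4 characters of its id attribute value.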

I know I can search like this:

trs = soup.find_all("tr",attrs={"id":True})

This gives me every tr tag that has an id attribute.

How can I search based on a substring?

Answer 1

Use a regular expression to match tr tags whose id starts with "news".

Example

from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(html,  "html.parser")
for i in soup.find_all("tr", {'id': re.compile(r'^news')}):
    print(i)
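The snippet above leaves the `html` variable undefined, so here is a minimal self-contained sketch of the same approach. The sample markup and the id values other than the one quoted in the question are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample document; only the first id comes from the question.
html = """
<table>
  <tr id="news_4343_23255_xxx"><td>a</td></tr>
  <tr id="sports_1"><td>b</td></tr>
  <tr><td>c</td></tr>
  <tr id="news_1_2_yyy"><td>d</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# "^news" anchors the pattern to the start of the id value.
rows = soup.find_all("tr", id=re.compile(r"^news"))
news_ids = [row["id"] for row in rows]
print(news_ids)
```

Rows without an id attribute are skipped, so no extra check for a missing id is needed here.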

Answer 2

Try this:

trs = soup.find_all("tr", id=lambda x: x and x.startswith('news_'))

Reference: Matching id's in BeautifulSoup
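A minimal sketch of the function-based filter on made-up sample markup. The named helper `starts_with_news` is hypothetical and simply spells out the lambda; the guard against `None` covers tr tags that have no id. For comparison, the sketch also runs a CSS attribute-prefix selector through `select()`, which is an alternative not mentioned in this answer and assumes a Beautiful Soup version whose `select()` supports `[attr^=value]`.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document for illustration.
html = """
<table>
  <tr id="news_4343_23255_xxx"><td>a</td></tr>
  <tr id="old_news_9"><td>b</td></tr>
  <tr><td>c</td></tr>
  <tr id="news_7"><td>d</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")


def starts_with_news(value):
    # Guard against a missing id, which may arrive here as None.
    return value is not None and value.startswith("news_")


by_function = [tr["id"] for tr in soup.find_all("tr", id=starts_with_news)]

# Alternative (not from the answer): CSS "starts with" attribute selector.
by_selector = [tr["id"] for tr in soup.select('tr[id^="news_"]')]

print(by_function)
print(by_selector)
```

Both approaches should select the same rows; the selector form is shorter, while the function form allows arbitrary Python logic.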

Answer 3

You can use a regular expression.

import re
from bs4 import BeautifulSoup
import requests


r = requests.get("example")

soup = BeautifulSoup(r.text, 'html.parser')
regex = re.compile('news')
news = soup.find_all("td", {"class" : regex})
print(news)
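Note that the pattern in this answer is not anchored, and it targets the class of td tags, not the id of tr tags. As far as I know, Beautiful Soup applies a compiled pattern with `search()`, so an unanchored pattern matches anywhere in the value. The sketch below, on made-up ids, contrasts that with the anchored form the question asks for.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample document for illustration.
html = """
<table>
  <tr id="news_1"><td>a</td></tr>
  <tr id="old_news_2"><td>b</td></tr>
  <tr id="sports_3"><td>c</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Unanchored: matches "news" anywhere in the id value.
anywhere = [tr["id"] for tr in soup.find_all("tr", id=re.compile("news"))]

# Anchored with "^": matches only ids that begin with "news".
prefix_only = [tr["id"] for tr in soup.find_all("tr", id=re.compile(r"^news"))]

print(anywhere)
print(prefix_only)
```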

Answer 4

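I am adding another case here. It does not solve the exact question asked above, but it may help someone who landed here after reading only the title.

Using a regular expression

Situation: suppose I have a list of div elements with class attributes, as follows: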

<div class='abcd bcde cdef efgh'>some content</div>
<div class='mnop bcde cdef efgh'>some content</div>
<div class='abcd pqrs cdef efgh'>some content</div>
<div class='hijk wxyz cdef efgh'>some content</div>

Notice that the class value strings of the divs above all end with cdef efgh. To extract all of them into a list:

from bs4 import BeautifulSoup
import re # library for regex in python
soup = BeautifulSoup(<your_html_response>, <parser_you_want_to_use>)
elements = soup.find_all('div', {'class': re.compile(r'cdef efgh$')})

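In case you are wondering what these mean:

'r' marks a raw string literal, so backslashes in the pattern reach the regex engine unchanged
'$' means 'cdef efgh' must be at the end of the string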
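A runnable sketch of the class-suffix match with the placeholders filled in. The sample markup is made up, including a third div whose class does not end in cdef efgh. My understanding is that for a multi-valued attribute such as class, Beautiful Soup first tests the pattern against each individual class and then against the space-joined string, which is why a pattern containing a space can match; treat that as an assumption about library behavior.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample document; the third div should not match.
html = """
<div class='abcd bcde cdef efgh'>first</div>
<div class='mnop bcde cdef efgh'>second</div>
<div class='cdef efgh zzzz'>third</div>
"""

soup = BeautifulSoup(html, "html.parser")

# "$" requires the class string to end with "cdef efgh".
pattern = re.compile(r"cdef efgh$")
elements = soup.find_all("div", {"class": pattern})
texts = [el.get_text() for el in elements]
print(texts)
```

Because individual classes are also tested, a shorter pattern such as `efgh$` would match any div that has a class ending in efgh, regardless of its position in the list.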

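Note: this is just one case. Regular expressions can handle almost any matching scenario. You can learn more and experiment with regular expressions at https://regex101.com/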

Beautiful Soup: find tags based on partial attribute value
