Question

I only want to get the href values that start with https.

Calling find_all("a", href="https") on my bs4.BeautifulSoup object does not return any URL links.

I am building a scraper.

Answer 1

Use a CSS attribute selector with the ^= (starts-with) operator. I'm sure this works, but I couldn't quickly find a good example.

 links = [link['href'] for link in soup.select('[href^="https"]')]
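A minimal, self-contained sketch of the selector approach. The sample HTML and the helper name `https_links` are made up for illustration, and the selector is narrowed to `a` tags, which the one-liner above does not do.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the asker's page.
html = """
<a href="https://www.google.com">Secure</a>
<a href="http://www.google.com">Not Secure</a>
<a name="anchor">No href</a>
"""

def https_links(markup):
    """Return the href of every <a> tag whose href starts with 'https'."""
    soup = BeautifulSoup(markup, "html.parser")
    # a[href^="https"] matches <a> tags whose href attribute begins with "https";
    # tags without an href are skipped, so a["href"] cannot raise KeyError here.
    return [a["href"] for a in soup.select('a[href^="https"]')]

print(https_links(html))  # ['https://www.google.com']
```

Note the mixed quoting: the selector string uses single quotes on the outside and double quotes around the attribute value, so the two do not clash.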

Answer 2

You can also use a regular expression in find_all to filter on the tag's href attribute:

soup.find_all('a',href=re.compile('^https'))

Demo

from bs4 import BeautifulSoup
import re
html="""
<a href="https://www.google.com">Secure</a>
<a href="http://www.google.com">Not Secure</a>
"""
soup=BeautifulSoup(html,'html.parser')
print(soup.find_all('a',href=re.compile('^https')))

Output:

[<a href="https://www.google.com">Secure</a>]
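As a rough sketch of why the original call returned nothing: a plain string passed as href is compared against the whole attribute value, so href="https" only matches a link whose href is exactly "https". The sample HTML and the variable names below are assumptions for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the asker's page.
html = """
<a href="https://www.google.com">Secure</a>
<a href="http://www.google.com">Not Secure</a>
"""
soup = BeautifulSoup(html, "html.parser")

# A string filter is an exact match on the attribute value, so nothing is found.
exact_matches = soup.find_all("a", href="https")

# A compiled pattern is searched within the value; ^ anchors it to the start.
https_pattern = re.compile(r"^https://")
urls = [a["href"] for a in soup.find_all("a", href=https_pattern)]

print(exact_matches)  # []
print(urls)           # ['https://www.google.com']
```

Anchoring on `^https://` rather than `^https` is a slightly stricter choice made here; it avoids matching a relative link such as `https-info.html`.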

Documentation:

The keyword arguments

A regular expression as filter

Getting URL links from bs4.BeautifulSoup
