Trying to get the src of all images that contain https:// with BeautifulSoup

import requests
from bs4 import BeautifulSoup

image_list = []
url = 'https://www.example.com'  # requests.get needs an explicit scheme
r = requests.get(url)
soup = BeautifulSoup(r.content, "html5lib")
for link in soup.find_all('img'):
    image_list.append(link.get('src'))
# Caution: removing items from a list while iterating over it skips elements
for link in image_list:
    if 'https' not in link:
        image_list.remove(link)
Answer 0 (score: 1)
You can check whether each src starts with https and filter on that, for example:
from bs4 import BeautifulSoup
image_list=[]
div_test="""
<html>
<div id="d1">
Text 1
</div>
<img src="http://test1.com/1.jpg"></img>
<div id="d2">
Text 2
<a href="http://my.url/">a url</a>
Text 2 continue
</div>
<img src="https://test2.com/2.jpg"></img>
<div id="d3">
Text 3
</div>
<img src="https://test3.com/3.jpg"></img>
</html>
"""
soup = BeautifulSoup(div_test, 'html.parser')
for link in soup.find_all('img'):
    src = link.get('src')
    # keep only sources that start with https; skip <img> tags without a src
    if src and src.startswith("https"):
        image_list.append(src)
print(image_list)
image_list now contains only the https entries (the u prefix below comes from Python 2; Python 3 prints plain strings):
[u'https://test2.com/2.jpg', u'https://test3.com/3.jpg']
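
As a variation on the approach above, here is a minimal sketch that builds a new list rather than removing items from a list while looping over it, since removing during iteration skips elements. The sample markup and the names `html`, `https_only`, and `https_css` are made up for illustration. The second filter assumes a Beautiful Soup version whose `select()` supports CSS attribute selectors (4.7 or later, which bundles soupsieve).

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, standing in for a downloaded page
html = """
<img src="http://test1.com/1.jpg">
<img src="https://test2.com/2.jpg">
<img alt="no src attribute">
<img src="https://test3.com/3.jpg">
"""

soup = BeautifulSoup(html, "html.parser")

# Collect every src (None for <img> tags that have no src attribute)
srcs = [img.get("src") for img in soup.find_all("img")]

# Build a filtered copy instead of mutating the list being iterated
https_only = [s for s in srcs if s and s.startswith("https://")]

# Same filter as a CSS attribute selector: src begins with https://
https_css = [img["src"] for img in soup.select('img[src^="https://"]')]

print(https_only)
print(https_css)
```

Both lists should hold the same two https URLs. Matching on `https://` rather than `https` also avoids keeping a relative path that merely happens to start with those letters.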