How to find image src values that use https with BeautifulSoup

Asked: 2017-07-01 09:35:01

Tags: python beautifulsoup

I am trying to get the src of every image whose URL contains https:// using BeautifulSoup:
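import requests
from bs4 import BeautifulSoup

image_list = []
url = 'https://www.example.com'  # requests needs a scheme in the URL
r = requests.get(url)
soup = BeautifulSoup(r.content, "html5lib")

# collect the src of every <img> tag
for link in soup.find_all('img'):
    image_list.append(link.get('src'))

# try to drop every src that does not contain https
for link in image_list:
    if 'https' not in link:
        image_list.remove(link)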
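
A note on the second loop above: calling remove() on a list while iterating over that same list shifts the remaining items, so the element right after each removed one is never checked. The sketch below reproduces this with made-up URLs (the `srcs` values are hypothetical, not taken from any real page) and contrasts it with building a new list instead.

```python
# Hypothetical sample data, used only to illustrate the behaviour.
srcs = [
    "http://a.example/1.jpg",
    "http://b.example/2.jpg",
    "https://c.example/3.jpg",
]

# Removing while iterating: the list shrinks under the iterator,
# so the item that follows each removed item is skipped.
mutated = list(srcs)
for link in mutated:
    if 'https' not in link:
        mutated.remove(link)

# Building a new list leaves the iterated list untouched.
filtered = [link for link in srcs if 'https' in link]

print(mutated)   # still contains one http URL
print(filtered)  # contains only the https URL
```

Note also that link.get('src') returns None for an img tag without a src attribute, and `'https' not in None` raises a TypeError, so a real page may need a None check as well.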

1 Answer:

Answer 0 (score: 1)

You can check whether each src starts with https and keep only the ones that do, for example:

from bs4 import BeautifulSoup
image_list=[]
div_test="""
<html>
    <div id="d1">
        Text 1
    </div>
    <img src="http://test1.com/1.jpg"></img>
    <div id="d2">
        Text 2
        <a href="http://my.url/">a url</a>
        Text 2 continue
    </div>
    <img src="https://test2.com/2.jpg"></img>

    <div id="d3">
        Text 3
    </div>
    <img src="https://test3.com/3.jpg"></img>
</html>
"""
soup = BeautifulSoup(div_test, 'html.parser')
for link in soup.find_all('img'):
    src = link.get('src')
    if src and src.startswith("https"):  # skip tags without src; keep only https URLs
        image_list.append(src)
print(image_list)

image_list now contains only the https URLs:

[u'https://test2.com/2.jpg', u'https://test3.com/3.jpg']
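
As a possible alternative to the explicit loop, the same filter can probably be expressed with a CSS attribute selector through BeautifulSoup's select() method. This is only a sketch: the `html` sample and the `https_images` name are made up for illustration, and it assumes a reasonably recent bs4 release in which select() supports the `^=` (starts-with) attribute selector.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, including one <img> without a src attribute.
html = """
<img src="http://test1.com/1.jpg">
<img src="https://test2.com/2.jpg">
<img alt="no src attribute">
<img src="https://test3.com/3.jpg">
"""

soup = BeautifulSoup(html, "html.parser")

# img[src^="https://"] matches <img> tags whose src starts with https://
# Tags without a src attribute never match, so no None check is needed.
https_images = [img["src"] for img in soup.select('img[src^="https://"]')]
print(https_images)
```

Matching on the full `https://` prefix rather than the bare string `https` also avoids picking up plain http URLs that merely contain "https" somewhere in the path.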