Question

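I would like to know whether it is possible to scrape the images from websites using code that works for every kind of site (that is, independent of the HTML structure). I have a list of websites and need to get all the images associated with each link. For example: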

list_of_links = ['https://www.bbc.co.uk/programmes/articles/5nxMx7d1K8S6nhjkPBFhHSM/withering-wit-and-words-of-wisdom-oscar-wildes-best-quotes', 'https://www.lastampa.it/torino/2020/03/31/news/coronavirus-il-lockdown-ha-gia-salvato-almeno-400-vite-umane-1.38659569']  # etc.

Usually, I would use:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

link='...'

html = urlopen(link)
bs = BeautifulSoup(html, 'html.parser')
images = bs.find_all('img', {'src':re.compile(r'\.jpg')})
for image in images: 
    print(image['src']+'\n')

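But I have doubts about the HTML structure (can this be used for every website?) and about the image format (.jpg; is it the same for all websites?).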

Thank you for any comments and suggestions.

Answer 1

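Assuming that every image is referenced through the src attribute of an img tag, and that the image elements are not added dynamically (no virtual DOM), a slight modification of your code will do: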

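from urllib.request import urlopen
from bs4 import BeautifulSoup

link = '...'

html = urlopen(link)
bs = BeautifulSoup(html, 'html.parser')
# Match every <img> tag, whatever the file extension of its source.
images = bs.find_all('img')
for image in images:
    src = image.get('src')  # None when the tag has no src attribute
    if src:
        print(src + '\n')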
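
To apply the same idea to a whole list of links, the sketch below may help. It is only one possible approach and rests on a few assumptions: the names ImgSrcCollector, extract_image_urls and scrape_all are made up for this illustration, parsing is done with the standard library's html.parser instead of BeautifulSoup so that nothing extra has to be installed, and pages are assumed to decode as UTF-8. It also resolves relative src values (such as /img/a.png) against the page URL with urljoin, since many sites do not use absolute image URLs.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class ImgSrcCollector(HTMLParser):
    """Hypothetical helper: collects the raw src value of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:  # skip <img> tags with a missing or empty src
                self.sources.append(src)


def extract_image_urls(html_text, base_url):
    """Return absolute image URLs found in html_text, resolved against base_url."""
    collector = ImgSrcCollector()
    collector.feed(html_text)
    return [urljoin(base_url, src) for src in collector.sources]


def scrape_all(links):
    """Map each link to its list of image URLs; a page that fails maps to []."""
    results = {}
    for link in links:
        try:
            with urlopen(link, timeout=10) as response:
                # Assumption: UTF-8; undecodable bytes are replaced, not fatal.
                html_text = response.read().decode("utf-8", errors="replace")
        except OSError as exc:  # URLError and HTTPError are OSError subclasses
            print(f"skipping {link}: {exc}")
            results[link] = []
            continue
        results[link] = extract_image_urls(html_text, link)
    return results
```

Calling scrape_all(list_of_links) would return a dictionary keyed by link. The same limits as above still apply: images inserted by JavaScript, or lazy-loaded through attributes other than src, will not be found this way.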
