I have a nice URL structure that I can loop through:
https://marco.ccr.buffalo.edu/images?page=0&score=Clear
https://marco.ccr.buffalo.edu/images?page=1&score=Clear
https://marco.ccr.buffalo.edu/images?page=2&score=Clear
...
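I want to loop through each page and download the 21 images (JPEG or PNG). I have looked at several Beautiful Soup examples, but I'm still struggling to get something that downloads multiple images and iterates over the URLs. I think I can use urllib to loop through each URL like this, but I'm not sure where the images get saved. Any help would be appreciated, and thanks in advance!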
for i in range(0,10):
    urllib.urlretrieve('https://marco.ccr.buffalo.edu/images?page=' + str(i) + '&score=Clear')
I am trying to follow this post, without success: How to extract and download all images from a website using beautifulSoup?
Answer 0 (score: 3)
You can use requests:
from bs4 import BeautifulSoup as soup
import requests, contextlib, re, os

@contextlib.contextmanager
def get_images(url: str):
    # Parse one listing page and yield [src, extension] for every thumbnail link
    d = soup(requests.get(url).text, 'html.parser')
    # Keep only anchors that point at /image/<digits>; the extension is read from the img alt text
    yield [[i.find('img')['src'], re.findall(r'(?<=\.)\w+$', i.find('img')['alt'])[0]]
           for i in d.find_all('a') if re.findall(r'/image/\d+', i.get('href', ''))]

n = 3  # end value (number of pages to fetch)
# added for automation purposes, folder can be named anything, as long as the proper name is used when saving below
os.makedirs('MARCO_images', exist_ok=True)
for i in range(n):
    with get_images(f'https://marco.ccr.buffalo.edu/images?page={i}&score=Clear') as links:
        print(links)
        for c, [link, ext] in enumerate(links, 1):
            # note: {i}{c} becomes ambiguous once i >= 10 (page 1, image 11 vs. page 11, image 1)
            with open(f'MARCO_images/MARCO_img_{i}{c}.{ext}', 'wb') as f:
                f.write(requests.get(f'https://marco.ccr.buffalo.edu{link}').content)
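The list comprehension in get_images is dense, so here is a minimal sketch of what its two regular expressions do, run on made-up values rather than the live page. The sample hrefs and the alt text are hypothetical; they only assume that image links look like /image/<digits> and that the alt text ends in a file name, which is what the code above relies on.

```python
import re

# Hypothetical sample values, shaped like what get_images reads from the page.
hrefs = ['/image/12345', '/images?page=1&score=Clear', '/about']
alt = 'sample_plate_well_01.jpg'

# Keep only links that point at a single image page: "/image/" followed by digits.
image_links = [h for h in hrefs if re.findall(r'/image/\d+', h)]

# Take the word characters after the final dot as the file extension.
ext = re.findall(r'(?<=\.)\w+$', alt)[0]

print(image_links)
print(ext)
```

The pagination link is skipped because "/images" is not followed by a slash and digits, so only the per-image link survives the filter.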
Now, checking the contents of the MARCO_images directory yields:
print(os.listdir('/Users/ajax/MARCO_images'))
Output:
['MARCO_img_1.jpg', 'MARCO_img_10.jpg', 'MARCO_img_11.jpg', 'MARCO_img_12.jpg', 'MARCO_img_13.jpg', 'MARCO_img_14.jpg', 'MARCO_img_15.jpg', 'MARCO_img_16.jpg', 'MARCO_img_17.jpg', 'MARCO_img_18.jpg', 'MARCO_img_19.jpg', 'MARCO_img_2.jpg', 'MARCO_img_20.jpg', 'MARCO_img_21.jpg', 'MARCO_img_3.jpg', 'MARCO_img_4.jpg', 'MARCO_img_5.jpg', 'MARCO_img_6.jpg', 'MARCO_img_7.jpg', 'MARCO_img_8.jpg', 'MARCO_img_9.jpg']
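The names in that listing are in string order, so MARCO_img_10.jpg comes before MARCO_img_2.jpg. If numeric order matters, one possible sketch is to sort on the number in front of the extension. The helper name image_number is made up for illustration, and it assumes every file name ends in digits followed by an extension.

```python
import re

# A few names copied from the listing above, in the same string order.
names = ['MARCO_img_1.jpg', 'MARCO_img_10.jpg', 'MARCO_img_2.jpg',
         'MARCO_img_21.jpg', 'MARCO_img_3.jpg']

def image_number(name):
    # Assumes the name ends in "<digits>.<extension>".
    return int(re.search(r'(\d+)\.\w+$', name).group(1))

ordered = sorted(names, key=image_number)
print(ordered)
```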