Scrape all images from a multi-page website?

Time: 2019-09-30 13:56:10

Tags: python beautifulsoup

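I need to scrape all the images from the URL given in the code, but so far I can only do it manually, one page at a time, up to the last page (page 100).

Here is the code that scrapes a single page; I change the page number by hand and rerun it each time.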


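Is there a way to turn the page number into a variable and run a loop until an error occurs, in this case until a 404 page comes up (because no pages are left)?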

from bs4 import BeautifulSoup
import requests as rq

r2 = rq.get("https://www.gettyimages.in/photos/aishwarya-rai?family=editorial&page=1&phrase=aishwarya%20rai&sort=mostpopular")

soup2 = BeautifulSoup(r2.text, "html.parser")

links = []

x = soup2.select('img[src^="https://media.gettyimages.com/photos/"]')  # the <img> tags that hold the photos

for img in x:
    links.append(img['src'])

# the aishwarya_rai/ directory must already exist
for index, img_link in enumerate(links):
    img_data = rq.get(img_link).content
    # the "with" block closes the file itself, so no explicit f.close() is needed
    with open("aishwarya_rai/" + str(index + 2) + '.jpg', 'wb') as f:
        f.write(img_data)

The pages run from 1 to 100.

I need some additional code that makes the page value a variable and loops up to 100.

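1 answer:

Answer 0 (score: 0)

Put a {} placeholder where the page number goes in the URL, then use the format() function to pass in the page variable.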

from bs4 import BeautifulSoup
import requests as rq

# {} is the placeholder that format() fills with the page number
url = "https://www.gettyimages.in/photos/aishwarya-rai?family=editorial&page={}&phrase=aishwarya%20rai&sort=mostpopular"

links = []
for page in range(1, 101):  # pages 1 through 100
    page_url = url.format(page)
    print(page_url)
    r2 = rq.get(page_url)
    soup2 = BeautifulSoup(r2.text, "html.parser")
    x = soup2.select('img[src^="https://media.gettyimages.com/photos/"]')
    for img in x:
        links.append(img['src'])

print(links)
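
The loop above always requests all 100 pages. The question also asks about stopping once a 404 comes back, and a minimal sketch of that idea might look like the following. The names collect_links, fetch_with_requests and extract_with_bs4 are hypothetical helpers introduced here for illustration; they are not part of requests or BeautifulSoup. Whether the site really answers 404 past the last page is an assumption taken from the question, so the sketch also stops at the first page that yields no matching images.

```python
def collect_links(url_template, fetch, extract, max_pages=100):
    """Walk pages 1..max_pages, stopping early at the first 404 or empty page.

    fetch(url) must return (status_code, html) and extract(html) must return
    a list of image URLs. Both are hypothetical hooks, not library APIs.
    """
    links = []
    for page in range(1, max_pages + 1):
        status, html = fetch(url_template.format(page))
        if status == 404:   # no such page: assume we ran past the last one
            break
        found = extract(html)
        if not found:       # assumption: a page without images also means the end
            break
        links.extend(found)
    return links


def fetch_with_requests(url):
    """One possible fetch hook, built on requests.get as in the answer."""
    import requests  # imported lazily so collect_links can be tried without it
    response = requests.get(url, timeout=10)
    return response.status_code, response.text


def extract_with_bs4(html):
    """One possible extract hook, reusing the CSS selector from the answer."""
    from bs4 import BeautifulSoup  # imported lazily, see fetch_with_requests
    soup = BeautifulSoup(html, "html.parser")
    images = soup.select('img[src^="https://media.gettyimages.com/photos/"]')
    return [img["src"] for img in images]
```

With the url template from the answer, calling collect_links(url, fetch_with_requests, extract_with_bs4) should return the same kind of links list, only without requesting pages past the end. Passing the two hooks in as arguments keeps the stopping logic separate from the network and parsing code, so it can be checked with stand-in functions before running it against the real site.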