I'm trying to make a simple program that grabs all of the image addresses on a website and then downloads them into a folder. The problem is that I'm getting a 403 error. I've been trying to fix it for over an hour and could really use some help. Here is my code:
import urllib.request
import requests
from bs4 import BeautifulSoup
url = 'https://www.webtoons.com/en/slice-of-life/how-to-love/ep-100-happy-ending-last-episode/viewer?title_no=472&episode_no=100'
data = requests.get(url)
code = BeautifulSoup(data.text, 'html.parser')
photos = []
def dl_jpg(url, filePath, fileName):
    fullPath = filePath + fileName + '.jpg'
    urllib.request.urlretrieve(url, fullPath)

for img in code.find('div', id='_imageList'):
    pic = str(img)[43:147]
    photos.append(str(pic))

for photo in photos:
    if photo == '':
        photos.remove(photo)

for photo in photos[0:-4]:
    dl_jpg(photo, 'images/', 'img')
Answer 0 (score: 0)
Websites often block requests that arrive without a user agent. I updated your code to send a user agent along with each request. I also chose to use only the requests library and drop urllib: urllib does support custom headers, but you are already using requests, and it is the one I am more familiar with.
I also recommend adding a delay/sleep between requests; somewhere in the 30-45 second range is a good choice. This keeps you from spamming the site and being denied service, and some sites will also block your requests outright if you send them too quickly.
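As a rough illustration, the throttling could look like the sketch below (the URL list here is a hypothetical placeholder, and the 30-45 second window is just the suggestion above, not a documented requirement):

import time
import random

image_urls = ['https://example.com/1.jpg', 'https://example.com/2.jpg']  # hypothetical placeholders

for image_url in image_urls:
    # the actual download (e.g. requests.get) would go here
    time.sleep(random.uniform(30, 45))  # wait 30-45 seconds before the next request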
import requests
from bs4 import BeautifulSoup
user_agent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.37"
url = 'https://www.webtoons.com/en/slice-of-life/how-to-love/ep-100-happy-ending-last-episode/viewer?title_no=472&episode_no=100'
data = requests.get(url, headers={'User-Agent': user_agent})
code = BeautifulSoup(data.text, 'html.parser')
photos = []
def dl_jpg(url, filePath, fileName):
    fullPath = filePath + fileName + '.jpg'
    # make the request with the user agent; if it succeeds, save the result
    image_request = requests.get(url, headers={'User-Agent': user_agent})
    if image_request.status_code == 200:
        with open(fullPath, 'wb') as f:
            f.write(image_request.content)

for img in code.find('div', id='_imageList'):
    pic = str(img)[43:147]
    photos.append(str(pic))

for photo in photos:
    if photo == '':
        photos.remove(photo)

# number the files so each image gets its own name instead of every
# download overwriting the same img.jpg
for i, photo in enumerate(photos[0:-4]):
    dl_jpg(photo, 'images/', 'img' + str(i))
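One last note on the scraping itself: slicing str(img)[43:147] is brittle, since any small change in the markup shifts the character offsets and silently corrupts the URLs (it is also why the empty strings appear and have to be filtered out, and removing items from photos while iterating over it can skip entries). A more robust sketch, reusing the code variable from above and assuming each <img> inside the _imageList div carries its address in a data-url attribute (which is what the Webtoons viewer page appeared to use when I looked; verify this in your browser's inspector):

# pull the address straight off each <img> tag instead of slicing strings;
# the 'data-url' attribute name is an assumption about the page's markup
image_list = code.find('div', id='_imageList')
photos = [img['data-url'] for img in image_list.find_all('img') if img.get('data-url')]

With this, the empty-string cleanup loop (and possibly the photos[0:-4] slice) is no longer needed.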