Python

My code already sets a User-Agent, but the request still fails with the error below. What is the fix? I have added the Request Headers copied from the browser, but it still doesn't help. P.S.: opening the page manually in a browser works fine, but a request sent from code returns 403:
import requests, time, os, urllib.request, socket
from bs4 import BeautifulSoup
def getimg():
    os.system("mkdir Pic")
    headers = {"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
               "Accept-Encoding": "gzip, deflate",
               "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8,ja;q=0.7",
               "Cache-Control": "max-age=0",
               "Connection": "keep-alive",
               "Host": "cc.itbb.men",
               "Upgrade-Insecure-Requests": "1",
               "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"}
    r = requests.get("http://www.testowne.er/htm_data/8/1804/3099535.html", headers=headers)
    r.encoding = 'GBK'
    soup = BeautifulSoup(r.text, "html.parser")
    iname = 0
    for i in soup.find_all("input", type="image"):
        iname += 1
        i = i['src']
        print(i)
        urllib.request.urlretrieve(i, ".\\Pic\\%s" % str(iname))
======================== Output ========================
Traceback (most recent call last):
  File "getimg.py", line 70, in <module>
    getimg()
  File "getimg.py", line 41, in getimg
    urllib.request.urlretrieve(i, ".\\Pic\\%s" % str(iname))
  File "/usr/lib/python3.5/urllib/request.py", line 188, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "/usr/lib/python3.5/urllib/request.py", line 163, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.5/urllib/request.py", line 472, in open
    response = meth(req, response)
  File "/usr/lib/python3.5/urllib/request.py", line 582, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.5/urllib/request.py", line 510, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.5/urllib/request.py", line 444, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.5/urllib/request.py", line 590, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
Answer 0 (score: 0)
As mentioned in this answer:

    This site blocks the user agent used by urllib, so you need to change it in your request. Unfortunately, I don't think urlretrieve supports this directly.
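Since urlretrieve has no headers parameter, the same download can be done by building a urllib.request.Request (which does accept headers) and reading the response manually. A minimal sketch; the function name and the default agent string here are illustrative, not from the original answer:

```python
import urllib.request

def urlretrieve_with_ua(url, path, user_agent="Mozilla/5.0"):
    # Unlike urlretrieve, Request accepts a headers dict, so the
    # User-Agent can be overridden before opening the URL.
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req) as resp, open(path, "wb") as f:
        f.write(resp.read())
```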
However, saving the file with shutil.copyfileobj() did not work for me. I used this instead:
r_img = requests.get(url, stream=True)
if r_img.status_code == 200:
    with open("img.jpg", 'wb') as f:
        f.write(r_img.content)
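Because stream=True leaves the body unconsumed, the write can also go chunk by chunk instead of reading everything at once via r_img.content, which avoids holding a large image in memory. A sketch; the helper names are my own, not part of the answer:

```python
import requests

def write_chunks(chunks, path):
    # Write an iterable of byte chunks to disk, skipping empty
    # keep-alive chunks that iter_content may yield.
    with open(path, "wb") as f:
        for chunk in chunks:
            if chunk:
                f.write(chunk)

def save_image(url, path):
    r = requests.get(url, stream=True)
    if r.status_code != 200:
        return False
    write_chunks(r.iter_content(chunk_size=8192), path)
    return True
```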
Full code:
import os
import requests
from bs4 import BeautifulSoup

def download_images(url: str) -> None:
    os.system('mkdir Pictures')
    r = requests.get(url)
    r.encoding = 'GBK'
    soup = BeautifulSoup(r.text, 'html.parser')
    for i, img in enumerate(soup.find_all('input', type='image')):
        img_url = img['src']
        print(i, img_url)
        r_img = requests.get(img_url, stream=True)
        if r_img.status_code == 200:
            with open(f'Pictures/pic{i}.jpg', 'wb') as f:
                f.write(r_img.content)

download_images('http://cc.itbb.men/htm_data/8/1804/3099535.html')
Note the use of an f-string to format the path; it requires Python 3.6+. If you are on an older version of Python, you can switch to % formatting or .format(). The type hints I added in the function signature require Python 3.5+; they can also be omitted on older versions.
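The three formatting styles mentioned produce identical paths; a quick side-by-side:

```python
i = 3
path_f = f'Pictures/pic{i}.jpg'               # f-string, Python 3.6+
path_percent = 'Pictures/pic%d.jpg' % i       # %-formatting, any version
path_format = 'Pictures/pic{}.jpg'.format(i)  # str.format, Python 2.7+/3.x
assert path_f == path_percent == path_format == 'Pictures/pic3.jpg'
```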