Python crawler hits an error: HTTPError: HTTP Error 403: Forbidden

Time: 2018-04-08 01:30:27

Tags: python-3.x

I added a User-Agent to my Python code, but the run still fails with the error below. What is the solution? I also added the full request header captured from the browser, and it made no difference. P.S.: opening the page manually in a browser works fine, yet the request sent from code gets a 403:

import requests, time, os, urllib.request, socket
from bs4 import BeautifulSoup

def getimg():
    os.system("mkdir Pic")
    headers = {"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
           "Accept-Encoding": "gzip, deflate",
           "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8,ja;q=0.7",
           "Cache-Control": "max-age=0",
           "Connection": "keep-alive",
           "Host": "cc.itbb.men",
           "Upgrade-Insecure-Requests": "1",
           "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"}
    r = requests.get("http://www.testowne.er/htm_data/8/1804/3099535.html", headers=headers)
    r.encoding = 'GBK'
    soup = BeautifulSoup(r.text, "html.parser")
    iname = 0
    for i in soup.find_all("input", type="image"):
        iname += 1
        i = i['src']
        print(i)
        urllib.request.urlretrieve(i, ".\\Pic\\%s" % str(iname))

======================== Output ========================

Traceback (most recent call last):
  File "getimg.py", line 70, in <module>
    getimg()
  File "getimg.py", line 41, in getimg
    urllib.request.urlretrieve(i, ".\\Pic\\%s" % str(iname))
  File "/usr/lib/python3.5/urllib/request.py", line 188, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "/usr/lib/python3.5/urllib/request.py", line 163, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.5/urllib/request.py", line 472, in open
    response = meth(req, response)
  File "/usr/lib/python3.5/urllib/request.py", line 582, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.5/urllib/request.py", line 510, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.5/urllib/request.py", line 444, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.5/urllib/request.py", line 590, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

1 Answer:

Answer 0 (score: 0)

As noted in this answer:

> This site blocks the user agent that urllib uses, so you need to change it in your request. Unfortunately, I don't think urlretrieve supports this directly.
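One workaround (not shown in the answer) is to replace urllib's default User-Agent globally with `install_opener()`; since `urlretrieve()` goes through `urlopen()`, it then inherits the custom headers. A minimal sketch:

```python
import urllib.request

# Replace urllib's default User-Agent ("Python-urllib/3.x"), which some
# sites reject with 403, by installing a global opener.
opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent',
                      'Mozilla/5.0 (Windows NT 10.0; Win64; x64)')]
urllib.request.install_opener(opener)

# From here on, urllib.request.urlretrieve(img_url, 'pic.jpg') would
# send the browser-like User-Agent above instead of the default.
```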

However, saving the file with shutil.copyfileobj() did not work for me. I used this instead:

r_img = requests.get(url, stream=True)
if r_img.status_code == 200:
    with open("img.jpg", 'wb') as f:
        f.write(r_img.content)
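For reference, the `shutil.copyfileobj()` route the answer mentions commonly fails on compressed responses unless decoding is enabled on the raw stream; a hedged sketch (the helper name is illustrative, not from the answer):

```python
import shutil
import requests

def save_image(url: str, path: str) -> bool:
    """Stream an image response to disk; returns True if saved."""
    r_img = requests.get(url, stream=True)
    if r_img.status_code != 200:
        return False
    # Without this flag the raw bytes may still be gzip/deflate encoded,
    # which is a common reason copyfileobj() appears "not to work".
    r_img.raw.decode_content = True
    with open(path, 'wb') as f:
        shutil.copyfileobj(r_img.raw, f)
    return True
```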

Full code:

import os

import requests
from bs4 import BeautifulSoup


def download_images(url: str) -> None:
    os.system('mkdir Pictures')
    r = requests.get(url)
    r.encoding = 'GBK'
    soup = BeautifulSoup(r.text, 'html.parser')

    for i, img in enumerate(soup.find_all('input', type='image')):
        img_url = img['src']
        print(i, img_url)
        r_img = requests.get(img_url, stream=True)
        if r_img.status_code == 200:
            with open(f'Pictures/pic{i}.jpg', 'wb') as f:
                f.write(r_img.content)


download_images('http://cc.itbb.men/htm_data/8/1804/3099535.html')

Note the use of an f-string to format the path. It works on Python 3.6+; on older versions you can switch to % formatting or .format(). The type hints I added to the function signature need Python 3.5+; on older Pythons they can be omitted as well.
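One caveat, tying back to the original question: download_images() above passes no headers, so it relies on the site accepting requests' default User-Agent. If the site blocks that too, the headers dict from the question can be reused on each requests.get call; a sketch (the header value is illustrative):

```python
import requests

# Illustrative minimal headers; the full browser-captured dict from the
# question can be passed the same way.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

# Preparing the request locally confirms the header is attached; in the
# crawler you would call requests.get(url, headers=headers, stream=True).
req = requests.Request(
    'GET', 'http://cc.itbb.men/htm_data/8/1804/3099535.html',
    headers=headers).prepare()
print(req.headers['User-Agent'])
```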