urllib.error.HTTPError: HTTP Error 405: Not Allowed in Python 3.x — how to get past bot detection

Asked: 2017-05-13 07:56:17

Tags: python python-3.x http beautifulsoup urllib

I am completely new to coding. I am trying to create a program that gathers data for me, but when my code opens the URL it fails with HTTPError: HTTP Error 405: Not Allowed. I am using Python and have installed Beautiful Soup, but for some reason I get this error. I tried different headers, but that did not work. The code is below.

from urllib.request import urlopen
from bs4 import BeautifulSoup
import urllib.request
import re
import numpy as np
# Opening the Builder website

html = "http://www.builderonline.com"
req = urllib.request.Request(html,headers={'User-Agent' : "Mozilla/5.0"})
soup = BeautifulSoup(urlopen(req).read(),"html.parser")
print ("end")


Error Messages:
Traceback (most recent call last):
  File "test3.py", line 9, in <module>
    soup = BeautifulSoup(urlopen(req).read(),"html.parser")
  File "/Users/NAGS/anaconda/lib/python3.6/urllib/request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "/Users/NAGS/anaconda/lib/python3.6/urllib/request.py", line 532, in open
    response = meth(req, response)
  File "/Users/NAGS/anaconda/lib/python3.6/urllib/request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "/Users/NAGS/anaconda/lib/python3.6/urllib/request.py", line 570, in error
    return self._call_chain(*args)
  File "/Users/NAGS/anaconda/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "/Users/NAGS/anaconda/lib/python3.6/urllib/request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 405: Not Allowed

2 Answers:

Answer 0 (score: 0)

This page has a Captcha, but it does not block users without JavaScript. Try this code:

import requests
from bs4 import BeautifulSoup
request_page = requests.get('http://www.builderonline.com')
soup = BeautifulSoup(request_page.text, 'lxml')


for i in soup.findAll('li'):
    print(i.text)

If you want to search for or scrape data from the website, I suggest using Selenium with PhantomJS (a headless browser).
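As a rough sketch of that Selenium suggestion (not part of the original answer), the snippet below assumes PhantomJS is installed and on your PATH; note that PhantomJS support has since been deprecated in newer Selenium releases, so a headless Chrome or Firefox driver is the usual substitute:

# Minimal Selenium sketch (assumes PhantomJS or another driver is installed)
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.PhantomJS()  # deprecated in newer Selenium; Chrome/Firefox drivers also work
driver.get('http://www.builderonline.com')

# The browser executes JavaScript, so page_source reflects the rendered page
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

for li in soup.find_all('li'):
    print(li.text)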

As for error 405:

  1. Open an IP socket connection to that IP address.
  2. Write an HTTP data stream through that socket.
  3. Receive an HTTP data stream back from the web server in response. This data stream contains status codes whose values are determined by the HTTP protocol.

    This error occurs in the final step above, when the client receives an HTTP status code that it recognizes as "405" (see the short sketch below).
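To make that concrete, here is a short sketch (not part of the original answer) showing how the status code returned in the last step can be inspected with requests:

import requests

# The server's reply carries the HTTP status code described above
response = requests.get('http://www.builderonline.com',
                        headers={'User-Agent': 'Mozilla/5.0'})
print(response.status_code)  # e.g. 200 on success, 405 if the request is rejected
print(response.reason)       # e.g. "OK" or "Not Allowed"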

    Awesome tutorial HERE

Answer 1 (score: 0)

Using requests and BeautifulSoup, I was able to scrape the list tags quite easily:

>>> import requests
>>> from pprint import pprint #for readability
>>> from bs4 import BeautifulSoup as BS
>>> headers = {"user-agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:53.0) Gecko/20100101 Firefox/53.0"}
>>> response = requests.get('http://www.builderonline.com', headers=headers)
>>> soup = BS(response.text, 'lxml')

And the output (using pprint):

>>> for i in soup.find_all('li'):
...     pprint(i)
... 
<li><a class="fa fa-facebook" data-cms-ai="0" href="https://www.facebook.com/buildermagazine" target="_blank" title="Find Us on Facebook"><span class="alt">Facebook</span></a></li>
<li><a class="fa fa-twitter" data-cms-ai="0" href="https://twitter.com/builderonline" target="_blank" title="Find Us on Twitter"><span class="alt">Twitter</span></a></li>
<li><a class="fa fa-linkedin" data-cms-ai="0" href="https://www.linkedin.com/groupInvitation?gid=1296527&amp;fromEmail=&amp;ut=06MzMrJ-ugcmo1" target="_blank" title="Find Us on LinkedIn"><span class="alt">LinkedIn</span></a></li>
<li><a class="fa fa-pinterest" data-cms-ai="0" href="https://www.pinterest.com/builderonline/" target="_blank" title="Find Us on Pinterest"><span class="alt">Pinterest</span></a></li>
<li class="" id="design">
<a data-cms-ai="0" href="http://www.builderonline.com/Design/">Design</a>
<div class="dropdown-menu">
<ul>
<li>
<a data-cms-ai="0" href="http://www.builderonline.com/design/kitchens/">Kitchens</a>
</li>
<li>
<a data-cms-ai="0" href="http://www.builderonline.com/design/baths/">Baths</a>
</li>
<li>
<a data-cms-ai="0" href="http://www.builderonline.com/project-gallery/">Project Gallery</a>
</li>
<li>
<a data-cms-ai="0" href="http://www.builderonline.com/design/projects/">Projects</a>
</li>
<li>
<a data-cms-ai="0" href="http://www.builderonline.com/design/plans/">Plans</a>
</li>
<li>
<a data-cms-ai="0" href="http://www.builderonline.com/design/details/">Details</a>
</li>

It may have to do with the way your headers are formatted. Perhaps the site is set up to check for malformed or incomplete headers. Try going to https://httpbin.org/headers in your browser, then use the user-agent data listed there in your script.
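As a rough sketch of that tip (not part of the original answer), you can let httpbin echo back the headers your request sends and then reuse a full browser-style User-Agent string; the exact User-Agent value below is only an example:

import requests

# httpbin echoes back the headers it received, so you can see exactly
# how your request appears to the server
echo = requests.get('https://httpbin.org/headers')
print(echo.json()['headers'])

# Reuse a complete browser-style User-Agent (example value) for the real request
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:53.0) '
                         'Gecko/20100101 Firefox/53.0'}
response = requests.get('http://www.builderonline.com', headers=headers)
print(response.status_code)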