Question

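I am trying to scrape some public information (details about Apple apps) from a website.

The site requires a login for actions such as "search app/developer". Many sites offer similar data, but I think this particular one provides the most complete and detailed information for each app.

As a valid user, I am able to perform these tasks.

However, when I try to access the information from Python code, I get a 403 error when sending a POST request and a 504 error when sending a GET request.

I have tried using

A real User-Agent header
The fake-useragent package
FancyOpener (like this; reported as deprecated in Python 3.4)
HttpAuthM .. (like this, for authentication; still did not work)

My guess is that the site strongly resists automated access, but the detailed information is very useful. Is there any way to solve this?

Thanks!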

I tried these headers:

ua = {#'User-Agent':'Mozilla/5.0 (compatible; Googlebot/2.1; +Googlebot - Webmaster Tools Help)',  
      'User-Agent':'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36',  
      'Connection':'Keep-Alive',  
      'Accept-Language':'zh-CN,zh;q=0.8',  
      'Accept-Encoding':'gzip,deflate,sdch',  
      'Accept':'*/*',  
      'Accept-Charset':'GBK,utf-8;q=0.7,*;q=0.3',  
      'Cache-Control':'max-age=0'  
      }

503 Error

403 Error

------------------------------------------------ HTTPError   
Traceback (most recent call last) <ipython-input-43-421b27c5194e> in <module>()
     68 data= data.encode('utf-8')
     69 request = urq.Request(url, data, headers = ua)
---> 70 response = urq.urlopen(request)
     71 the_page = response.read()
     72 print(the_page)

c:\python34\lib\urllib\request.py in urlopen(url, data, timeout, cafile, capath, cadefault, context)
    159     else:
    160         opener = _opener
--> 161     return opener.open(url, data, timeout)
    162 
    163 def install_opener(opener):

c:\python34\lib\urllib\request.py in open(self, fullurl, data, timeout)
    468         for processor in self.process_response.get(protocol, []):
    469             meth = getattr(processor, meth_name)
--> 470             response = meth(req, response)
    471 
    472         return response

c:\python34\lib\urllib\request.py in http_response(self, request, response)
    578         if not (200 <= code < 300):
    579             response = self.parent.error(
--> 580                 'http', request, response, code, msg, hdrs)
    581 
    582         return response

c:\python34\lib\urllib\request.py in error(self, proto, *args)
    506         if http_err:
    507             args = (dict, 'default', 'http_error_default') + orig_args
--> 508             return self._call_chain(*args)
    509 
    510 # XXX probably also want an abstract factory that knows when it makes

c:\python34\lib\urllib\request.py in _call_chain(self, chain, kind, meth_name, *args)
    440         for handler in handlers:
    441             func = getattr(handler, meth_name)
--> 442             result = func(*args)
    443             if result is not None:
    444                 return result

c:\python34\lib\urllib\request.py in http_error_default(self, req, fp, code, msg, hdrs)
    586 class HTTPDefaultErrorHandler(BaseHandler):
    587     def http_error_default(self, req, fp, code, msg, hdrs):
--> 588         raise HTTPError(req.full_url, code, msg, hdrs, fp)
    589 
    590 class HTTPRedirectHandler(BaseHandler):

HTTPError: HTTP Error 403: FORBIDDEN
----------------------------------------------

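Below are the results I got with "Advanced REST Client", a Chrome extension for sending requests. Note that the page that does not require login returns 200, while the page behind the login returns 403. See the links in the comments below.

[Access succeeded] [3]

[Access failed] [4]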

Answer 1

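The plain Python requests package is enough; you should not need any other package.

I am sure your problem is simply that you are not fully imitating the browser's request. In Google Chrome and Mozilla Firefox you can see the request headers in the developer panel.

Be sure to always use the same session object.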

Do not forget to set the right headers:

User-Agent
Accept
Accept-Language
Accept-Encoding
Referer (the URL of the previous GET request)
Connection (keep-alive)
Host (abc.website.com)

session.headers = {
    'User-Agent' : 'real one',
    ...
}
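
As a rough sketch of the two points above (one session, browser-like headers), assuming the requests package is installed. The header values, the form field and the abc.website.com URLs are placeholders, not the real site's; copy the actual values from the browser's developer panel. The request is only prepared here, not sent, so the merged headers can be inspected offline.

```python
import requests

# One Session object reused for every request, so cookies set by the
# site (for example after login) are sent back on later requests.
session = requests.Session()

# Placeholder values (assumptions): replace with what your browser sends.
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.8",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "keep-alive",
})

# Referer changes from request to request, so pass it per request.
request = requests.Request(
    "POST",
    "https://abc.website.com/search",   # hypothetical URL
    data={"q": "example"},              # hypothetical form field
    headers={"Referer": "https://abc.website.com/login"},
)

# prepare_request merges the session headers with the per-request ones.
prepared = session.prepare_request(request)

# Inspect what would be sent, without touching the network.
for name, value in prepared.headers.items():
    print(f"{name}: {value}")
```

The Host header is not set here because it is filled in from the URL when the request is actually sent.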

Be sure to follow redirects:

session.get(url, allow_redirects=True, timeout=x_secs)

In POST requests, be sure to send all required fields, possibly including some hidden ones (such as a CSRF token).
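
One possible way to collect the hidden fields is to parse the login page before posting. The sketch below uses only the standard library; the sample HTML and the field names csrf_token, username and password are made up for illustration, so check the real form in the browser's developer panel.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            self.fields[attributes["name"]] = attributes.get("value") or ""


def hidden_fields(html):
    """Return a dict of the hidden form fields found in an HTML string."""
    parser = HiddenInputParser()
    parser.feed(html)
    return parser.fields


def build_login_payload(login_page_html, username, password):
    """Merge the hidden fields with the visible credential fields."""
    payload = hidden_fields(login_page_html)
    # "username" and "password" are assumed field names, not the site's.
    payload.update({"username": username, "password": password})
    return payload


# Made-up login form standing in for the page returned by a GET request.
sample_html = """
<form action="/login" method="post">
  <input type="hidden" name="csrf_token" value="abc123">
  <input type="text" name="username">
  <input type="password" name="password">
</form>
"""

payload = build_login_payload(sample_html, "me", "secret")
print(payload)
# {'csrf_token': 'abc123', 'username': 'me', 'password': 'secret'}
```

With a real session, the page would first be fetched with session.get(login_url), and the resulting payload sent with session.post(login_url, data=payload, allow_redirects=True, timeout=x_secs), where login_url is whatever URL the site's form actually posts to.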

Python scraping: 403 and 503 errors
