My goal is to build a web crawler and host it on GAE. However, when I try even a very basic implementation, I get the following error:
Traceback (most recent call last):
File "C:\Program Files (x86)\Google\google_appengine\lib\webapp2-2.5.2\webapp2.py", line 1535, in __call__
rv = self.handle_exception(request, response, e)
File "C:\Program Files (x86)\Google\google_appengine\lib\webapp2-2.5.2\webapp2.py", line 1529, in __call__
rv = self.router.dispatch(request, response)
File "C:\Program Files (x86)\Google\google_appengine\lib\webapp2-2.5.2\webapp2.py", line 1278, in default_dispatcher
return route.handler_adapter(request, response)
File "C:\Program Files (x86)\Google\google_appengine\lib\webapp2-2.5.2\webapp2.py", line 1102, in __call__
return handler.dispatch()
File "C:\Program Files (x86)\Google\google_appengine\lib\webapp2-2.5.2\webapp2.py", line 572, in dispatch
return self.handle_exception(e, self.app.debug)
File "C:\Program Files (x86)\Google\google_appengine\lib\webapp2-2.5.2\webapp2.py", line 570, in dispatch
return method(*args, **kwargs)
File "E:\WSE_NewsClusteriing\crawler\crawler.py", line 14, in get
source_code = requests.get(url)
File "libs\requests\api.py", line 67, in get
return request('get', url, params=params, **kwargs)
File "libs\requests\api.py", line 53, in request
return session.request(method=method, url=url, **kwargs)
File "libs\requests\sessions.py", line 468, in request
resp = self.send(prep, **send_kwargs)
File "libs\requests\sessions.py", line 576, in send
r = adapter.send(request, **kwargs)
File "libs\requests\adapters.py", line 376, in send
timeout=timeout
File "libs\requests\packages\urllib3\connectionpool.py", line 559, in urlopen
body=body, headers=headers)
File "libs\requests\packages\urllib3\connectionpool.py", line 390, in _make_request
assert_header_parsing(httplib_response.msg)
File "libs\requests\packages\urllib3\util\response.py", line 49, in assert_header_parsing
type(headers)))
TypeError: expected httplib.Message, got <type 'instance'>.
My main.py is as follows:
import sys
sys.path.insert(0, 'libs')

import webapp2
import requests
from bs4 import BeautifulSoup


class MainPage(webapp2.RequestHandler):
    def get(self):
        self.response.headers['Content-Type'] = 'text/plain'
        url = 'http://www.bbc.com/news/world'
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for link in soup.findAll('a', {'class': 'title-link'}):
            href = 'http://www.bbc.com' + link.get('href')
            self.response.write(href)


app = webapp2.WSGIApplication([
    ('/', MainPage),
], debug=True)
The thing is, the crawler works fine as a standalone Python application.
Can someone help me figure out what is going wrong here? Does the requests module cause some compatibility issues with GAE?
Answer 0 (score: 7)
I would recommend not using the requests library on App Engine for the time being, since it is not officially supported, so you are likely to run into compatibility issues. According to the URL Fetch Python API documentation, the supported libraries are urllib, urllib2, httplib, and using urlfetch directly. Some functionality of the requests library may also rely on the urllib3 library, which is not yet supported either. For simple examples of urllib2 and urlfetch requests, feel free to consult that documentation. If these libraries don't work for you, feel free to point that out in your question.
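For reference, here is a minimal sketch of the handler from the question rewritten against urlfetch (an illustrative adaptation, not the asker's code, assuming the first-generation Python 2.7 runtime where google.appengine.api is available):

import webapp2
from bs4 import BeautifulSoup
from google.appengine.api import urlfetch


class MainPage(webapp2.RequestHandler):
    def get(self):
        self.response.headers['Content-Type'] = 'text/plain'
        # urlfetch is the natively supported way to make outbound
        # HTTP requests on this runtime, so no vendored HTTP library
        # is needed for the fetch itself.
        result = urlfetch.fetch('http://www.bbc.com/news/world')
        if result.status_code == 200:
            soup = BeautifulSoup(result.content)
            for link in soup.findAll('a', {'class': 'title-link'}):
                self.response.write('http://www.bbc.com' + link.get('href'))


app = webapp2.WSGIApplication([('/', MainPage)], debug=True)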
Answer 1 (score: 3)
This question is almost two years old, but I just stumbled onto the same issue on App Engine. For the benefit of anyone who runs into a similar problem: the docs describe how to issue HTTP(S) requests with the requests library by way of requests_toolbelt:
import requests
import requests_toolbelt.adapters.appengine
# Use the App Engine Requests adapter. This makes sure that Requests uses
# URLFetch.
requests_toolbelt.adapters.appengine.monkeypatch()
Reference: https://cloud.google.com/appengine/docs/standard/python/issue-requests
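Note that requests_toolbelt, like requests itself, must be vendored into the application (for example into the same libs directory the question uses) before it can be imported, and monkeypatch() has to run before the first request is issued. A minimal sketch of how the patch might slot into the question's main.py (the libs layout is carried over from the question; the rest is illustrative):

import sys
sys.path.insert(0, 'libs')

import requests
import requests_toolbelt.adapters.appengine

# Patch requests to route all traffic through URLFetch; this must
# happen before any requests.get/post call is made.
requests_toolbelt.adapters.appengine.monkeypatch()

# The original handler code can then use requests unchanged:
source_code = requests.get('http://www.bbc.com/news/world')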