This is my first time using a proxy with Scrapy. When I test my code an error occurs, but I cannot find where my code goes wrong.
PyCharm reports the error: Error downloading <GET https://movie.douban.com/subject/25754848/reviews> and TypeError: to_bytes must receive a unicode, str or bytes object, got NoneType.
Here is the middleware code:
import requests
import lxml
from bs4 import BeautifulSoup
from scrapy import signals
class ProxyMiddleware(object):
    def process_request(self, request, spider):
        url = 'http://127.0.0.1:5000/get'
        r = requests.get(url)
        request.meta['proxy'] = BeautifulSoup(r.text, "lxml").get_text()
Note: I have a proxy pool. While it is running, each request to "http://127.0.0.1:5000/get" returns a different proxy IP and port, such as "113.122.136.41:808".
Below are the error and the traceback:
2017-04-16 10:20:06 [scrapy.core.scraper] ERROR: Error downloading <GET
https://movie.douban.com/subject/25754848/reviews>
Traceback (most recent call last):
File "C:\Users\empra\AppData\Local\Programs\Python\Python36\lib\site-packages\twisted\internet\defer.py", line 1299, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File "C:\Users\empra\AppData\Local\Programs\Python\Python36\lib\site-packages\twisted\python\failure.py", line 393, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "C:\Users\empra\AppData\Local\Programs\Python\Python36\lib\site-packages\scrapy\core\downloader\middleware.py", line 43, in process_request
defer.returnValue((yield download_func(request=request,spider=spider)))
File "C:\Users\empra\AppData\Local\Programs\Python\Python36\lib\site-packages\scrapy\utils\defer.py", line 45, in mustbe_deferred
result = f(*args, **kw)
File "C:\Users\empra\AppData\Local\Programs\Python\Python36\lib\site-packages\scrapy\core\downloader\handlers\__init__.py", line 65, in download_request
return handler.download_request(request, spider)
File "C:\Users\empra\AppData\Local\Programs\Python\Python36\lib\site-packages\scrapy\core\downloader\handlers\http11.py", line 61, in download_request
return agent.download_request(request)
File "C:\Users\empra\AppData\Local\Programs\Python\Python36\lib\site-packages\scrapy\core\downloader\handlers\http11.py", line 260, in download_request
agent = self._get_agent(request, timeout)
File "C:\Users\empra\AppData\Local\Programs\Python\Python36\lib\site-packages\scrapy\core\downloader\handlers\http11.py", line 240, in _get_agent
_, _, proxyHost, proxyPort, proxyParams = _parse(proxy)
File "C:\Users\empra\AppData\Local\Programs\Python\Python36\lib\site-packages\scrapy\core\downloader\webclient.py", line 37, in _parse
return _parsed_url_args(parsed)
File "C:\Users\empra\AppData\Local\Programs\Python\Python36\lib\site-packages\scrapy\core\downloader\webclient.py", line 20, in _parsed_url_args
host = b(parsed.hostname)
File "C:\Users\empra\AppData\Local\Programs\Python\Python36\lib\site-packages\scrapy\core\downloader\webclient.py", line 17, in <lambda>
b = lambda s: to_bytes(s, encoding='ascii')
File "C:\Users\empra\AppData\Local\Programs\Python\Python36\lib\site-packages\scrapy\utils\python.py", line 117, in to_bytes
'object, got %s' % type(text).__name__)
TypeError: to_bytes must receive a unicode, str or bytes object, got NoneType
Answer 0 (score: 0)
I can tell you how to convert the stream from the URL to unicode.
import requests
from bs4 import BeautifulSoup
from scrapy import signals

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        url = 'http://127.0.0.1:5000/get'
        r = requests.get(url)
        # r.content is raw bytes; decode it explicitly to a unicode string
        data = r.content.decode("utf-8")
        request.meta['proxy'] = BeautifulSoup(data, "lxml").get_text()
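Decoding alone may not fix the TypeError, though. Judging from the traceback, a likely root cause is that the pool returns a bare "host:port" string: `request.meta['proxy']` must include a scheme such as `http://`, otherwise Scrapy's URL parser finds no hostname and passes None into `to_bytes`. A minimal sketch of the idea (the `normalize_proxy` helper is hypothetical, not part of the original code):

```python
from urllib.parse import urlparse

def normalize_proxy(raw):
    # Prefix a bare "host:port" proxy string with a scheme so that
    # Scrapy's downloader can parse a hostname out of it.
    raw = raw.strip()
    if raw and not raw.startswith(("http://", "https://")):
        raw = "http://" + raw
    return raw

# Without a scheme, urlparse() finds no hostname -- that None is exactly
# what Scrapy's to_bytes() receives in the traceback above.
print(urlparse("113.122.136.41:808").hostname)                   # None
print(urlparse(normalize_proxy("113.122.136.41:808")).hostname)  # 113.122.136.41
```

So in the middleware, setting `request.meta['proxy'] = normalize_proxy(proxy_text)` should give Scrapy a proxy URL it can actually parse.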