I'm a beginner and I want to use a proxy middleware. But my DEBUG output shows:
2018-09-10 21:15:57 [scrapy.core.engine] INFO: Spider opened
2018-09-10 21:15:57 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-09-10 21:15:57 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-09-10 21:16:57 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-09-10 21:17:57 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-09-10 21:18:05 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.zhipin.com/robots.txt> (failed 1 times): TCP connection timed out: 110: Connection timed out.
2018-09-10 21:18:57 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
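It always crawls 0 pages and keeps retrying. My proxy is free and does not require authentication. However, when I remove the proxy middleware and instead use
yield scrapy.Request(url='https://www.example.com/', callback=self.parse_first, meta=my_proxy)
it works fine. So something seems to be wrong with my settings.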
Settings
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 135,
    'ip_proxy.middlewares.CustomProxyMiddleware': 125,
}
CustomProxyMiddleware
class CustomProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = my_proxy
Spider
import scrapy

class ipSpider(scrapy.Spider):
    name = "test"

    def start_requests(self):
        yield scrapy.Request(url="https://www.example.com", callback=self.parse_first)
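Answer 0 (score: 1)
It sounds like you haven't changed the default ROBOTSTXT_OBEY setting. Your log shows the request for https://www.zhipin.com/robots.txt timing out, and Scrapy fetches that file first when the setting is enabled. Set ROBOTSTXT_OBEY = False and try again. It should work.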
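As a rough sketch of the change, the relevant part of settings.py might look like the following. The middleware path and the priorities 125 and 135 are taken from the question; the comments reflect my understanding of how Scrapy orders downloader middlewares, and the rest of your settings file is assumed to stay unchanged.

# settings.py (sketch, not a complete settings file)

# Do not fetch robots.txt before crawling, so the spider is not
# blocked waiting on a robots.txt request that times out.
ROBOTSTXT_OBEY = False

# Lower numbers run earlier in process_request, so the custom
# middleware (125) sets request.meta['proxy'] before the built-in
# HttpProxyMiddleware (135) handles the request.
DOWNLOADER_MIDDLEWARES = {
    'ip_proxy.middlewares.CustomProxyMiddleware': 125,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 135,
}

If the crawl still reports 0 pages after this change, the proxy itself may be unreachable, which would be worth checking separately.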