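How do I stop redirects in my rules when using CrawlSpider?

Time: 2017-07-05 01:26:42

Tags: python web-crawler

These are my rules. This is my first time using CrawlSpider, so how can I stop the redirects (302) for the requests generated by my rules?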

rules = (
    Rule(LinkExtractor(allow=r'zhaopin/.*'), follow=True),
    Rule(LinkExtractor(allow=r'gongsi/j.*/.html'), follow=True),
    Rule(LinkExtractor(allow=r'jobs/.*.html'), callback='parse_job', follow=True),
)

This is the debug output. As you can see, every request is redirected to the login page:

2017-07-05 09:20:24 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://passport.lagou.com/login/login.html?msg=validation&uStatus=2&clientIp=60.211.222.66> from <GET https://www.lagou.com/zhaopin/CTO/>
2017-07-05 09:20:25 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://passport.lagou.com/login/login.html?msg=validation&uStatus=2&clientIp=60.211.222.66> from <GET https://www.lagou.com/zhaopin/jiagoushi/>
2017-07-05 09:20:25 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://passport.lagou.com/login/login.html?msg=validation&uStatus=2&clientIp=60.211.222.66> from <GET https://www.lagou.com/zhaopin/C%23/>
2017-07-05 09:20:25 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://passport.lagou.com/login/login.html?msg=validation&uStatus=2&clientIp=60.211.222.66> from <GET https://www.lagou.com/zhaopin/youxizhizuoren/>
2017-07-05 09:20:25 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://passport.lagou.com/login/login.html?msg=validation&uStatus=2&clientIp=60.211.222.66> from <GET https://www.lagou.com/zhaopin/chanpinbujingli/>
2017-07-05 09:20:25 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://passport.lagou.com/login/login.html?msg=validation&uStatus=2&clientIp=60.211.222.66> from <GET https://www.lagou.com/zhaopin/wuxianchanpinshejishi/>
2017-07-05 09:20:25 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://passport.lagou.com/login/login.html?msg=validation&uStatus=2&clientIp=60.211.222.66> from <GET https://www.lagou.com/zhaopin/wangyechanpinshejishi/>
2017-07-05 09:20:25 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://passport.lagou.com/login/login.html?msg=validation&uStatus=2&clientIp=60.211.222.66> from <GET https://www.lagou.com/zhaopin/chanpinshixisheng/>
2017-07-05 09:20:25 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://passport.lagou.com/login/login.html?msg=validation&uStatus=2&clientIp=60.211.222.66> from <GET https://www.lagou.com/zhaopin/dbaqita/>
2017-07-05 09:20:25 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://passport.lagou.com/login/login.html?msg=validation&uStatus=2&clientIp=60.211.222.66> from <GET https://www.lagou.com/zhaopin/guanggaoshejishi/>
2017-07-05 09:20:26 [scrapy.crawler] INFO: Received SIG_UNBLOCK, shutting down gracefully. Send again to force 

1 Answer:

Answer 0: (score: 0)

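The 302 responses all point to the login page, which suggests the site is treating the crawler as an unauthenticated client. Add a Cookie and a User-Agent to your settings, like this: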
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36',
    'Cookie': 'user_trace_token=201708...',
    'Referer': 'https://www.lagou.com'
}
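
If the goal is literally to keep the rule-generated requests from following the 302, one option is a per-rule `process_request` hook that sets Scrapy's `dont_redirect` meta key. The following is only a minimal sketch: the function name `disable_redirect` is made up for illustration, and the `response=None` default is an assumption meant to tolerate both the older one-argument and the newer two-argument hook call.

```python
def disable_redirect(request, response=None):
    """Hypothetical Rule hook: tell the redirect middleware to leave this request alone.

    `response` is optional so the hook can be called with either
    (request) or (request, response), depending on the Scrapy version.
    """
    # 'dont_redirect' is the request.meta key checked by Scrapy's RedirectMiddleware.
    request.meta['dont_redirect'] = True
    # By default the unfollowed 302 is then filtered out as a non-2xx response
    # before it reaches any callback.
    return request


# Wiring it into the rules from the question (inside the CrawlSpider subclass):
# rules = (
#     Rule(LinkExtractor(allow=r'zhaopin/.*'), follow=True,
#          process_request=disable_redirect),
#     Rule(LinkExtractor(allow=r'gongsi/j.*/.html'), follow=True,
#          process_request=disable_redirect),
#     Rule(LinkExtractor(allow=r'jobs/.*.html'), callback='parse_job', follow=True,
#          process_request=disable_redirect),
# )
```

This only suppresses the redirect; it does not make the site return the real page, so the headers above are still what addresses the cause. Setting `REDIRECT_ENABLED = False` in the settings would switch redirects off for the whole project instead of per rule. If the hand-set `Cookie` header seems to have no effect, it may be worth checking `COOKIES_ENABLED`, since Scrapy's cookies middleware manages that header itself when it is enabled.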