I am new to the Scrapy framework and am trying to crawl a website with a Spider. On my site, when I navigate from page 1 -> page 2, an intermediate page with a Meta Refresh redirects me to page 2. However, I keep getting 302 errors on that redirect. I have tried the following things (a rough sketch of my settings follows the list):
Setting the User-Agent to "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36"
Setting DOWNLOAD_DELAY = 15
Setting REDIRECT_MAX_METAREFRESH_DELAY = 100
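Roughly, this is what those attempts look like in my settings.py (a minimal sketch, not the full file):

```python
# settings.py -- rough sketch of the settings I tried
USER_AGENT = ("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36")

DOWNLOAD_DELAY = 15                     # wait 15 seconds between requests
REDIRECT_MAX_METAREFRESH_DELAY = 100    # follow meta refreshes that delay up to 100 seconds
```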
However, none of this worked. Since I am new to Scrapy, I would really appreciate it if someone could point me in the right direction to resolve this issue.
Adding the log as requested:
2017-02-17 21:02:43 [scrapy.core.engine] INFO: Spider opened
2017-02-17 21:02:43 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-02-17 21:02:43 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-02-17 21:02:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://xxxx.website.com/search-cases.htm> (referer: None)
2017-02-17 21:02:44 [quotes] INFO: http://www.xxxx.website2.com/eservices/home.page
2017-02-17 21:02:46 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (meta refresh) to <GET http://www.xxxx.website2.com/eservices/;jsessionid=D724B51CE14CFB9A06AB5A1C2BADC7BA?x=pQSPWmZkMdOltOc6jey5Pzm2g*gqQrsim1X*85dDjm1K*VwIS*xP-fdT9lRZBHHOA41kK1OaAco2dC8Un6N*uJtWnK50mGmm> from <GET http://www.courtrecords.alaska.gov/eservices/home.page>
2017-02-17 21:02:55 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://www.xxxx.website2.com/eservices/home.page> from <GET http://www.xxxx.website2.com/eservices/;jsessionid=D724B51CE14CFB9A06AB5A1C2BADC7BA?x=pQSPWmZkMdOltOc6jey5Pzm2g*gqQrsim1X*85dDjm1K*VwIS*xP-fdT9lRZBHHOA41kK1OaAco2dC8Un6N*uJtWnK50mGmm>
2017-02-17 21:02:55 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET http://www.xxxx.website2.com/eservices/home.page> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2017-02-17 21:02:55 [scrapy.core.engine] INFO: Closing spider (finished)
**Please note that I have changed the website names**
Answer 0 (score: 0):
As @eLRuLL mentioned in the comments, the problem was that the duplicate request was being filtered out. After setting dont_filter=True on the redirected request, the spider started crawling correctly.
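For anyone hitting the same issue, here is a minimal sketch of what that change might look like in the spider. The URLs are the placeholder ones from the log, and the callback names are only illustrative, not the asker's actual code:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://xxxx.website.com/search-cases.htm"]

    def parse(self, response):
        # The meta refresh / 302 chain leads back to home.page, a URL the spider
        # has already seen, so the dupefilter silently drops it unless we opt out.
        yield scrapy.Request(
            "http://www.xxxx.website2.com/eservices/home.page",
            callback=self.parse_results,
            dont_filter=True,  # do not let the duplicate-request filter drop this request
        )

    def parse_results(self, response):
        # parse page 2 here
        pass
```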