So I'm trying to scrape articles from a news site with an infinite-scroll-type layout, where the following happens: example.com has the first page of articles, example.com/page/2/ has the second page, example.com/page/3/ the third, and so on. The URL changes as you scroll down. To account for this, I want to scrape the first x articles, and did the following:
start_urls = ['http://example.com/']
for x in range(1, x):
    new_url = 'http://www.example.com/page/' + str(x) + '/'
    start_urls.append(new_url)
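(As an aside, a slightly cleaner sketch of the same loop that avoids reusing x as both the page count and the loop variable; num_pages is a hypothetical stand-in for however many pages are wanted:)

num_pages = 20  # hypothetical: how many pages to scrape
start_urls = ['http://example.com/'] + [
    # page 1 is the bare example.com/ URL, so numbered pages start at 2
    'http://www.example.com/page/' + str(page) + '/'
    for page in range(2, num_pages + 1)
]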
This seems to work fine for the first 9 pages, and I get output like the following:
Redirecting (301) to <GET http://example.com/page/4/> from <GET http://www.example.com/page/4/>
Redirecting (301) to <GET http://example.com/page/5/> from <GET http://www.example.com/page/5/>
Redirecting (301) to <GET http://example.com/page/6/> from <GET http://www.example.com/page/6/>
Redirecting (301) to <GET http://example.com/page/7/> from <GET http://www.example.com/page/7/>
2017-09-08 17:36:23 [scrapy.extensions.logstats] INFO: Crawled 3 pages (at 3 pages/min), scraped 0 items (at 0 items/min)
Redirecting (301) to <GET http://example.com/page/8/> from <GET http://www.example.com/page/8/>
Redirecting (301) to <GET http://example.com/page/9/> from <GET http://www.example.com/page/9/>
Redirecting (301) to <GET http://www.example.com/> from <GET http://www.example.com/page/10/>
Redirecting (301) to <GET http://www.example.com/> from <GET http://www.example.com/page/11/>
Redirecting (301) to <GET http://www.example.com/> from <GET http://www.example.com/page/12/>
Redirecting (301) to <GET http://www.example.com/> from <GET http://www.example.com/page/13/>
From page 10 onward, requests for pages like example.com/page/10/ are redirected to example.com/ instead of going to the original link example.com/page/10/. What could be causing this behavior?
I've looked into a few options like dont_redirect, but I don't understand what is going on. What is the reason for this redirect behavior, especially since there is no redirect when you enter a link like example.com/page/10 directly into the browser?
Any help would be greatly appreciated, thanks!
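(For context, the dont_redirect option I looked at is a per-request meta key handled by Scrapy's RedirectMiddleware. A minimal sketch of how I understand it would be used, not something I've confirmed fixes this:)

from scrapy import Request

# Inside a spider callback: skip RedirectMiddleware for this one request and
# let the 301 response reach the callback instead of being filtered out.
yield Request(
    'http://www.example.com/page/10/',
    meta={'dont_redirect': True, 'handle_httpstatus_list': [301]},
    callback=self.parse_article,
)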
[Edit]
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class spider(CrawlSpider):
    start_urls = ['http://example.com/']
    for x in range(startPage, endPage):
        new_url = 'http://www.example.com/page/' + str(x) + '/'
        start_urls.append(new_url)

    custom_settings = {'DEPTH_PRIORITY': 1, 'DEPTH_LIMIT': 1}

    rules = (
        # callback belongs on the Rule, not on the LinkExtractor
        Rule(LinkExtractor(allow=('some regex here',),
                           deny=(r'example\.com/page/.*', 'some other regex',)),
             callback='parse_article'),
    )

    def parse_article(self, response):
        # some parsing work here
        yield item
Is it because I included example\.com/page/.* in the LinkExtractor's deny? Shouldn't that only apply to links that are not in start_urls?
Answer 0 (score: 1)
It looks like this site uses some kind of security check on the User-Agent request header.
So you just need to add a common User-Agent in your settings.py file:
USER_AGENT = 'Mozilla/5.0'
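If you'd rather keep the setting scoped to a single spider, the same value can also be set through the spider's custom_settings (an equivalent sketch):

class spider(CrawlSpider):
    # Equivalent to the settings.py entry, but applied to this spider only.
    custom_settings = {'USER_AGENT': 'Mozilla/5.0'}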
Also, a spider doesn't necessarily need the start_urls attribute to get its starting sites; you can use the start_requests method instead, so replace all the start_urls creation with:
from scrapy import Request

class spider(CrawlSpider):
    ...
    def start_requests(self):
        for x in range(1, 20):
            yield Request('http://www.example.com/page/' + str(x) + '/')
    ...
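If you don't want to touch global settings at all, the User-Agent can also be passed per request, since Request accepts a headers dict (a sketch of the same start_requests with the header inlined):

def start_requests(self):
    for x in range(1, 20):
        # Set the header on each request directly instead of in settings.py.
        yield Request(
            'http://www.example.com/page/' + str(x) + '/',
            headers={'User-Agent': 'Mozilla/5.0'},
        )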