Confusing redirect behavior with Scrapy?

Asked: 2017-09-23 00:25:46

Tags: python web-scraping scrapy

So I'm trying to scrape articles from a news site that has an infinite-scroll type layout, where the following happens:

example.com has the first page of articles

example.com/page/2/ has the second page

example.com/page/3/ has the third page

and so on; the URL changes as you scroll down. To account for this, I wanted to scrape the first x pages of articles, so I did the following:

start_urls = ['http://example.com/']
for x in range(1,x):
    new_url  = 'http://www.example.com/page/' + str(x) +'/'
    start_urls.append(new_url)
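As an aside, here is a small sketch of my own (not from the original question), based on the www → non-www redirects visible in the log further down: building the URLs directly on the non-www host that the site redirects to would avoid that first round of 301s. The variable x (number of pages wanted) is assumed to be defined, as in the loop above.

# Hypothetical variant: use the non-www host the site 301s to,
# so the per-page requests themselves are not redirected.
start_urls = ['http://example.com/'] + [
    'http://example.com/page/%d/' % page for page in range(2, x + 1)
]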

That loop seems to work fine for the first 9 pages, and I get output like the following:

Redirecting (301) to <GET http://example.com/page/4/> from <GET http://www.example.com/page/4/>
Redirecting (301) to <GET http://example.com/page/5/> from <GET http://www.example.com/page/5/>
Redirecting (301) to <GET http://example.com/page/6/> from <GET http://www.example.com/page/6/>
Redirecting (301) to <GET http://example.com/page/7/> from <GET http://www.example.com/page/7/>
2017-09-08 17:36:23 [scrapy.extensions.logstats] INFO: Crawled 3 pages (at 3 pages/min), scraped 0 items (at 0 items/min)
Redirecting (301) to <GET http://example.com/page/8/> from <GET http://www.example.com/page/8/>
Redirecting (301) to <GET http://example.com/page/9/> from <GET http://www.example.com/page/9/>
Redirecting (301) to <GET http://www.example.com/> from <GET http://www.example.com/page/10/>
Redirecting (301) to <GET http://www.example.com/> from <GET http://www.example.com/page/11/>
Redirecting (301) to <GET http://www.example.com/> from <GET http://www.example.com/page/12/>
Redirecting (301) to <GET http://www.example.com/> from <GET http://www.example.com/page/13/>

Starting from page 10, it redirects pages like example.com/page/10/ to example.com/ instead of keeping the original link example.com/page/10/. What could be causing this behavior?

I have looked into several options such as dont_redirect, but I don't understand what is going on. What is the reason for this redirect behavior? Especially since there is no redirect when you enter a link like example.com/page/10 directly in the browser?
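For reference, a minimal sketch of how dont_redirect can be applied per request (my own illustration, assuming the stock RedirectMiddleware; the URL is a placeholder): setting it in the request's meta makes Scrapy hand back the 301 response itself instead of following it, which makes it easy to inspect where the site wants to send you.

from scrapy import Request
from scrapy.spiders import Spider

class RedirectCheckSpider(Spider):  # hypothetical helper spider, just for inspection
    name = 'redirect_check'

    def start_requests(self):
        # Ask Scrapy not to follow the redirect and to pass the 301/302
        # response through to the callback so its Location header is visible.
        yield Request(
            'http://www.example.com/page/10/',
            meta={'dont_redirect': True, 'handle_httpstatus_list': [301, 302]},
            callback=self.inspect_redirect,
        )

    def inspect_redirect(self, response):
        self.logger.info('status=%s location=%s',
                         response.status, response.headers.get('Location'))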

Any help would be appreciated, thanks!

[Edit]

class spider(CrawlSpider):
    start_urls = ['http://example.com/']

    for x in range(startPage, endPage):
        new_url = 'http://www.example.com/page/' + str(x) + '/'
        start_urls.append(new_url)

    custom_settings = {'DEPTH_PRIORITY': 1, 'DEPTH_LIMIT': 1}

    rules = (
        Rule(LinkExtractor(allow=('some regex here',),
                           deny=('example\.com/page/.*', 'some other regex',)),
             callback='parse_article'),
    )

    def parse_article(self, response):
        # some parsing work here
        yield item

Is it because I included example\.com/page/.* in the LinkExtractor's deny? Shouldn't that only apply to links that are not start_urls?
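One way to sanity-check the deny pattern (a sketch of my own, not from the original post; the URL and regex are placeholders) is to run the same LinkExtractor inside scrapy shell and look at which links survive:

# Inside: scrapy shell http://example.com/page/2/
from scrapy.linkextractors import LinkExtractor

le = LinkExtractor(deny=(r'example\.com/page/.*',))
for link in le.extract_links(response):
    print(link.url)   # links matching the deny pattern have already been filtered out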

1 Answer:

Answer 0 (score: 1)

It looks like this site uses some kind of protection that checks the User-Agent in the request headers.

So you just need to add a common User-Agent in your settings.py file:

USER_AGENT = 'Mozilla/5.0'
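If you prefer to keep the setting scoped to a single spider (a sketch of my own, not part of the original answer), the same value can go into the spider's custom_settings instead of settings.py:

class spider(CrawlSpider):
    # Equivalent to the settings.py entry above, but applied only to this spider.
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0',
    }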

Also, the spider doesn't strictly need a start_urls attribute to get its starting sites; you can use the start_requests method instead, so replace all the creation of start_urls with:

class spider(CrawlSpider):

    ...

    def start_requests(self):
        for x in range(1,20):
            yield Request('http://www.example.com/page/' + str(x) +'/')

    ...
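A small follow-up sketch (my own combination of the two suggestions above, not part of the original answer): the User-Agent can also be set per request, and using the non-www host avoids the extra 301s seen in the question's log. The spider name is hypothetical.

from scrapy import Request
from scrapy.spiders import CrawlSpider

class ExampleSpider(CrawlSpider):
    name = 'example'

    def start_requests(self):
        for x in range(1, 20):
            yield Request(
                'http://example.com/page/' + str(x) + '/',  # non-www host, as the site 301s to it
                headers={'User-Agent': 'Mozilla/5.0'},      # same idea as the USER_AGENT setting
            )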