Confusing redirect behavior with Scrapy?

Asked: 2017-09-23 00:25:46

Tags: python web-scraping scrapy

So I'm trying to scrape articles from a news site that has an infinite-scroll type layout, where the following happens:

example.com has the first page of articles

example.com/page/2/ has the second page

example.com/page/3/ has the third page

and so on; the URL changes as you scroll down. To account for this, I wanted to scrape the first x pages of articles, so I did the following:

start_urls = ['http://example.com/']
for x in range(1,x):
    new_url  = 'http://www.example.com/page/' + str(x) +'/'
    start_urls.append(new_url)
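As an aside, here is a small sketch of my own (not from the original question), based on the www → non-www redirects visible in the log further down: building the URLs directly on the non-www host that the site redirects to would avoid that first round of 301s. The variable x (number of pages wanted) is assumed to be defined, as in the loop above.

# Hypothetical variant: use the non-www host the site 301s to,
# so the per-page requests themselves are not redirected.
start_urls = ['http://example.com/'] + [
    'http://example.com/page/%d/' % page for page in range(2, x + 1)
]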

That loop seems to work fine for the first 9 pages, and I get output like the following:

Redirecting (301) to <GET http://example.com/page/4/> from <GET http://www.example.com/page/4/>
Redirecting (301) to <GET http://example.com/page/5/> from <GET http://www.example.com/page/5/>
Redirecting (301) to <GET http://example.com/page/6/> from <GET http://www.example.com/page/6/>
Redirecting (301) to <GET http://example.com/page/7/> from <GET http://www.example.com/page/7/>
2017-09-08 17:36:23 [scrapy.extensions.logstats] INFO: Crawled 3 pages (at 3 pages/min), scraped 0 items (at 0 items/min)
Redirecting (301) to <GET http://example.com/page/8/> from <GET http://www.example.com/page/8/>
Redirecting (301) to <GET http://example.com/page/9/> from <GET http://www.example.com/page/9/>
Redirecting (301) to <GET http://www.example.com/> from <GET http://www.example.com/page/10/>
Redirecting (301) to <GET http://www.example.com/> from <GET http://www.example.com/page/11/>
Redirecting (301) to <GET http://www.example.com/> from <GET http://www.example.com/page/12/>
Redirecting (301) to <GET http://www.example.com/> from <GET http://www.example.com/page/13/>

Starting from page 10, it redirects pages like example.com/page/10/ to example.com/ instead of keeping the original link example.com/page/10/. What could be causing this behavior?

I have looked into several options such as dont_redirect, but I don't understand what is going on. What is the reason for this redirect behavior? Especially since there is no redirect when you enter a link like example.com/page/10 directly in the browser?
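For reference, a minimal sketch of how dont_redirect can be applied per request (my own illustration, assuming the stock RedirectMiddleware; the URL is a placeholder): setting it in the request's meta makes Scrapy hand back the 301 response itself instead of following it, which makes it easy to inspect where the site wants to send you.

from scrapy import Request
from scrapy.spiders import Spider

class RedirectCheckSpider(Spider):  # hypothetical helper spider, just for inspection
    name = 'redirect_check'

    def start_requests(self):
        # Ask Scrapy not to follow the redirect and to pass the 301/302
        # response through to the callback so its Location header is visible.
        yield Request(
            'http://www.example.com/page/10/',
            meta={'dont_redirect': True, 'handle_httpstatus_list': [301, 302]},
            callback=self.inspect_redirect,
        )

    def inspect_redirect(self, response):
        self.logger.info('status=%s location=%s',
                         response.status, response.headers.get('Location'))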

Any help would be appreciated, thanks!

[Edit]

class spider(CrawlSpider):
    start_urls = ['http://example.com/']

    for x in range(startPage, endPage):
        new_url = 'http://www.example.com/page/' + str(x) + '/'
        start_urls.append(new_url)

    custom_settings = {'DEPTH_PRIORITY': 1, 'DEPTH_LIMIT': 1}

    rules = (
        Rule(LinkExtractor(allow=('some regex here',),
                           deny=('example\.com/page/.*', 'some other regex',)),
             callback='parse_article'),
    )

    def parse_article(self, response):
        # some parsing work here
        yield item

Is it because I included example\.com/page/.* in the LinkExtractor's deny? Shouldn't that only apply to links that are not start_urls?
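One way to sanity-check the deny pattern (a sketch of my own, not from the original post; the URL and regex are placeholders) is to run the same LinkExtractor inside scrapy shell and look at which links survive:

# Inside: scrapy shell http://example.com/page/2/
from scrapy.linkextractors import LinkExtractor

le = LinkExtractor(deny=(r'example\.com/page/.*',))
for link in le.extract_links(response):
    print(link.url)   # links matching the deny pattern have already been filtered out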

1 Answer:

Answer 0 (score: 1)

It looks like this site uses some kind of protection that checks the User-Agent in the request headers.

So you just need to add a common User-Agent in your settings.py file:

USER_AGENT = 'Mozilla/5.0'
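If you prefer to keep the setting scoped to a single spider (a sketch of my own, not part of the original answer), the same value can go into the spider's custom_settings instead of settings.py:

class spider(CrawlSpider):
    # Equivalent to the settings.py entry above, but applied only to this spider.
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0',
    }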

Also, the spider doesn't strictly need a start_urls attribute to get its starting sites; you can use the start_requests method instead, so replace all the creation of start_urls with:

class spider(CrawlSpider):

    ...

    def start_requests(self):
        for x in range(1,20):
            yield Request('http://www.example.com/page/' + str(x) +'/')

    ...
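A small follow-up sketch (my own combination of the two suggestions above, not part of the original answer): the User-Agent can also be set per request, and using the non-www host avoids the extra 301s seen in the question's log. The spider name is hypothetical.

from scrapy import Request
from scrapy.spiders import CrawlSpider

class ExampleSpider(CrawlSpider):
    name = 'example'

    def start_requests(self):
        for x in range(1, 20):
            yield Request(
                'http://example.com/page/' + str(x) + '/',  # non-www host, as the site 301s to it
                headers={'User-Agent': 'Mozilla/5.0'},      # same idea as the USER_AGENT setting
            )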