Scrapy - terminate the crawl when following an infinite website

时间:2018-10-28 16:25:24

标签: python web-scraping scrapy scrapy-spider

Suppose I have a web page like this:

counter.php     

<?php
if (isset($_GET['count'])) {
    $count = intval($_GET['count']);
    $previous = $count - 1;
    $next = $count + 1;
    ?>
    <a href="?count=<?php echo $previous; ?>">< Previous</a>

    Current: <?php echo $count; ?>

    <a href="?count=<?php echo $next; ?>">Next ></a>
    <?php
}
?>

This is an "infinite" website, because you can keep clicking Next to go to the next page (the counter just keeps incrementing), or Previous, and so on.

However, if I crawl this page and follow the links with Scrapy like the spider below, Scrapy will never stop crawling.

Example spider:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

urls = []

class TestSpider(CrawlSpider):
    name = 'test'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/counter?count=1']

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        urls.append(response.url)

What mechanism can I use to detect that I am stuck on an infinite website and need to break out of the crawl?

2 answers:

Answer 0 (score: 0)

If there are no items on the page, or there is no "next page" button, the pagination has ended. You can always drive the pagination yourself:

import logging

from scrapy import Request
from scrapy.spiders import CrawlSpider


class TestSpider(CrawlSpider):
    name = 'test'
    allowed_domains = ['example.com']

    def start_requests(self):
        page = 1
        yield Request("http://example.com/counter?page=%s" % (page), meta={"page": page}, callback=self.parse_item)

    def parse_item(self, response):

        # METHOD 1: check whether any items are available on this page
        items = response.css("li.items")

        if items:
            # Now go to the next page
            page = int(response.meta['page']) + 1
            yield Request("http://example.com/counter?page=%s" % (page), meta={"page": page}, callback=self.parse_item)
        else:
            logging.info("%s was last page" % response.url)

        # METHOD 2: check whether this page has a NEXT PAGE button; most websites have one
        nextPage = response.css("a.nextpage")

        if nextPage:
            # Now go to the next page
            page = int(response.meta['page']) + 1
            yield Request("http://example.com/counter?page=%s" % (page), meta={"page": page}, callback=self.parse_item)
        else:
            logging.info("%s was last page" % response.url)
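
If you just want a hard safety net on top of these content-based checks, Scrapy also ships with the built-in DEPTH_LIMIT and CLOSESPIDER_PAGECOUNT settings, which cap how far a crawl can go regardless of what the pages contain. A minimal sketch (the limit values are only illustrative, and the domain is the same placeholder used in the question):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class CappedSpider(CrawlSpider):
    name = 'capped'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/counter?count=1']

    # Hard limits: do not follow links more than 50 hops deep,
    # and close the spider after 200 crawled pages.
    custom_settings = {
        'DEPTH_LIMIT': 50,
        'CLOSESPIDER_PAGECOUNT': 200,
    }

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {'url': response.url}

This keeps the Rule-based spider from the question unchanged and simply bounds how long it can keep following the counter links.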

Answer 1 (score: -1)

You don't have to use a Rule in Scrapy. You can parse page by page and iterate over all the items on each page, or simply collect the item links from each page. For example:

from scrapy import Request
from scrapy.spiders import CrawlSpider

urls = []

class TestSpider(CrawlSpider):
    name = 'test'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/counter?count=1']

    def parse(self, response):
        links = response.xpath('//a[@class="item"]/@href').extract()
        for link in links:
            yield Request(link, self.parse_item)
            # or just record the item's url here, so you don't have to yield to parse_item
            # urls.append(link)

        url, pg = response.url.split("=")  # you can break the infinite loop here
        if int(pg) <= 10:  # we only loop up to page 10
            yield Request(url + "=" + str(int(pg) + 1), self.parse)

    def parse_item(self, response):
        urls.append(response.url)
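
Hard-coding the page limit works, but you may want to tune it per run. A small variation of the same idea (the max_pages spider argument is hypothetical and not part of the original answer) passes the cap on the command line via Scrapy's standard spider arguments:

from scrapy import Request
from scrapy.spiders import Spider


class CountedSpider(Spider):
    name = 'counted'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/counter?count=1']

    def __init__(self, max_pages=10, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # max_pages is a hypothetical spider argument,
        # e.g. scrapy crawl counted -a max_pages=25
        self.max_pages = int(max_pages)

    def parse(self, response):
        yield {'url': response.url}

        url, pg = response.url.split("=")
        if int(pg) < self.max_pages:  # stop once the configured page cap is reached
            yield Request(url + "=" + str(int(pg) + 1), self.parse)

Running scrapy crawl counted -a max_pages=25 then bounds the crawl at 25 counter pages without touching the code.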