Scrapy SgmlLinkExtractor: stop after 10 pages

Date: 2014-06-16 21:26:01

Tags: python web-crawler scrapy

Currently my rule for the SgmlLinkExtractor looks like this:

    rules = (
        Rule(SgmlLinkExtractor(allow=("/boards/recentnews.aspx",),
                               restrict_xpaths=('//*[text()[contains(.,"Next")]]',)),
             callback="parse_start_url", follow=True),
    )

I want Scrapy to stop crawling once it reaches page 10, so I figured it would be something like this:

    rules = (
        Rule(SgmlLinkExtractor(allow=("/boards/recentnews.aspx?page=\d*",),
                               restrict_xpaths=('//*[text()[contains(.,"Next")]]',)),
             callback="parse_start_url", follow=True),
    )

But I don't know how to make this rule apply only to pages 1 through 10.
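For what it's worth, the 1-10 restriction can also be written into the allow pattern itself, so that links past page 10 are never extracted in the first place. A minimal sketch, assuming the page parameter always appears in the links being matched; note that allow takes a regular expression, so the literal . and ? need escaping:

    rules = (
        Rule(SgmlLinkExtractor(
                 # ([1-9]|10)(?!\d) matches page=1 through page=10 only;
                 # the lookahead keeps page=100 and the like from matching
                 allow=(r'/boards/recentnews\.aspx\?page=([1-9]|10)(?!\d)',),
                 restrict_xpaths=('//*[text()[contains(.,"Next")]]',)),
             callback="parse_start_url", follow=True),
    )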

1 Answer:

Answer 0 (score: 0)

You can do this in the callback:

    import re
    from scrapy.exceptions import CloseSpider

    def parse_start_url(self, response):
        # extract the page number from the response URL
        page_number = int(re.search(r'page=(\d+)', response.url).group(1))
        if page_number > 10:
            raise CloseSpider('page number limit exceeded')
        # scrape the data
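CloseSpider shuts the crawl down gracefully, so requests already scheduled may still be processed before the spider actually stops; a few pages past the limit can slip through. For context, here is a sketch of how the callback fits into a complete spider (the class name, spider name, and start URL are illustrative, and the imports match the Scrapy 0.x releases current in 2014):

    import re

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.exceptions import CloseSpider

    class RecentNewsSpider(CrawlSpider):
        name = "recentnews"
        start_urls = ["http://example.com/boards/recentnews.aspx?page=1"]

        rules = (
            Rule(SgmlLinkExtractor(allow=(r'/boards/recentnews\.aspx\?page=\d+',),
                                   restrict_xpaths=('//*[text()[contains(.,"Next")]]',)),
                 callback="parse_start_url", follow=True),
        )

        def parse_start_url(self, response):
            # stop the crawl once past page 10
            page_number = int(re.search(r'page=(\d+)', response.url).group(1))
            if page_number > 10:
                raise CloseSpider('page number limit exceeded')
            # scrape the data here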

And here is the regular expression at work:

    >>> import re
    >>> url = "http://example.com/boards/recentnews.aspx?page=9"
    >>> re.search(r'page=(\d+)', url).group(1)
    '9'
    >>> url = "http://example.com/boards/recentnews.aspx?page=10"
    >>> re.search(r'page=(\d+)', url).group(1)
    '10'
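One caveat: re.search returns None when the URL has no page parameter at all (the bare /boards/recentnews.aspx landing page, for instance), and calling .group(1) on None raises an AttributeError. A defensive variant of the lookup, assuming a missing parameter means page 1:

    match = re.search(r'page=(\d+)', response.url)
    page_number = int(match.group(1)) if match else 1  # no parameter: assume page 1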