SgmlLinkExtractor stops at page 3

Time: 2014-04-15 22:09:52

Tags: python-2.7 scrapy

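This is a follow-up to my question about using SgmlLinkExtractor.

I am trying to follow the pages from here. It appears to work and pulls all the required items, but the crawler stops after parsing page 3 without any error message.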

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
# AltaItem comes from the project's items module (import not shown in the original)

class AltaSpider(CrawlSpider):
    name = "altaCra"
    allowed_domains = ["alta.ge"]
    start_urls = [
        "http://alta.ge/index.php?dispatch=categories.view&category_id=297"
    ]

    # Follow pagination links whose URL matches this pattern
    rules = (
        Rule(SgmlLinkExtractor(allow=(r"index.php\?dispatch=categories.view&category_id=297&page=\d*",)),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        sel = Selector(response)
        # One table row per product
        titles = sel.xpath('//table[@class="table products cl"]//tr[@valign="middle"]')
        items = []
        for t in titles:
            item = AltaItem()
            item["brand"] = t.xpath('td[@class="compact"]/div[@class="cl-prod-name"]/a/text()').re(r'^([\w\-]+)')
            item["model"] = t.xpath('td[@class="compact"]/div[@class="cl-prod-name"]/a/text()').re(r'\s+(.*)$')
            item["price"] = t.xpath('td[@class="cl-price-cont"]//span[4]/text()').extract()

            items.append(item)

        return items

2 answers:

Answer 0 (score: 3)

On the first page, the links to the next pages look like this:

http://alta.ge/index.php?dispatch=categories.view&category_id=297&page=2

whereas on the following pages the links look like this:

http://alta.ge/index.php?category_id=297&dispatch=categories.view&page=8

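The query parameters appear in a different order, so the allow pattern no longer matches. I therefore suggest using a different rule that targets links with the name="pagination" attribute, which all the next-page links share: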

rules = (
    Rule(SgmlLinkExtractor(restrict_xpaths=('//a[@name="pagination"]',)),
         callback="parse_items", follow=True),
)
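The mismatch between the two link formats can be checked outside Scrapy with the standard `re` module. This is only a minimal sketch: the pattern and the two URLs are copied from the question and this answer, the helper name `is_allowed` is made up for illustration, and it assumes the `allow` pattern is applied as a regex search against the URL.

```python
import re

# Pattern from the question's Rule, written as a raw string
ALLOW_PATTERN = r"index.php\?dispatch=categories.view&category_id=297&page=\d*"

# Link format seen on the first page
first_page_link = "http://alta.ge/index.php?dispatch=categories.view&category_id=297&page=2"
# Link format seen on later pages (query parameters reordered)
later_page_link = "http://alta.ge/index.php?category_id=297&dispatch=categories.view&page=8"


def is_allowed(url):
    """Hypothetical helper: True if the allow pattern is found anywhere in the URL."""
    return re.search(ALLOW_PATTERN, url) is not None


print(is_allowed(first_page_link))
print(is_allowed(later_page_link))
```

The first link matches and the second does not, because the pattern expects `dispatch=` immediately after `index.php?`.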

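Answer 1 (score: 2)

The following rule (together with an added parse_start_url) will crawl the 8 available pages without using Ajax. Let me see whether I can get it to work so that it goes through all 20 pages.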

start_urls = [
    "http://alta.ge/index.php?dispatch=categories.view&category_id=297"
]

rules = (
    Rule(SgmlLinkExtractor(allow=(r"index.php\?dispatch=categories.view&category_id=297&page=\d*",)),
         callback="parse_items", follow=True),
)

def parse_start_url(self, response):
    # CrawlSpider calls this for responses to start_urls; reuse the item parser for page 1
    return self.parse_items(response)
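The `allow` pattern above still depends on the order of the query parameters. As a hedged alternative sketch, not something either answer proposes, a link could be tested by parsing its query string instead, which ignores parameter order. The function name `is_category_page` and its `category_id` default are assumptions made for illustration.

```python
try:
    from urllib.parse import urlparse, parse_qs  # Python 3
except ImportError:
    from urlparse import urlparse, parse_qs  # Python 2.7


def is_category_page(url, category_id="297"):
    """Hypothetical check: True for a paginated category URL, in any parameter order."""
    query = parse_qs(urlparse(url).query)  # maps each parameter to a list of values
    return (
        query.get("dispatch") == ["categories.view"]
        and query.get("category_id") == [category_id]
        and "page" in query
    )


print(is_category_page("http://alta.ge/index.php?dispatch=categories.view&category_id=297&page=2"))
print(is_category_page("http://alta.ge/index.php?category_id=297&dispatch=categories.view&page=8"))
```

One possible way to use such a predicate would be inside a `process_links` function passed to the `Rule`, filtering the extracted links by their `url`. That wiring is not shown here and has not been tested against this site.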