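This continues my earlier question about using SgmlLinkExtractor.
I am trying to follow the pages from here. It appears to work and pulls all the required items, but the crawler stops after parsing page 3 without any error message.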
class AltaSpider(CrawlSpider):
    name = "altaCra"
    allowed_domains = ["alta.ge"]
    start_urls = [
        "http://alta.ge/index.php?dispatch=categories.view&category_id=297"
    ]
    rules = (
        Rule(SgmlLinkExtractor(allow=(r"index.php\?dispatch=categories.view&category_id=297&page=\d*",)),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        sel = Selector(response)
        titles = sel.xpath('//table[@class="table products cl"]//tr[@valign="middle"]')
        items = []
        for t in titles:
            item = AltaItem()
            item["brand"] = t.xpath('td[@class="compact"]/div[@class="cl-prod-name"]/a/text()').re(r'^([\w\-]+)')
            item["model"] = t.xpath('td[@class="compact"]/div[@class="cl-prod-name"]/a/text()').re(r'\s+(.*)$')
            item["price"] = t.xpath('td[@class="cl-price-cont"]//span[4]/text()').extract()
            items.append(item)
        return items
Answer 0 (score: 3)
On the first page, the link to the next page looks like this:
http://alta.ge/index.php?dispatch=categories.view&category_id=297&page=2
whereas on the following pages the next-page links look like this:
http://alta.ge/index.php?category_id=297&dispatch=categories.view&page=8
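The query parameters appear in a different order in the second form (category_id comes before dispatch), so the allow pattern from the question, which expects dispatch first, stops matching. I therefore suggest a different rule that targets links with the name="pagination" attribute, which all the next-page links share: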
rules = (
    Rule(SgmlLinkExtractor(restrict_xpaths=('//a[@name="pagination"]',)),
         callback="parse_items", follow=True),
)
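The mismatch described above can be checked outside Scrapy with plain regular expressions. The sketch below assumes the link extractor searches for the allow pattern anywhere in the URL. The helper names matches_allow and is_category_page are hypothetical and not part of Scrapy. The second helper shows one possible order-independent check based on parsed query parameters, in case a URL-based filter is preferred over the XPath restriction.

```python
import re
from urllib.parse import urlparse, parse_qs

# Pattern copied from the question's `allow` argument.
ALLOW = re.compile(r"index.php\?dispatch=categories.view&category_id=297&page=\d*")

# The two link forms quoted in the answer.
first_page_link = "http://alta.ge/index.php?dispatch=categories.view&category_id=297&page=2"
later_page_link = "http://alta.ge/index.php?category_id=297&dispatch=categories.view&page=8"


def matches_allow(url):
    """Return True if the question's pattern is found anywhere in the URL."""
    return ALLOW.search(url) is not None


def is_category_page(url):
    """Hypothetical check that ignores the order of the query parameters."""
    query = parse_qs(urlparse(url).query)
    return (
        query.get("dispatch") == ["categories.view"]
        and query.get("category_id") == ["297"]
        and "page" in query
    )


if __name__ == "__main__":
    for link in (first_page_link, later_page_link):
        print(link, matches_allow(link), is_category_page(link))
```

Only the first link form satisfies the original pattern, while both forms pass the parameter-based check. This is consistent with the crawl stopping as soon as the site switches to the second link form.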
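Answer 1 (score: 2)
The following rule, together with an added parse_start_url method, crawls the 8 pages that are available without using Ajax. CrawlSpider applies rule callbacks only to links extracted from responses, so overriding parse_start_url is what gets the start page itself parsed. Let me see whether I can get it to work so that it goes through all 20 pages.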
start_urls = [
    "http://alta.ge/index.php?dispatch=categories.view&category_id=297"
]
rules = (
    Rule(SgmlLinkExtractor(allow=(r"index.php\?dispatch=categories.view&category_id=297&page=\d*",)),
         callback="parse_items", follow=True),
)

def parse_start_url(self, response):
    return self.parse_items(response)