I'm new to Scrapy. I want to scrape the products on this page. My code only scrapes the first page and stops after about 15 products, and I would also like it to follow the link to the next page. Any help?
Here is my spider class:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor

from allyouneed.items import AllyouneedItem  # adjust to your project's items module


class AllyouneedSpider(CrawlSpider):
    name = "allyouneed"
    allowed_domains = ["de.allyouneed.com"]
    start_urls = ['http://de.allyouneed.com/de/sportschuhe-/8799665488014/', ]

    rules = (
        Rule(LxmlLinkExtractor(allow=(), restrict_xpaths='//*[@class="itm fst jf-lDiv"]//a[@href]'),
             callback='parse_obj', process_links="parse_filter"),
        Rule(LxmlLinkExtractor(restrict_xpaths='//*[@id="M62_searchhit"]//a[@href]')),
    )

    def parse_filter(self, links):
        for link in links:
            if self.allowed_domains[0] not in link.url:
                pass  # print link.url
        # print links
        return links

    def parse_obj(self, response):
        item = AllyouneedItem()
        sel = scrapy.Selector(response)
        item['url'] = []
        url = response.selector.xpath('//*[@id="M62_searchhit"]//a[@href]').extract()
        ti = response.selector.xpath('//span[@itemprop="name"]/text()').extract()
        dec = response.selector.xpath('//div[@class="m-desc m-desc-t"]//text()').extract()
        cat = response.selector.xpath('//span[@itemprop="title"]/text()').extract()
        if ti:
            item['title'] = ti
            item['url'] = response.url
            item['category'] = cat
            item['decription'] = dec
            print item
            yield item
Answer (score: 1)
Use restrict_xpaths='//a[@class="nxtPge"]' to find the link to the next page. There is no need to extract every link on the page, only that one. You also don't need to filter the URLs yourself, because Scrapy already does that against allowed_domains by default.
Rule(LinkExtractor(allow=(), restrict_xpaths='//a[@class="nxtPge"]'), callback='parse_obj'),
You can also simplify parse_obj() by dropping the explicit Selector and calling response.xpath() directly after initializing the item:
item = AllyouneedItem()
url = response.xpath( etc...
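
Putting those suggestions together, here is a minimal sketch of what the revised spider could look like. The product-link and next-page XPaths and the item fields come from the question above; the items module path and the follow=True flag on the pagination rule are my assumptions, so adjust them to your project.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from allyouneed.items import AllyouneedItem  # hypothetical path, adjust to your project


class AllyouneedSpider(CrawlSpider):
    name = "allyouneed"
    allowed_domains = ["de.allyouneed.com"]
    start_urls = ['http://de.allyouneed.com/de/sportschuhe-/8799665488014/']

    rules = (
        # Product detail pages: parse each one into an item.
        Rule(LinkExtractor(restrict_xpaths='//*[@class="itm fst jf-lDiv"]//a[@href]'),
             callback='parse_obj'),
        # Pagination: follow the "next page" link so the crawl moves past page one.
        # follow=True is an assumption; it keeps the crawler walking the listing pages.
        Rule(LinkExtractor(restrict_xpaths='//a[@class="nxtPge"]'), follow=True),
    )

    def parse_obj(self, response):
        item = AllyouneedItem()
        # response.xpath() works directly, no separate Selector is needed.
        ti = response.xpath('//span[@itemprop="name"]/text()').extract()
        if ti:
            item['title'] = ti
            item['url'] = response.url
            item['category'] = response.xpath('//span[@itemprop="title"]/text()').extract()
            item['decription'] = response.xpath('//div[@class="m-desc m-desc-t"]//text()').extract()
            yield item

With the pagination rule set to follow, CrawlSpider keeps requesting the listing pages and applies the product rule to each of them, which is what lets the crawl continue past the first 15 or so products.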