How do I go to the next page using CrawlSpider?

Asked: 2017-06-12 06:59:35

Tags: python scrapy

I am using a Scrapy CrawlSpider to crawl http://www.sephora.com/lipstick. How should I set up the LinkExtractor so that it scrapes all the pages?

import json

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from ..items import ProductItem  # or wherever your ProductItem lives


class SephoraSpider(CrawlSpider):
    name = "sephora"
    # custom_settings = {"IMAGES_STORE": '../images/sephora'}

    # allowed_domains = ["sephora.com/"]

    start_urls = [
        'http://www.sephora.com/lipstick'
        # 'http://www.sephora.com/eyeshadow',
        # 'http://www.sephora.com/foundation-makeup'
    ]

    rules = (Rule(LinkExtractor(
                # restrict_xpaths='//*[@id="main"]/div[4]/div[5]/div[1]/div/div[2]/div[3]/div[7]',
                allow=('sephora.com/')
                ),
            callback='parse_items',
            follow=True),)

    def parse(self, response):
        # category = ['lipstick']
        # for cat in category:
        full_url = 'http://www.sephora.com/rest/products/?currentPage=1&categoryName=lipstick&include_categories=true&include_refinements=true'
        # Note: this callback is a string, not a callable; scrapy.Request
        # (unlike Rule) requires a callable, which is what triggers the
        # AssertionError shown below.
        my_request = scrapy.Request(full_url, callback='parse_items')
        my_request.meta['page'] = {'to_replace': "currentPage=1"}
        yield my_request

    def parse_items(self, response):
        # cat_json = response.xpath('//script[@id="searchResult"]/text()').extract_first()
        # all_url_data = json.loads(cat_json.encode('utf-8'))
        # if "products" not in all_url_data:
        #     return
        # products = all_url_data['products']
        products = json.loads(response.body)['products']
        print(products)
        for each_product in products:
            link = each_product['product_url']
            full_url = "http://www.sephora.com" + link
            name = each_product["display_name"]
            if 'list_price' not in each_product['derived_sku']:
                price = each_product['derived_sku']['list_price_max']
            else:
                price = each_product['derived_sku']["list_price"]
            brand = each_product["brand_name"]
            item = ProductItem(
                name=name,
                price=price,
                brand=brand,
                full_url=full_url,
                category=response.url[23:])
            yield item

        to_replace = response.meta['page']['to_replace']
        # Note: 'category' is never put into meta['page'] in parse(), so
        # this lookup would raise a KeyError once the callback is fixed.
        cat = response.meta['page']['category']
        next_number = int(to_replace.replace("currentPage=", "")) + 1
        next_link = response.url.replace(
            to_replace, "currentPage=" + str(next_number))
        print(next_link)
        my_request = scrapy.Request(
            next_link,
            self.parse_items)
        my_request.meta['page'] = {
            "to_replace": "currentPage=" + str(next_number),
        }
        yield my_request

This is the error I am getting now:

2017-06-12 12:43:30 [scrapy] DEBUG: Crawled (200) <GET http://www.sephora.com/rest/products/?currentPage=1&categoryName=lipstick&include_categories=true&include_refinements=true> (referer: http://www.sephora.com/makeup-cosmetics)
2017-06-12 12:43:30 [scrapy] ERROR: Spider error processing <GET http://www.sephora.com/rest/products/?currentPage=1&categoryName=lipstick&include_categories=true&include_refinements=true> (referer: http://www.sephora.com/makeup-cosmetics)
Traceback (most recent call last):
  File "/Users/Lee/anaconda/lib/python2.7/site-packages/scrapy/utils/defer.py", line 45, in mustbe_deferred
    result = f(*args, **kw)
  File "/Users/Lee/anaconda/lib/python2.7/site-packages/scrapy/core/spidermw.py", line 48, in process_spider_input
    return scrape_func(response, request, spider)
  File "/Users/Lee/anaconda/lib/python2.7/site-packages/scrapy/core/scraper.py", line 145, in call_spider
    dfd.addCallbacks(request.callback or spider.parse, request.errback)
  File "/Users/Lee/anaconda/lib/python2.7/site-packages/twisted/internet/defer.py", line 299, in addCallbacks
    assert callable(callback)
AssertionError
2017-06-12 12:43:30 [scrapy] INFO: Closing spider (finished)

1 Answer:

Answer (score: 2):

Short answer: don't.

Long answer: I would approach this differently. The pagination links do not return a new page. Instead, they send a GET request to this URL:

http://www.sephora.com/rest/products/?currentPage=2&categoryName=lipstick&include_categories=true&include_refinements=true

Check the Network tab in your browser's developer tools, then click one of the pagination links: [screenshot: Network tab]

Here you can see the requests your browser makes and the responses it gets back. In this case, clicking a pagination link produces a JSON object that contains all the products displayed on the page.

Now look at the Response tab of that request. Under products you can see entries numbered 0 through 59; these are the products shown on the page, together with all their information, such as id, display_name and, oh, a url.
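
In Python terms, the parsed object has roughly this shape. This is an illustrative skeleton only; the values are placeholders, and the field names are the ones used in the spider above:

# Illustrative skeleton only; the values are placeholders.
data = {
    "products": [
        {
            "id": "...",
            "display_name": "...",
            "product_url": "/product/...",
            "brand_name": "...",
            "derived_sku": {"list_price": "..."},
        },
        # ...one entry per product on the page (0 through 59 here)
    ]
}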

Try right-clicking the request and selecting Open in a new tab to view the response in your browser. Now try setting items per page on the Sephora page to something different. Do you see what happens? The JSON object now returns fewer or more items, depending on what you selected.

So what do we do with this information?

Ideally, our spider requests the JSON object for each page directly (by simply changing the request URL from currentPage=2 to currentPage=3), follows the URLs provided in it (under products/<n>/product_url), and then scrapes the individual products (or just extracts the product list, if that is all you want).
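
Here is a minimal sketch of that loop. The spider and callback names are my own placeholders, and the stop condition (an empty products list once we run past the last page) is an assumption about the endpoint that you should verify:

import json

import scrapy


class LipstickPagesSpider(scrapy.Spider):
    # Hypothetical spider that walks the JSON pages one at a time.
    name = "lipstick_pages"
    page_url = ('http://www.sephora.com/rest/products/?currentPage={}'
                '&categoryName=lipstick&include_categories=true'
                '&include_refinements=true')

    def start_requests(self):
        yield scrapy.Request(self.page_url.format(1),
                             callback=self.parse_page, meta={'page': 1})

    def parse_page(self, response):
        products = json.loads(response.body).get("products", [])
        if not products:
            # Assumed stop condition: an empty page means we are past
            # the last page of results.
            return
        for product in products:
            yield scrapy.Request(response.urljoin(product["product_url"]),
                                 callback=self.parse_product)
        next_page = response.meta['page'] + 1
        yield scrapy.Request(self.page_url.format(next_page),
                             callback=self.parse_page,
                             meta={'page': next_page})

    def parse_product(self, response):
        # Scrape the individual product page here.
        pass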

Luckily, Scrapy (or rather, Python) lets you parse JSON objects and do whatever you like with the parsed data. Even more luckily, Sephora lets you display all items on a single page: just change the request URL to include ?pageSize=-1.
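
With that trick, a single request can cover the whole category. A minimal sketch, assuming the endpoint honors pageSize=-1 as an extra query parameter (the spider name and parse_items are placeholders):

import scrapy


class LipstickAllSpider(scrapy.Spider):
    # Hypothetical spider: one request for the whole category.
    name = "lipstick_all"

    def start_requests(self):
        # pageSize=-1 added as an extra query parameter; verify that the
        # endpoint still honors it before relying on this.
        url = ('http://www.sephora.com/rest/products/?currentPage=1'
               '&pageSize=-1&categoryName=lipstick'
               '&include_categories=true&include_refinements=true')
        yield scrapy.Request(url, callback=self.parse_items)

    def parse_items(self, response):
        # Handle the JSON object as in the quick example below.
        pass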

What you do is yield a request to the URL that produces the JSON object, and define a parse function that processes that object.

Here is a quick example that extracts each product's URL and issues a request to it (I will try to provide a more detailed example later):

import json

import scrapy

# Inside your spider:
def parse(self, response):
    # The body of this response is the JSON object described above.
    data = json.loads(response.body)
    for product in data["products"]:
        # product_url is site-relative, so join it onto the response URL.
        url = response.urljoin(product["product_url"])
        yield scrapy.Request(url=url, callback=self.parse_products)

There you have it. Learning how requests are made to a website really pays off, because you can easily manipulate the request URL to make your life easier. For example, you can change the categoryName in the URL to parse another category, as in the sketch below.
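
For instance, a start_requests that loops over several categories might look like this. The category list just mirrors the categories commented out in the question; which categoryName values the API actually accepts is something to check for yourself:

import scrapy


class MakeupSpider(scrapy.Spider):
    # Hypothetical multi-category spider; 'makeup' is a placeholder name.
    name = "makeup"
    # Candidate categories taken from the question; verify that the API
    # accepts these as categoryName values.
    categories = ['lipstick', 'eyeshadow', 'foundation-makeup']

    def start_requests(self):
        for cat in self.categories:
            url = ('http://www.sephora.com/rest/products/?currentPage=1'
                   '&categoryName=' + cat +
                   '&include_categories=true&include_refinements=true')
            yield scrapy.Request(url, callback=self.parse_items,
                                 meta={'category': cat})

    def parse_items(self, response):
        # Proceed as in the quick example above.
        pass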