I'm using a Scrapy CrawlSpider to crawl http://www.sephora.com/lipstick. How should I set up the LinkExtractor so that it scrapes all the pages?
import json

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# ProductItem is defined in this project's items.py
class SephoraSpider(CrawlSpider):
    name = "sephora"
    # custom_settings = {"IMAGES_STORE": '../images/sephora'}
    # allowed_domains = ["sephora.com/"]

    start_urls = [
        'http://www.sephora.com/lipstick',
        # 'http://www.sephora.com/eyeshadow',
        # 'http://www.sephora.com/foundation-makeup'
    ]

    rules = (
        Rule(LinkExtractor(
                # restrict_xpaths='//*[@id="main"]/div[4]/div[5]/div[1]/div/div[2]/div[3]/div[7]',
                allow=('sephora.com/')),
             callback='parse_items',
             follow=True),
    )

    def parse(self, response):
        # category = ['lipstick']
        # for cat in category:
        full_url = 'http://www.sephora.com/rest/products/?currentPage=1&categoryName=lipstick&include_categories=true&include_refinements=true'
        my_request = scrapy.Request(full_url, callback='parse_items')
        my_request.meta['page'] = {'to_replace': "currentPage=1"}
        yield my_request

    def parse_items(self, response):
        # cat_json = response.xpath('//script[@id="searchResult"]/text()').extract_first()
        # all_url_data = json.loads(cat_json.encode('utf-8'))
        # if "products" not in all_url_data:
        #     return
        # products = all_url_data['products']
        products = json.loads(response.body)['products']
        print(products)
        for each_product in products:
            link = each_product['product_url']
            full_url = "http://www.sephora.com" + link
            name = each_product["display_name"]
            if 'list_price' not in each_product['derived_sku']:
                price = each_product['derived_sku']['list_price_max']
            else:
                price = each_product['derived_sku']["list_price"]
            brand = each_product["brand_name"]
            item = ProductItem(
                name=name,
                price=price,
                brand=brand,
                full_url=full_url,
                category=response.url[23:])
            yield item

        to_replace = response.meta['page']['to_replace']
        cat = response.meta['page']['category']
        next_number = int(to_replace.replace("currentPage=", "")) + 1
        next_link = response.url.replace(
            to_replace, "currentPage=" + str(next_number))
        print(next_link)
        my_request = scrapy.Request(
            next_link,
            self.parse_items)
        my_request.meta['page'] = {
            "to_replace": "currentPage=" + str(next_number),
        }
        yield my_request
Now I'm getting this error:
2017-06-12 12:43:30 [scrapy] DEBUG: Crawled (200) <GET http://www.sephora.com/rest/products/?currentPage=1&categoryName=lipstick&include_categories=true&include_refinements=true> (referer: http://www.sephora.com/makeup-cosmetics)
2017-06-12 12:43:30 [scrapy] ERROR: Spider error processing <GET http://www.sephora.com/rest/products/?currentPage=1&categoryName=lipstick&include_categories=true&include_refinements=true> (referer: http://www.sephora.com/makeup-cosmetics)
Traceback (most recent call last):
  File "/Users/Lee/anaconda/lib/python2.7/site-packages/scrapy/utils/defer.py", line 45, in mustbe_deferred
    result = f(*args, **kw)
  File "/Users/Lee/anaconda/lib/python2.7/site-packages/scrapy/core/spidermw.py", line 48, in process_spider_input
    return scrape_func(response, request, spider)
  File "/Users/Lee/anaconda/lib/python2.7/site-packages/scrapy/core/scraper.py", line 145, in call_spider
    dfd.addCallbacks(request.callback or spider.parse, request.errback)
  File "/Users/Lee/anaconda/lib/python2.7/site-packages/twisted/internet/defer.py", line 299, in addCallbacks
    assert callable(callback)
AssertionError
2017-06-12 12:43:30 [scrapy] INFO: Closing spider (finished)
Answer (score: 2)

Short answer: don't.

(As for the AssertionError itself: it comes from scrapy.Request(full_url, callback='parse_items'). Unlike a CrawlSpider Rule, scrapy.Request expects an actual callable such as self.parse_items, not a string, which is why Twisted's assert callable(callback) fails.)
Long answer: I would approach this differently. The pagination links don't return a new page. Instead, they send a GET request to this URL:

http://www.sephora.com/rest/products/?currentPage=2&categoryName=lipstick&include_categories=true&include_refinements=true
You can watch the requests and responses your browser makes in its developer tools. In this case, clicking a pagination link produces a JSON object containing all the products displayed on the page.
Now look at the Response tab of that request. Under products you can see the numbers 0 through 59; these are the products displayed on the page, together with all of their information, such as id, display_name and, lo and behold, url.
Try right-clicking the request and choosing Open in a new tab to view the response in your browser. Now try setting items per page on the Sephora page to something different. See what happens? The JSON object now returns fewer or more items (depending on what you chose).
So what do we do with this information?
Ideally, our spider requests the JSON object for each page directly (by simply changing the request URL from currentPage=2 to currentPage=3), follows the URLs it provides (under products/n-product/product_url), and then crawls the individual items (or just extracts the product list, if that's all you want).
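
For the page-to-page step, the URL manipulation can stay very small. A minimal sketch, assuming this runs inside your parse callback after the current page's products have been handled, and that you stop once the endpoint returns an empty products list:

import re

# inside parse(), once this page's products have been processed:
page = int(re.search(r"currentPage=(\d+)", response.url).group(1))
next_url = response.url.replace("currentPage=%d" % page,
                                "currentPage=%d" % (page + 1))
yield scrapy.Request(next_url, callback=self.parse)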
Luckily, Scrapy (or rather, Python) lets you parse JSON objects and do whatever you want with the parsed data. Even more luckily, Sephora lets you display all items on a single page by changing the request URL to ?pageSize=-1.
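
If that works, pagination becomes unnecessary and one request per category is enough. Whether pageSize=-1 can simply be combined with the endpoint's other parameters is an assumption; verify it against the request your browser actually sends:

# assumed all-in-one URL; check it against the browser's actual request
url = ("http://www.sephora.com/rest/products/?pageSize=-1"
       "&categoryName=lipstick&include_categories=true"
       "&include_refinements=true")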
What you do is yield a request to the URL that produces the JSON object and define a parse function that handles it.
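
To kick things off, a start_requests along these lines would fetch the first JSON page (a sketch using the endpoint URL from above; the class is pared down to the relevant part):

import scrapy

class SephoraSpider(scrapy.Spider):
    name = "sephora"

    def start_requests(self):
        # request the JSON endpoint directly instead of the HTML page
        url = ("http://www.sephora.com/rest/products/?currentPage=1"
               "&categoryName=lipstick&include_categories=true"
               "&include_refinements=true")
        yield scrapy.Request(url, callback=self.parse)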
Here's a quick example that extracts every product's URL and makes a request to it (I'll try to provide a more detailed example later):
import json
import scrapy

# as the callback of the request to the JSON endpoint:
def parse(self, response):
    data = json.loads(response.body)
    for product in data["products"]:
        url = response.urljoin(product["product_url"])
        yield scrapy.Request(url=url, callback=self.parse_products)
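
Since the JSON already carries the fields your parse_items extracted (display_name, brand_name, the derived_sku prices), you could also build items straight from it and skip the per-product requests. A sketch reusing your ProductItem; the fixed category value is just an example, so derive it from the request URL if you need to:

# inside parse(), as an alternative to requesting every product page:
for product in data["products"]:
    sku = product.get("derived_sku", {})
    yield ProductItem(
        name=product["display_name"],
        brand=product["brand_name"],
        price=sku.get("list_price", sku.get("list_price_max")),
        full_url=response.urljoin(product["product_url"]),
        category="lipstick")  # example value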
There you have it. Learning how a site makes its requests really pays off, because you can easily manipulate the request URL to make your life easier. For example, you can change the categoryName in the URL to parse another category.
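
A sketch of that last idea, looping over the categories that were commented out in your original start_urls (assuming they are all valid categoryName values):

def start_requests(self):
    for category in ["lipstick", "eyeshadow", "foundation-makeup"]:
        url = ("http://www.sephora.com/rest/products/?currentPage=1"
               "&categoryName=%s&include_categories=true"
               "&include_refinements=true" % category)
        yield scrapy.Request(url, callback=self.parse)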