I need to scrape data (name, price, description, brand...) from this website: https://www.asos.com/women/new-in/new-in-clothing/cat/?cid=2623&nlid=ww%7Cnew+in%7Cnew+products%7Cclothing
My code looks like this:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class TestcrawlSpider(CrawlSpider):
    name = 'testcrawl'

    def remove_characters(self, value):
        return value.strip('\n')

    allowed_domains = ['www.asos.com']
    start_urls = ['https://www.asos.com/women/new-in/new-in-clothing/cat/?cid=2623&nlid=ww|new+in|new+products|clothing']

    rules = (
        Rule(LinkExtractor(restrict_xpaths="//article[@class='_2qG85dG']/a"), callback='parse_item', follow=True),
        Rule(LinkExtractor(restrict_xpaths="//a[@class='_39_qNys']")),
    )

    def parse_item(self, response):
        yield {
            'name': response.xpath("//div[@class='product-hero']/h1/text()").get(),
            'price': response.xpath("//span[@data-id='current-price']").get(),
            'description': response.xpath("//div[@class='product-description']/ul/li/text()").getall(),
            'about_me': response.xpath("//div[@class='about-me']//text()").getall(),
            'brand_description': response.xpath("//div[@class='brand-description']/p/text()").getall()
        }
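However, the price is filled in by JavaScript, so I can't get it from the page HTML. I need to get it through an XHR request. My code for getting the price of just one item from the list is as follows: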
import scrapy
import json


class AsosSpider(scrapy.Spider):
    name = 'asos'
    allowed_domains = ['www.asos.com']
    start_urls = ['https://www.asos.com/api/product/catalogue/v3/stockprice?productIds=200369183&store=ROW&currency=GBP&keyStoreDataversion=hnm9sjt-28']

    def parse(self, response):
        # print(response.body)
        # The API returns a JSON list; take the first (and only) product entry.
        resp = json.loads(response.text)[0]
        price = resp.get('productPrice').get('current').get('text')
        print(price)
        yield {
            'price': price
        }
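Here, my start_urls is the request URL, and it is different for every item.
Only the productIds value changes!
Do I need to insert the second piece of code into the first one to get the price? How can I do that?
Thanks!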
Answer 0 (score: 0)
items.py:
import scrapy


class AsosItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    description = scrapy.Field()
    about_me = scrapy.Field()
    brand_description = scrapy.Field()
As I said in my previous post, for some reason I'm having trouble with this website on my computer, but you need to do something like this:
import re

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from ..items import AsosItem


class TestcrawlSpider(CrawlSpider):
    name = 'testcrawl'
    allowed_domains = ['www.asos.com']
    start_urls = ['https://www.asos.com/women/new-in/new-in-clothing/cat/?cid=2623&nlid=ww|new+in|new+products|clothing']

    rules = (
        Rule(LinkExtractor(restrict_xpaths="//article[@class='_2qG85dG']/a"), callback='parse_item', follow=True),
        Rule(LinkExtractor(restrict_xpaths="//a[@class='_39_qNys']")),
    )

    def remove_characters(self, value):
        return value.strip('\n')

    def parse_item(self, response):
        # The product page sets the relative URL of the stock/price API in an inline script.
        price_url = 'https://www.asos.com' + re.search(
            r"window\.asos\.pdp\.config\.stockPriceApiUrl = '(.+)'", response.text
        ).group(1)

        item = AsosItem()
        item['name'] = response.xpath("//div[@class='product-hero']/h1/text()").get()
        item['description'] = response.xpath("//div[@class='product-description']/ul/li/text()").getall()
        item['about_me'] = response.xpath("//div[@class='about-me']//text()").getall()
        item['brand_description'] = response.xpath("//div[@class='brand-description']/p/text()").getall()

        # Carry the partially filled item over to the price request.
        request = scrapy.Request(url=price_url, callback=self.parse_price)
        request.meta['item'] = item
        return request

    def parse_price(self, response):
        # response.json() needs Scrapy 2.2 or later; use json.loads(response.text) on older versions.
        jsonresponse = response.json()[0]
        price = jsonresponse['productPrice']['current']['text']

        item = response.meta['item']
        item['price'] = price
        return item
Test the code. If it doesn't work, take the general idea and adjust it a little; I can't test it myself.
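Since the spider above could not be tested against the live site, here is a minimal offline sketch of its two fragile steps: pulling the price URL out of the page source and reading the price text out of the JSON reply. The helper names (`extract_price_url`, `build_price_url`, `extract_price_text`) and both sample strings are hypothetical; the samples are only shaped after the URL and JSON fields quoted in the question, not captured from the site. `build_price_url` covers the "only productIds changes" observation, and it leaves out the `keyStoreDataversion` parameter on the assumption that the API may not require it, which is unverified.

```python
import json
import re
from urllib.parse import urlencode, urljoin

BASE_URL = "https://www.asos.com"

# Matches the inline script assignment that the answer's spider relies on.
PRICE_URL_RE = re.compile(
    r"window\.asos\.pdp\.config\.stockPriceApiUrl\s*=\s*'([^']+)'"
)


def extract_price_url(html):
    """Return the absolute stock-price URL found in a product page, or None."""
    match = PRICE_URL_RE.search(html)
    if match is None:
        return None
    return urljoin(BASE_URL, match.group(1))


def build_price_url(product_id, store="ROW", currency="GBP"):
    """Build the stock-price URL from a product ID (assumed query parameters)."""
    query = urlencode(
        {"productIds": product_id, "store": store, "currency": currency}
    )
    return f"{BASE_URL}/api/product/catalogue/v3/stockprice?{query}"


def extract_price_text(body):
    """Return productPrice.current.text of the first entry, or None if absent."""
    data = json.loads(body)
    if not data:
        return None
    return data[0].get("productPrice", {}).get("current", {}).get("text")


# Hypothetical sample inputs, shaped after the question's URL and JSON fields.
sample_html = (
    "<script>window.asos.pdp.config.stockPriceApiUrl = "
    "'/api/product/catalogue/v3/stockprice"
    "?productIds=200369183&store=ROW&currency=GBP';</script>"
)
sample_json = (
    '[{"productId": 200369183, '
    '"productPrice": {"current": {"value": 25.0, "text": "£25.00"}}}]'
)

if __name__ == "__main__":
    print(extract_price_url(sample_html))
    print(build_price_url(200369183))
    print(extract_price_text(sample_json))
```

Returning `None` instead of raising lets `parse_item` skip or log a product whose page does not contain the script assignment, rather than failing with an `AttributeError` on `.group(1)`.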