It works fine and collects around 208 product records, but for some products the detail page yields no results. I have run those same product links individually in scrapy shell and they extract fine there, so why is the spider missing the details for roughly 25% of the products?
I have tried rotating user agents and applying different XPaths, but to no avail.
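This is roughly how I rotated the user agents (a minimal downloader middleware; the UA strings and the 'myproject' path are just placeholders for my real ones):

import random

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1.2 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0',
]

class RandomUserAgentMiddleware:
    """Pick a random User-Agent header for every outgoing request."""
    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(USER_AGENTS)

# enabled in settings.py:
# DOWNLOADER_MIDDLEWARES = {
#     'myproject.middlewares.RandomUserAgentMiddleware': 400,
# }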
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from ..items import AmazonItem
import time
from scrapy.linkextractors import LinkExtractor
import urllib.parse
class QuotesSpider(scrapy.Spider):
    name = 'pet'
    start_urls = [
        'https://www.amazon.co.uk/s?k=moleskine&rh=p_89%3AMoleskine&dc&qid=1567115653&rnid=1632651031&ref=sr_nr_p_89_1',
        'https://www.amazon.co.uk/s?k=moleskine&rh=p_89%3AMoleskine&dc&page=2',
        'https://www.amazon.co.uk/s?k=moleskine&rh=p_89%3AMoleskine&dc&page=3',
        'https://www.amazon.co.uk/s?k=moleskine&rh=p_89%3AMoleskine&dc&page=4',
        'https://www.amazon.co.uk/s?k=moleskine&rh=p_89%3AMoleskine&dc&page=5',
    ]

    def parse(self, response):
        # collect every product detail (/dp/) link on the results page
        links = response.xpath("//h2/a[contains(@href,'/dp')]/@href").extract()
        urls = ['https://www.amazon.co.uk' + link for link in links]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse_details)

    def parse_details(self, response):
        item = AmazonItem()

        # product title
        name = response.xpath("//*[@id='productTitle']/text()").extract_first()
        if name is None:
            self.logger.info('skip')
        else:
            name = name.replace('\n', '').strip()

        # price: try the buybox price first, then fall back to the generic price span
        price = response.xpath("//span[@id='price_inside_buybox']/text()").get()
        if price is None:
            price = response.xpath("//span[@class='a-color-price']/text()").get()
            if price is None:
                price = 'No Price Available'
                self.logger.info('skip')
        else:
            price = price.replace('\n', '').replace(' ', '')

        # Prime badge in the shipping message
        prime = response.xpath("//span[@id='price-shipping-message']/b").get()
        prime = 'Not Prime' if prime is None else 'Prime'

        # sales rank from the product details table
        sales_rank = response.xpath("//tr[@id='SalesRank']/td[@class='value']/text()").get()
        if sales_rank is None:
            sales_rank = 'No Sales Rank Available'
        else:
            sales_rank = sales_rank.replace('(', '').replace('\n', '')

        item['Name'] = name
        item['Price'] = price
        item['SalesRank'] = sales_rank
        item['Prime'] = prime
        item['Url'] = response.url
        yield item
Box 2 contains the correct information, but Box 1 has no data, even though the data is there if you open the link. That is the product title from Box 1's URL: it extracts fine in scrapy shell, but not in the spider.
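For reference, this is the check I did in scrapy shell on one of the product URLs that comes back empty in the spider (the URL below is a placeholder; I used the actual /dp/ links collected by parse):

# $ scrapy shell 'https://www.amazon.co.uk/dp/<ASIN>'
response.xpath("//*[@id='productTitle']/text()").extract_first()
# in the shell this returns the product title, while the same XPath
# gives None when the spider hits the same URL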
Am I missing something?