Spider works fine, but some results are not scraped

Time: 2019-08-30 05:55:50

Tags: scrapy web-crawler

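The spider works fine and returns information for about 208 products, but for some products the detail fields come back empty. I ran those product links individually in scrapy shell and they work fine there, so why is the spider missing about 25% of the details?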

I have tried rotating user agents and applying different XPaths, but in vain.
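For reference, user-agent rotation in Scrapy is usually done in a downloader middleware. The post does not show how it was done here, so the following is only a minimal sketch: the class name `RandomUserAgentMiddleware` and the strings in `USER_AGENTS` are hypothetical placeholders, and only the `process_request(request, spider)` hook is Scrapy's own interface.

```python
import random

# Hypothetical pool; replace with real, current browser user-agent strings.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/2.0",
]


class RandomUserAgentMiddleware:
    """Downloader middleware that sets a random User-Agent on each request."""

    def __init__(self, user_agents=USER_AGENTS, rng=random):
        self.user_agents = list(user_agents)
        self.rng = rng

    def process_request(self, request, spider):
        # Scrapy calls this hook for every outgoing request.
        request.headers["User-Agent"] = self.rng.choice(self.user_agents)
        return None  # None tells Scrapy to keep processing the request
```

A middleware like this would be enabled through the `DOWNLOADER_MIDDLEWARES` setting, under whatever module path the project actually uses.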

import scrapy

from ..items import AmazonItem


class QuotesSpider(scrapy.Spider):
    name = 'pet'
    start_urls = ['https://www.amazon.co.uk/s?k=moleskine&rh=p_89%3AMoleskine&dc&qid=1567115653&rnid=1632651031&ref=sr_nr_p_89_1',
                  'https://www.amazon.co.uk/s?k=moleskine&rh=p_89%3AMoleskine&dc&page=2',
                  'https://www.amazon.co.uk/s?k=moleskine&rh=p_89%3AMoleskine&dc&page=3',
                  'https://www.amazon.co.uk/s?k=moleskine&rh=p_89%3AMoleskine&dc&page=4',
                  'https://www.amazon.co.uk/s?k=moleskine&rh=p_89%3AMoleskine&dc&page=5'
                  ]

    def parse(self, response):
        # Collect product-detail links (/dp/...) from the search results page.
        links = response.xpath("//h2/a[contains(@href, '/dp')]/@href").getall()
        for link in links:
            # urljoin resolves the relative href against the page URL.
            yield scrapy.Request(url=response.urljoin(link), callback=self.parse_details)

    def parse_details(self, response):
        item = AmazonItem()

        name = response.xpath("//*[@id='productTitle']/text()").get()
        if name is None:
            # Missing title: log the URL so the failing pages can be inspected.
            self.logger.info('No product title found at %s', response.url)
        else:
            name = name.replace('\n', '').strip()

        price = response.xpath("//span[@id='price_inside_buybox']/text()").get()
        if price is None:
            # Fall back to the generic price span when there is no buy box price.
            self.logger.info('No buy box price found at %s', response.url)
            price = response.xpath("//span[@class='a-color-price']/text()").get()
            if price is None:
                price = 'No Price Available'
        else:
            price = price.replace('\n', '').replace(' ', '')

        # A bold element in the shipping message is treated as the Prime marker.
        prime = response.xpath("//span[@id='price-shipping-message']/b").get()
        prime = 'Not Prime' if prime is None else 'Prime'

        sales_rank = response.xpath("//tr[@id='SalesRank']/td[@class='value']/text()").get()
        if sales_rank is None:
            sales_rank = 'No Sales Rank Available'
        else:
            sales_rank = sales_rank.replace('(', '').replace('\n', '')

        item['Name'] = name
        item['Price'] = price
        item['SalesRank'] = sales_rank
        item['Prime'] = prime
        item['Url'] = response.url
        yield item
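The repeated `None` checks in `parse_details` could be folded into one small helper. This is only a sketch: `clean_text` and its `remove` parameter are hypothetical names, not part of Scrapy or of the original spider.

```python
def clean_text(value, default=None, remove=""):
    """Return `value` without newlines or surrounding whitespace.

    Returns `default` when the selector found nothing (value is None).
    Every character listed in `remove` is deleted before stripping.
    """
    if value is None:
        return default
    for ch in remove:
        value = value.replace(ch, "")
    return value.replace("\n", "").strip()
```

With it, the price lookup might read `clean_text(price, default='No Price Available', remove=' ')`.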

Box 2 contains correct information, but box 1 has no data, even though the data is there if we open the link. That is the product name from box 1's URL: it works fine in scrapy shell, but not in the spider.
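Since the same URL works in scrapy shell, it may help to check what the spider itself received for the empty rows. The sketch below assumes, without any confirmation from this post, that the failing responses could be an anti-bot interstitial rather than the product page. The function name `classify_body` and the marker strings are guesses for illustration only.

```python
# Marker strings are guesses for illustration, not confirmed by the question.
ROBOT_CHECK_MARKERS = ("robot check", "enter the characters you see below")


def classify_body(html):
    """Roughly classify a response body fetched by the spider."""
    lowered = html.lower()
    if 'id="producttitle"' in lowered:
        return "product_page"
    if any(marker in lowered for marker in ROBOT_CHECK_MARKERS):
        return "robot_check"
    return "unknown"
```

Inside `parse_details`, logging `classify_body(response.text)` together with `response.status` whenever the title is missing would show whether the empty rows come from a different page than the one seen in the shell.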

Am I missing something?

0 answers:

No answers yet