I am trying to scrape amazon.com with Scrapy, but I only get the product rating and not the product name or price

Asked: 2019-05-28 09:09:24

Tags: web-scraping amazon

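I am using Scrapy to scrape phone names, prices, and ratings from amazon.com, but I only get the ratings; the name and price come back as empty lists. What could be wrong?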

Here is the code:

import scrapy


class AmazonItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    rating = scrapy.Field()


class myspider(scrapy.Spider):
    name = "amazon_spider"

    def start_requests(self):
        urls = [
            "https://www.amazon.in/s?k=samsung"
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        items = AmazonItem()

        # Selectors exactly as written in the question; the name and
        # price selectors are the ones that return empty lists.
        name = response.css('span.a-size-medium a-color-base a-text-normal::text').extract()
        price = response.css('span.a-price-whole::text').extract()
        rating = response.css('span.a-icon-alt::text').extract()

        items['name'] = name
        items['price'] = price
        items['rating'] = rating

        yield items

This is the output I get:

2019-05-28 14:50:32 [scrapy.utils.log] INFO: Scrapy 1.5.2 started (bot: amazon)
2019-05-28 14:50:33 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 18.7.0, Python 3.7.0 (default, Jun 28 2018, 08:04:48) [MSC v.1912 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.1.1b  26 Feb 2019), cryptography 2.6.1, Platform Windows-10-10.0.17134-SP0
2019-05-28 14:50:33 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'amazon', 'FEED_FORMAT': 'json', 'FEED_URI': 'amazon.json', 'NEWSPIDER_MODULE': 'amazon.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['amazon.spiders'], 'USER_AGENT': 'Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Safari/537.36'}
2019-05-28 14:50:33 [scrapy.extensions.telnet] INFO: Telnet Password: 2423d32d709a9f10
2019-05-28 14:50:33 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2019-05-28 14:50:33 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-05-28 14:50:33 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-05-28 14:50:33 [scrapy.middleware] INFO: Enabled item pipelines: []
2019-05-28 14:50:33 [scrapy.core.engine] INFO: Spider opened
2019-05-28 14:50:33 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-05-28 14:50:33 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-05-28 14:50:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.in/robots.txt> (referer: None)
2019-05-28 14:50:34 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.amazon.in/samsung/s?ie=UTF8&page=1&rh=i%3Aaps%2Ck%3Asamsung> from <GET https://www.amazon.in/s?k=samsung>
2019-05-28 14:50:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.in/samsung/s?ie=UTF8&page=1&rh=i%3Aaps%2Ck%3Asamsung> (referer: None)
2019-05-28 14:50:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.in/samsung/s?ie=UTF8&page=1&rh=i%3Aaps%2Ck%3Asamsung>
{'name': [],
 'price': [],
 'rating': ['3.1 out of 5 stars',
            '3.1 out of 5 stars',
            '3.1 out of 5 stars',
            '3.9 out of 5 stars',
            '4 out of 5 stars',
            '3.9 out of 5 stars',
            '3.6 out of 5 stars',
            '4.1 out of 5 stars',
            '4 out of 5 stars',
            '3.8 out of 5 stars',
            '4 out of 5 stars',
            '4 out of 5 stars',
            '3.9 out of 5 stars',
            '3.5 out of 5 stars',
            '3.6 out of 5 stars',
            '3.9 out of 5 stars',
            '4 Stars & Up',
            '3 Stars & Up',
            '2 Stars & Up',
            '1 Star & Up']}
2019-05-28 14:50:35 [scrapy.core.engine] INFO: Closing spider (finished)
2019-05-28 14:50:35 [scrapy.extensions.feedexport] INFO: Stored json feed (1 items) in: amazon.json
2019-05-28 14:50:35 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 976,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'downloader/response_bytes': 72313,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 2,
 'downloader/response_status_count/301': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 5, 28, 9, 20, 35, 442431),
 'item_scraped_count': 1,
 'log_count/DEBUG': 5,
 'log_count/INFO': 9,
 'response_received_count': 2,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2019, 5, 28, 9, 20, 33, 476571)}
2019-05-28 14:50:35 [scrapy.core.engine] INFO: Spider closed (finished)
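
In the scraped item above, the last four `rating` entries ('4 Stars & Up' through '1 Star & Up') do not follow the "N out of 5 stars" pattern. They look like labels from a rating filter on the page rather than product ratings, though that is an inference from the text alone. A minimal sketch of one way to keep only the numeric ratings, using a hypothetical helper `parse_ratings` and a regular expression that assumes the "N out of 5 stars" wording seen in the output:

```python
import re

# Assumes the "N out of 5 stars" wording shown in the scraped output.
RATING_RE = re.compile(r'^(\d+(?:\.\d+)?) out of 5 stars$')


def parse_ratings(raw):
    """Return numeric ratings, skipping strings that are not product ratings."""
    ratings = []
    for text in raw:
        match = RATING_RE.match(text.strip())
        if match:
            ratings.append(float(match.group(1)))
    return ratings


# A few values copied from the scraped item.
sample = ['3.1 out of 5 stars', '4 out of 5 stars', '4 Stars & Up', '1 Star & Up']
print(parse_ratings(sample))  # prints [3.1, 4.0]
```

Filtering after the fact only cleans up the list. It does not tie a rating to a particular product, so it is a stopgap rather than a replacement for selecting each field inside its own result container.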

0 answers:

No answers