亚马逊评论:列表索引超出范围

时间:2019-04-22 22:20:49

标签: scrapy

我想刮擦客户对亚马逊kindle paperwhite的评论。

我知道,尽管亚马逊可能会说有5900条评论,但最多只能访问5000条评论。 (在page = 500之后,将不再显示评论,每页10条评论。)

在最初的几页中,我的蜘蛛每页返回10条评论,但后来缩小到一两个。结果只有大约1300条评论。 将变量“ helpul”和“ verified”的数据相加似乎存在问题。两者都引发以下错误:

'helpful': ''.join(helpful[count]),
IndexError: list index out of range

任何帮助将不胜感激!

我尝试实现if语句,以防变量为空或包含列表,但是没有用。

我的蜘蛛amazon_reviews.py:

import scrapy
from scrapy.extensions.throttle import AutoThrottle

class AmazonReviewsSpider(scrapy.Spider):

    name = 'amazon_reviews'

    allowed_domains = ['amazon.com']

    myBaseUrl = "https://www.amazon.com/Kindle-Paperwhite-Waterproof-Storage-Special/product-reviews/B07CXG6C9W/ref=cm_cr_dp_d_show_all_top?ie=UTF8&reviewerType=all_reviews&pageNumber="
    start_urls=[]

    # Creating list of urls to be scraped by appending page number a the end of base url
    for i in range(1,550):
        start_urls.append(myBaseUrl+str(i))

    def parse(self, response):
            data = response.css('#cm_cr-review_list')         

            # Collecting various data
            star_rating = data.css('.review-rating')
            title = data.css('.review-title')
            text = data.css('.review-text')
            date = data.css('.review-date'))
            # Number how many people thought the review was helpful.
            helpful = response.xpath('.//span[@data-hook="helpful-vote-statement"]//text()').extract()
            verified = response.xpath('.//span[@data-hook="avp-badge"]//text()').extract()
            # I scrape more information, but deleted it here not to make the code too big

            # yielding the scraped results
            for review in star_rating:
                yield{'ASIN': 'B07CXG6C9W',
                      #'ID': ''.join(id.xpath('.//text()').extract()),
                      'stars': ''.join(review.xpath('.//text()').extract_first()),
                      'title': ''.join(title[count].xpath(".//text()").extract_first()),
                      'text': ''.join(text[count].xpath(".//text()").extract_first()),
                      'date': ''.join(date[count].xpath(".//text()").extract_first()),

                  ### There seems to be a problem with adding these two, as I get 5000 reviews back if I delete them. ###
                      'verified purchase': ''.join(verified[count]),
                      'helpful': ''.join(helpful[count])

                      }
                count=count+1

我的settings.py:

AUTOTHROTTLE_ENABLED = True
CONCURRENT_REQUESTS = 2
DOWNLOAD_TIMEOUT = 180
REDIRECT_ENABLED = False
#DOWNLOAD_DELAY =5.0
RANDOMIZE_DOWNLOAD_DELAY = True

数据提取工作正常。我得到的评论具有完整而准确的信息。我获得的评论数量太少。

当我使用以下命令运行Spider时:

runspider amazon_reviews_scraping_test\amazon_reviews_scraping_test\spiders\amazon_reviews.py -o reviews.csv

控制台上的输出如下所示:

2019-04-22 11:54:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/Kindle-Paperwhite-Waterproof-Storage-Special/product-reviews/B07CXG6C9W/ref=cm_cr_dp_d_show_all_top?ie=UTF8&reviewerType=all_reviews&pageNumber=164> (referer: None)
2019-04-22 11:54:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/Kindle-Paperwhite-Waterproof-Storage-Special/product-reviews/B07CXG6C9W/ref=cm_cr_dp_d_show_all_top?ie=UTF8&reviewerType=all_reviews&pageNumber=161>
{'ASIN': 'B07CXG6C9W', 'stars': '5.0 out of 5 stars', 'username': 'BRANDI', 'title': 'Bookworms rejoice!', 'text': "The (...) 5 STARS! ", 'date': 'December 7, 2018'}
2019-04-22 11:54:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/Kindle-Paperwhite-Waterproof-Storage-Special/product-reviews/B07CXG6C9W/ref=cm_cr_dp_d_show_all_top?ie=UTF8&reviewerType=all_reviews&pageNumber=161>
{'ASIN': 'B07CXG6C9W', 'stars': '5.0 out of 5 stars', 'username': 'Doug Stender', 'title': 'As good as adverised', 'text': 'I read (...) mazon...', 'date': 'January 8, 2019'}
2019-04-22 11:54:41 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.amazon.com/Kindle-Paperwhite-Waterproof-Storage-Special/product-reviews/B07CXG6C9W/ref=cm_cr_dp_d_show_all_top?ie=UTF8&reviewerType=all_reviews&pageNumber=161> (referer: None)
Traceback (most recent call last):
  File "C:\Users\John\Anaconda3\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "C:\Users\John\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 30, in process_spider_output
    for x in result:
  File "C:\Users\John\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "C:\Users\John\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\John\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\John\OneDrive\Dokumente\Uni\05_SS 19\Masterarbeit\Code\Scrapy\amazon_reviews_scraping_test\amazon_reviews_scraping_test\spiders\amazon_reviews.py", line 78, in parse
    'helpful': ''.join(helpful[count]),
IndexError: list index out of range

1 个答案:

答案 0 :(得分:0)

事实证明,如果评论没有“ verified”标记,或者没有人评论,则scrapy所寻找的html部分不存在,因此没有项目添加到列表中,从而使“ verified” ”和“评论”列表要比其他列表短。由于此错误,列表中的所有项目均被删除,并且未添加到我的csv文件中。下面的简单修补程序可以检查列表是否与其他列表一样长:)

编辑: 使用此修复程序时,可能会将值分配给错误的审阅,因为它总是被添加到列表的末尾。 为了安全起见,请不要刮擦经过验证的标签,也不要将整个列表替换为“ Na”或其他表示该值不清楚的内容。

helpful = response.xpath('.//span[@data-hook="helpful-vote-statement"]//text()').extract()
while len(helpful) != len(date):
                helpful.append("0 people found this helpful")