Question

I'm trying to scrape some data from TripAdvisor. I'm interested in getting the "Price range / Cuisines / Meals" of a restaurant.

So I use the following XPath to extract these 3 rows, which all share the same class:

response.xpath('//div[@class="restaurants-detail-overview-cards-DetailsSectionOverviewCard__categoryTitle--14zKt"]/text()').extract()[1]

I tested it directly in the Scrapy shell and it works fine:

scrapy shell https://www.tripadvisor.com/Restaurant_Review-g187514-d15364769-Reviews-La_Gaditana_Castellana-Madrid.html

But when I integrate it into my script, I get the following error:

Traceback (most recent call last):
  File "/usr/lib64/python3.6/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/usr/lib64/python3.6/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/usr/lib64/python3.6/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/lib64/python3.6/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/lib64/python3.6/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/root/Scrapy_TripAdvisor_Restaurant-master/tripadvisor_las_vegas/tripadvisor_las_vegas/spiders/res_las_vegas.py", line 64, in parse_listing
    (response.xpath('//div[@class="restaurants-details-card-TagCategories__categoryTitle--o3o2I"]/text()')[1])
  File "/usr/lib/python3.6/site-packages/parsel/selector.py", line 61, in __getitem__
    o = super(SelectorList, self).__getitem__(pos)
IndexError: list index out of range

Here is the part of my code that crashes, with an explanation below:

    # extract restaurant cuisine
    row_cuisine_overviewcard = \
    (response.xpath('//div[@class="restaurants-detail-overview-cards-DetailsSectionOverviewCard__categoryTitle--14zKt"]/text()')[1])
    row_cuisine_card = \
    (response.xpath('//div[@class="restaurants-details-card-TagCategories__categoryTitle--o3o2I"]/text()')[1])


    if (row_cuisine_overviewcard == "CUISINES"):
        cuisine = \
        response.xpath('//div[@class="restaurants-detail-overview-cards-DetailsSectionOverviewCard__tagText--1XLfi"]/text()')[1]
    elif (row_cuisine_card == "CUISINES"):
        cuisine = \
        response.xpath('//div[@class="restaurants-details-card-TagCategories__tagText--2170b"]/text()')[1]
    else:
        cuisine = None

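TripAdvisor has 2 different types of restaurant page, with 2 different formats. The first uses the overviewcard class, the second uses the card class.

So I want to check whether the first one (overviewcard) is present; if not, try the second one (card); and if neither is present, set the value to None.

:D But it looks like Python executes both, and since the second one does not exist on the page, the script stops.

Could it be an indentation error?

Thanks for your help. Regards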

Answer 1

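Your second selector (row_cuisine_card) fails because that element does not exist on the page. When you then try to access [1] on the result, an error is raised because the result list is empty.

Assuming you really do want item 1, try: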

# Here we get all the values as lists of strings; getall() returns an empty list if nothing matches.
row_cuisine_overviewcard = \
(response.xpath('//div[@class="restaurants-detail-overview-cards-DetailsSectionOverviewCard__categoryTitle--14zKt"]/text()').getall())
row_cuisine_card = \
(response.xpath('//div[@class="restaurants-details-card-TagCategories__categoryTitle--o3o2I"]/text()').getall())


# Here we check first that the result has more than 1 item, and then we check the value.
if (len(row_cuisine_overviewcard) > 1 and row_cuisine_overviewcard[1] == "CUISINES"):
    cuisine = \
    response.xpath('//div[@class="restaurants-detail-overview-cards-DetailsSectionOverviewCard__tagText--1XLfi"]/text()')[1]
elif (len(row_cuisine_card) > 1 and row_cuisine_card[1] == "CUISINES"):
    cuisine = \
    response.xpath('//div[@class="restaurants-details-card-TagCategories__tagText--2170b"]/text()')[1]
else:
    cuisine = None

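Apply the same kind of safety check whenever you read a specific index from a selector result. In other words, make sure the value exists before you access it.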
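One way to package that check is a small helper that returns a default instead of raising. This is only a minimal sketch: `item_or_default` is a hypothetical name, not part of Scrapy or parsel, and it assumes you pass it the list of strings returned by `.getall()`. The sample values are made up.

```python
def item_or_default(values, index, default=None):
    """Return values[index] if that position exists, otherwise default.

    Hypothetical helper, not part of Scrapy or parsel.
    """
    # Only non-negative indexes are handled, to keep the sketch simple.
    if 0 <= index < len(values):
        return values[index]
    return default


# Plain lists stand in for the result of response.xpath(...).getall().
titles_found = ["PRICE RANGE", "CUISINES", "MEALS"]  # made-up sample values
titles_missing = []

print(item_or_default(titles_found, 1))    # CUISINES
print(item_or_default(titles_missing, 1))  # None
```

With a helper like this, the comparison against "CUISINES" simply evaluates to false when the element is missing, so the if/elif/else chain can fall through to the next case.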

Answer 2

Your problem is already in the check on this line:

row_cuisine_card = \
    (response.xpath('//div[@class="restaurants-details-card-TagCategories__categoryTitle--o3o2I"]/text()')[1])

You are trying to extract a value from the website that may not exist. In other words, if

response.xpath('//div[@class="restaurants-details-card-TagCategories__categoryTitle--o3o2I"]/text()')

returns no elements or only one element, then you cannot access the second element of the returned list (which is what the appended [1] tries to do).
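The failure can be reproduced without Scrapy. In this small sketch, plain Python lists stand in for the selector result (the traceback shows parsel's SelectorList handing the index to the ordinary list `__getitem__`), and the sample values are made up for illustration.

```python
# Plain lists stand in for what response.xpath(...) returns; values are made up.
outcomes = []
for matches in ([], ["PRICE RANGE"], ["PRICE RANGE", "CUISINES"]):
    try:
        # Index 1 only exists when there are at least two matches.
        outcomes.append(matches[1])
    except IndexError as exc:
        outcomes.append(f"IndexError: {exc}")

for outcome in outcomes:
    print(outcome)
# IndexError: list index out of range
# IndexError: list index out of range
# CUISINES
```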

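I suggest first storing the values you extract from the website in local variables, and then checking whether the value you want was actually found. My guess is that the page it fails on does not contain the information you want.

That could look roughly like the following code: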

# extract restaurant cuisine
cuisine = None
# getall() returns a list of strings, so the comparisons with "CUISINES" below work
cuisine_overviewcard_sections = response.xpath('//div[@class="restaurants-detail-overview-cards-DetailsSectionOverviewCard__categoryTitle--14zKt"]/text()').getall()
# Note: cuisine stays None unless both title lists have at least two entries
if len(cuisine_overviewcard_sections) >= 2:
    row_cuisine_overviewcard = cuisine_overviewcard_sections[1]
    cuisine_card_sections = response.xpath('//div[@class="restaurants-details-card-TagCategories__categoryTitle--o3o2I"]/text()').getall()
    if len(cuisine_card_sections) >= 2:
        row_cuisine_card = cuisine_card_sections[1]
        if (row_cuisine_overviewcard == "CUISINES"):
            cuisine = \
            response.xpath('//div[@class="restaurants-detail-overview-cards-DetailsSectionOverviewCard__tagText--1XLfi"]/text()')[1]
        elif (row_cuisine_card == "CUISINES"):
            cuisine = \
            response.xpath('//div[@class="restaurants-details-card-TagCategories__tagText--2170b"]/text()')[1]

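Since you only need one piece of information, the code can be tidied up a bit so that it stops as soon as the first XPath check has returned the right answer: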

# extract restaurant cuisine
cuisine = None
# getall() returns a list of strings, so the comparison with "CUISINES" works
cuisine_overviewcard_sections = response.xpath('//div[@class="restaurants-detail-overview-cards-DetailsSectionOverviewCard__categoryTitle--14zKt"]/text()').getall()
if len(cuisine_overviewcard_sections) >= 2 and cuisine_overviewcard_sections[1] == "CUISINES":
    cuisine = \
        response.xpath('//div[@class="restaurants-detail-overview-cards-DetailsSectionOverviewCard__tagText--1XLfi"]/text()')[1]
else:
    # only query the second layout when the first one did not match
    cuisine_card_sections = response.xpath('//div[@class="restaurants-details-card-TagCategories__categoryTitle--o3o2I"]/text()').getall()
    if len(cuisine_card_sections) >= 2 and cuisine_card_sections[1] == "CUISINES":
        cuisine = \
            response.xpath('//div[@class="restaurants-details-card-TagCategories__tagText--2170b"]/text()')[1]

That way you only run the XPath searches (which can be quite expensive) when they are actually necessary.
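The same idea extends to any number of page layouts: try each one in order and stop at the first that matches. The sketch below is one possible shape for that, under stated assumptions: `first_matching_cuisine` and `lookup` are hypothetical names, `lookup` is any callable that takes an XPath string and returns a list of strings, and the short XPaths and values in the demo are placeholders rather than TripAdvisor's real markup.

```python
def first_matching_cuisine(lookup, layouts, label="CUISINES", index=1):
    """Return the tag text for the first layout whose title at `index` equals `label`.

    lookup:  callable taking an XPath string and returning a list of strings.
    layouts: iterable of (title_xpath, tag_xpath) pairs, tried in order.
    """
    for title_xpath, tag_xpath in layouts:
        titles = lookup(title_xpath)
        # Skip this layout unless the title exists at `index` and matches.
        if len(titles) <= index or titles[index] != label:
            continue
        tags = lookup(tag_xpath)
        # Guard the second lookup as well before indexing into it.
        if len(tags) > index:
            return tags[index]
    return None


# Demo: a dict stands in for a page that only has the second layout.
# The XPaths and values here are made up for illustration.
fake_page = {
    "//card/title": ["PRICE RANGE", "CUISINES"],
    "//card/tag": ["$$", "Spanish, Mediterranean"],
}
queries = []


def lookup(xpath):
    queries.append(xpath)  # record which searches actually run
    return fake_page.get(xpath, [])


layouts = [
    ("//overview/title", "//overview/tag"),
    ("//card/title", "//card/tag"),
]
cuisine = first_matching_cuisine(lookup, layouts)
print(cuisine)  # Spanish, Mediterranean
print(queries)  # ['//overview/title', '//card/title', '//card/tag']
```

In a spider, `lookup` could be something like `lambda xp: response.xpath(xp).getall()`, with the two class-based XPath pairs from the question passed as `layouts`. The tag lookup for a layout only runs once its title check has matched, so the tag search for the missing first layout is never executed.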

Scrapy > IndexError: list index out of range
