Question

I'm getting an error while processing a URL with Scrapy 1.5.0 and Python 2.7.14.

class GoodWillOutSpider(Spider):

    name = "GoodWillOutSpider"
    allowed_domains = ["thegoodwillout.com"]
    start_urls = [GoodWillOutURL]

    def __init__(self):
        logging.critical("GoodWillOut STARTED.")

    def parse(self, response):
        products = Selector(response).xpath('//div[@id="elasticsearch-results-container"]/ul[@class="product-list clearfix"]')

        for product in products:
            item = GoodWillOutItem()
            item['name'] = product.xpath('//div[@class="name ng-binding"]').extract()[0]
            item['link'] = "www.thegoodwillout.com" + product.xpath('//@href').extract()[0]
            # item['image'] = "http:" + product.xpath("/div[@class='catalogue-product-cover']/a[@class='catalogue-product-cover-image']/img/@src").extract()[0]
            # item['size'] = '**NOT SUPPORTED YET**'
            yield item

        yield Request(GoodWillOutURL, callback=self.parse, dont_filter=True, priority=16)

This is my GoodWillOutSpider class, and this is the error I get:

[scrapy.core.scraper] ERROR: Spider error processing <GET https://www.thegoodwillout.com/footwear> (referer: None)

line 1085, in parse
    item['name'] = product.xpath('//div[@class="name ng-binding"]').extract()[0]
IndexError: list index out of range

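I would also like to know how, in the future, I can work out the correct XPath for each site myself without having to ask here again.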

Answer 1

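The problem

If your scraper can't access data that you can see using your browser's developer tools, it is not seeing the same data as your browser.

This can mean one of two things:

Your scraper is being recognized as a scraper and served different content
Some of the content is generated dynamically (usually through JavaScript)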

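The generic solution

The most straightforward way around both of these problems is to use an actual browser.

There are many headless browsers available, and you can choose the one that best fits your needs. For Scrapy, scrapy-splash is probably the simplest option.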
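As a rough illustration of what the scrapy-splash route involves, the fragment below shows the project settings as I recall them from the scrapy-splash README. The local `SPLASH_URL` is an assumption (a Splash instance running on its default port), and the middleware names and priorities should be checked against the version you actually install.

```python
# settings.py fragment for scrapy-splash.
# Values follow the scrapy-splash README as best I recall; verify them
# against the README of the version you install.

# Assumption: a Splash instance is listening locally on its default port.
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Makes duplicate filtering aware of the Splash arguments of a request.
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```

With settings like these in place, the spider would yield `SplashRequest` objects (imported from `scrapy_splash`) instead of plain `Request` objects, usually with a short `wait` argument so the page's scripts have time to run before the HTML is returned.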

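More specialized solutions

Sometimes you can figure out the reason for this different behavior and change your code accordingly. This will usually be the more efficient solution, but it may require more work on your part.

For example, if your scraper is being redirected, you may just need to use a different user agent string, pass some additional headers, or slow down your requests.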
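The three adjustments mentioned above map onto ordinary Scrapy settings. The sketch below is only an example: the setting names are real Scrapy settings, but every value (the user agent string, the header, the delay) is a made-up starting point, and `CUSTOM_SETTINGS` is just a name chosen here.

```python
# Hypothetical values; tune them for the site you are scraping.
CUSTOM_SETTINGS = {
    # Present a browser-like user agent instead of Scrapy's default one.
    'USER_AGENT': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) '
                   'Gecko/20100101 Firefox/60.0'),
    # Extra headers sent with every request.
    'DEFAULT_REQUEST_HEADERS': {
        'Accept-Language': 'en-US,en;q=0.5',
    },
    # Wait (in seconds) between requests to the same site.
    'DOWNLOAD_DELAY': 2,
    # Let Scrapy adapt the delay to how quickly the server responds.
    'AUTOTHROTTLE_ENABLED': True,
}
```

A dictionary like this can be assigned to the spider's `custom_settings` class attribute, or the same keys can be set as plain variables in the project's settings.py.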

If the content is generated by JavaScript, you can look at the page source (response.text, or "view source" in a browser) and figure out what is going on.
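One quick way to do that check is to search the raw source for the class your XPath relies on. The helper below is a heuristic sketch, not a reliable detector: `describe_source` and the sample markup are invented for illustration. It is relevant here because the `ng-binding` class in the question's XPath is one that AngularJS normally adds at runtime, so it is unlikely to appear in the HTML Scrapy downloads.

```python
def describe_source(page_source, class_name):
    """Report whether `class_name` occurs in the raw, un-rendered HTML."""
    if class_name in page_source:
        return 'found in raw source: the XPath itself is probably wrong'
    if '{{' in page_source or 'ng-app' in page_source:
        return 'missing, and the page looks like a client-side template'
    return 'missing: probably loaded by JavaScript or withheld from the scraper'


# Hypothetical page whose product name is filled in by JavaScript after load.
sample = '<div ng-app="shop"><div class="name">{{ product.name }}</div></div>'
print(describe_source(sample, 'ng-binding'))
```

In a spider you would pass `response.text` as the first argument, for example from `scrapy shell`.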

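After that, there are two possibilities:

Extract the data in an alternate way (as gangabass did for your previous question)
Replicate what the JavaScript is doing in your spider code (such as making additional requests, as in the current example)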
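For the second possibility, the extra request made by the page's JavaScript often returns JSON, which is easier to parse than HTML. The sketch below assumes such a response; the `products`, `name` and `url` keys and the sample body are hypothetical, since the real structure has to be read from the browser's network tab.

```python
import json


def items_from_json(body, base_url='https://www.thegoodwillout.com'):
    """Turn a JSON search response into a list of item dicts.

    The 'products', 'name' and 'url' keys are hypothetical; inspect the
    real response to find the actual structure.
    """
    payload = json.loads(body)
    items = []
    for product in payload.get('products', []):
        items.append({
            'name': product.get('name'),
            'link': base_url + product.get('url', ''),
        })
    return items


# Made-up response body, for illustration only.
sample_body = '{"products": [{"name": "Example Shoe", "url": "/example-shoe"}]}'
print(items_from_json(sample_body))
```

In a spider, a function like this would be called from the callback of a `Request` to the JSON endpoint, with `response.text` as the body.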

Answer 2

IndexError: list index out of range

After extracting, you first need to check whether the list has any values

item['name'] = product.xpath('//div[@class="name ng-binding"]').extract()
if item['name']:
    item['name'] = item['name'][0]
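The same check can be wrapped in a small helper so that an empty result yields a default instead of staying an empty list. `first_or_default` is a name made up for this sketch; Scrapy's selector lists also provide `extract_first()`, which as far as I know behaves the same way and returns None when nothing matches.

```python
def first_or_default(values, default=None):
    """Return the first element of `values`, or `default` if it is empty."""
    return values[0] if values else default


print(first_or_default(['Example Shoe']))
print(first_or_default([], default='unknown'))
```

Note that this only avoids the exception. If the list is empty because the content is generated by JavaScript, the item will still have no name.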
