Question

I am trying to scrape a website and collect product details. Some products have a unit and some do not. The structure looks like this:

For products that have a unit:

<div class="unit">
<p>200ml</p>
</div>

For products without a unit:

<div class = "unit">
    <p></p>
</div>

My spider looks like this:

def product(self, response):
        products = response.xpath('descendant::*[@class="product_list_ul"]')
        item = Item()
        i = 0
        while i < 20:
            item['link'] = products.xpath(
                'descendant::*[@class="product-image"]//a/@href').extract()[i]
            item['name'] = products.xpath(
                'descendant::*[@class="product-name"]//a/@title').extract()[i]
            item['unit'] = products.xpath(
                    'descendant::*[@class="unit"]/p/text()').extract()[i]
            item['price'] = products.xpath(
                'descendant::*[@class="price"]/text()').extract()[i]
            item['image_url'] = products.xpath(
                'descendant::*[@class="product-image"]//a//img/@src').extract()[i]
            i += 1
            yield item

But there is a problem.

products.xpath('descendant::*[@class="unit"]/p/text()').extract()

returns results only for the products that have a unit. For example, if there are 5 products like this:

p1: N/A

p2: 200ml

p3: 60gm

p4: 5ml

p5: N/A

For these, I get the list [200ml, 60gm, 5ml], so I end up with an "index out of range" error.

Can someone suggest a way to solve this so that I get the list [N/A, 200ml, 60gm, 5ml, N/A]?
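
To make the mismatch concrete, here is a minimal sketch that uses Python's standard-library xml.etree.ElementTree as a stand-in for Scrapy's selectors. The markup is a hypothetical reconstruction of the five products above, not the real page. It shows that collecting only the text nodes drops the empty entries, while visiting every <p> and substituting a placeholder keeps one entry per product.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup mirroring the five products listed above.
HTML = """
<ul class="product_list_ul">
  <li><div class="unit"><p></p></div></li>
  <li><div class="unit"><p>200ml</p></div></li>
  <li><div class="unit"><p>60gm</p></div></li>
  <li><div class="unit"><p>5ml</p></div></li>
  <li><div class="unit"><p></p></div></li>
</ul>
""".strip()

root = ET.fromstring(HTML)
paragraphs = root.findall(".//div[@class='unit']/p")

# Keeping only the <p> elements that contain text (roughly what
# selecting ".../p/text()" yields) loses the two empty products.
texts_only = [p.text for p in paragraphs if p.text]

# Visiting every <p> and falling back to a placeholder keeps the
# list aligned with the products.
aligned = [p.text or "N/A" for p in paragraphs]

print(texts_only)  # ['200ml', '60gm', '5ml']
print(aligned)     # ['N/A', '200ml', '60gm', '5ml', 'N/A']
```

Indexing the shorter list with a counter that runs over all products is what produces the "index out of range" error.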

Edit: After some research I found a way to do this, but it only works in the Scrapy shell.

[txt for item in sel.xpath('descendant::*[@class="litre"]/p') for txt in item.select('text()').extract() or [u'N/A']]

It gives me the list I want. I made the following edits to incorporate it into my Python script.

def unit_xpath(self, product):
        x = [txt for i in sel.xpath('descendant::*[@class="litre"]/p') for txt in i.select('text()').extract() or [u'n/a']]
        return x



def product(self, response):
     products = response.xpath('descendant::*[@class="product_list_ul"]')
     item = ForestessentialsItem()
     i = 0
     while i < 20:
         item['link'] = products.xpath('descendant::*[@class="product-image"]//a/@href').extract()[i]
         item['name'] = products.xpath('descendant::*[@class="product-name"]//a/@title').extract()[i]
         item['unit'] = self.unit_xpath(products)[i]
         item['price'] = products.xpath('descendant::*[@class="price"]/text()').extract()[i]
         item['image_url'] = products.xpath('descendant::*[@class="product-image"]//a//img/@src').extract()[i]
         i += 1
         yield item

I get the error NameError: global name 'sel' is not defined. Can someone tell me how to proceed from here?
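
A likely reading of that error: sel is a convenience object that exists only inside a Scrapy shell session, so a spider method has to work on the selector it receives as an argument instead. The sketch below is a hedged illustration of that idea. The function name unit_texts and its default parameter are hypothetical, and it assumes the class is "unit" as in the markup shown earlier (the shell snippet uses "litre").

```python
def unit_texts(products, default=u"N/A"):
    """Return one unit string per <p>, using `default` for empty ones.

    `products` is expected to be a Scrapy selector (or selector list);
    nothing here relies on the shell-only name `sel`.
    """
    units = []
    # Walk the <p> elements themselves, not their text nodes, so that
    # empty paragraphs are still visited.
    for p in products.xpath('descendant::*[@class="unit"]/p'):
        texts = p.xpath('text()').extract()
        # An empty <p> yields an empty list; substitute the placeholder.
        units.extend(texts or [default])
    return units
```

Inside the spider this would be called as unit_texts(products), passing the same selector that the other fields are read from.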

Answer 1

There is a slight flaw in the spider's logic. Usually you can get a list of product selectors and simply iterate over them, like this:

def product(self, response):
    # [1] "//" abbreviates "/descendant-or-self::node()/", so "//*" can replace "descendant::*" here
    products = response.xpath('//*[@class="product_list_ul"]')
    for prod in products:
        item = Item()
        item['link'] = prod.xpath('.//*[@class="product-image"]//a/@href').extract()
        item['name'] = prod.xpath('.//*[@class="product-name"]//a/@title').extract()
        item['unit'] = prod.xpath('.//*[@class="unit"]/p/text()').extract()
        item['price'] = prod.xpath('.//*[@class="price"]/text()').extract()
        item['image_url'] = prod.xpath('.//*[@class="product-image"]//a//img/@src').extract()
        yield item

If you can provide a URL, I can give a more specific example.

[1] - More XPath shortcuts and explanations: https://our.umbraco.org/wiki/reference/xslt/xpath-axes-and-their-shortcuts/
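
As a hedged extension of the per-product loop, the missing unit can be handled at the point of extraction. The sketch below assumes a Scrapy version whose selector lists provide extract_first(default=...), and it assumes each product is an <li> child of the list container. That product_xpath value is a guess, since the real page structure is not shown. The function name iter_products is also hypothetical.

```python
def iter_products(response, product_xpath='//*[@class="product_list_ul"]/li'):
    """Yield one dict per product node matched by `product_xpath`.

    `product_xpath` is an assumption about the page layout; adjust it so
    that it matches one node per product.
    """
    for prod in response.xpath(product_xpath):
        yield {
            # Relative queries (leading ".") stay inside this product.
            'name': prod.xpath(
                './/*[@class="product-name"]//a/@title').extract_first(),
            # An empty <p> has no text node, so the default is used.
            'unit': prod.xpath(
                './/*[@class="unit"]/p/text()').extract_first(default=u'N/A'),
            'price': prod.xpath(
                './/*[@class="price"]/text()').extract_first(),
        }
```

Because every field is read relative to its own product node, a missing value can no longer shift the values of the products that follow it.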

Answer 2

So I found a way to do this.

def unit_xpath(self, response):
        x = [txt for item in response.xpath('descendant::*[@class="unit"]/p') for txt in item.xpath('text()').extract() or [u'N/A']]
        return x

def product(self, response):
        products = response.xpath('descendant::*[@class="product_list_ul"]')
        item = Item()
        i = 0
        while i < 20:
            item['link'] = products.xpath(
                'descendant::*[@class="product-image"]//a/@href').extract()[i]
            item['name'] = products.xpath(
                'descendant::*[@class="product-name"]//a/@title').extract()[i]
            # Use the helper above so that empty units become u'N/A'
            item['unit'] = self.unit_xpath(products)[i]
            item['price'] = products.xpath(
                'descendant::*[@class="price"]/text()').extract()[i]
            item['image_url'] = products.xpath(
                'descendant::*[@class="product-image"]//a//img/@src').extract()[i]
            i += 1
            yield item

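Thanks to everyone who helped me. Thanks also to @Granitosaurus. I know your approach of handling the products as groups is better, but this fits my use case.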
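
For anyone who keeps this list-based approach, one possible refinement is to drop the hard-coded count of 20 and pair the lists with zip(), creating a fresh dict for each product. This is only a sketch: iter_items and unit_default are hypothetical names, and it assumes a Scrapy version that provides extract_first(default=...).

```python
def iter_items(products, unit_default=u"N/A"):
    """Yield one dict per product from parallel lists of field values."""
    links = products.xpath(
        'descendant::*[@class="product-image"]//a/@href').extract()
    names = products.xpath(
        'descendant::*[@class="product-name"]//a/@title').extract()
    prices = products.xpath(
        'descendant::*[@class="price"]/text()').extract()
    # Exactly one unit per <p>, with a placeholder for empty paragraphs.
    units = [
        p.xpath('text()').extract_first(default=unit_default)
        for p in products.xpath('descendant::*[@class="unit"]/p')
    ]
    # zip() stops at the shortest list, so no index can run out of range.
    for link, name, unit, price in zip(links, names, units, prices):
        yield {'link': link, 'name': name, 'unit': unit, 'price': price}
```

The parallel lists stay aligned only if every product has each of the other fields. If that is not guaranteed, iterating per product is the safer route.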

scrapy
