Question

I have two functions in Scrapy:

def parse_attr(self, response):
    for resource in response.xpath(''):
        item = Item()
        item['Name'] = response.xpath('').extract()
        item['Title'] = response.xpath('').extract()
        item['Contact'] = response.xpath('').extract()
        item['Gold'] = response.xpath('').extract()
        company_page = response.urljoin(resource.xpath('/div/@href').extract_first())

        if company_page:
            request = scrapy.Request(company_page, callback=self.company_data)
            request.meta['item'] = item
            yield request
        else:
            yield item

def company_data(self, response):
    item = response.meta['item']
    item['Products'] = response.xpath('').extract()
    yield item

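parse_attr extracts an @href from the page, stores it in company_page, and then calls company_data with it. However, this href is not always present. How can I check whether the href exists and, if it does not, stop Scrapy from moving on to the other function?

The code above does not do this, because company_page is always true.

If there is no href, I want it to stop there and just finish with the item it already has. If the href is found, I want Scrapy to move on to the other function and extract the additional fields.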

Answer 1

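response.urljoin() always returns something (the base URL of the request), even when its argument is empty. Your variable therefore always contains a value, so it always evaluates to True.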
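To see this behaviour in isolation, here is a minimal sketch that uses the standard library's urljoin directly rather than Scrapy. It assumes response.urljoin() resolves its argument against the page URL in the same way; the example.com base URL and the variable names are made up for illustration.

```python
from urllib.parse import urljoin

# Hypothetical page URL, standing in for the response's base URL.
base = "https://example.com/companies/"

# A missing href (None, as returned by extract_first() when nothing
# matches) or an empty one still yields the base URL, which is truthy.
joined_none = urljoin(base, None)
joined_empty = urljoin(base, "")

# A real href is resolved against the base as expected.
joined_real = urljoin(base, "/acme")

print(bool(joined_none), bool(joined_empty), joined_real)
```

Because the joined value is never empty, testing it with `if company_page:` cannot tell a missing link from a real one.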

You need to do the URL join inside the condition instead. For example:

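# Extract the raw href first; extract_first() returns None when nothing matches.
company_page = resource.xpath('/div/@href').extract_first()

if company_page:
    # Only build the absolute URL once we know the href exists.
    company_page = response.urljoin(company_page)
    request = scrapy.Request(company_page, callback=self.company_data)
    request.meta['item'] = item
    yield request
else:
    # No company link: emit the item with the fields collected so far.
    yield item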
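
The same check can be pulled out of the spider so it can be run and tested on its own. This is only a sketch: absolute_href is a hypothetical helper name, the example.com URL is an assumption, and the standard library's urljoin stands in for response.urljoin().

```python
from typing import Optional
from urllib.parse import urljoin


def absolute_href(base_url: str, href: Optional[str]) -> Optional[str]:
    """Return an absolute URL, or None when no href was extracted."""
    if not href:
        # Test the raw value before joining, so a missing link stays falsy.
        return None
    return urljoin(base_url, href)


# Hypothetical page URL used only for this demonstration.
page_url = "https://example.com/companies/"

with_link = absolute_href(page_url, "acme/products")
without_link = absolute_href(page_url, None)

print(with_link, without_link)
```

In the spider, a None result would correspond to the else branch, where the item is yielded as it is.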
