Scrapy / Python - how do I handle missing data?

Asked: 2016-06-24 19:43:09

Tags: python scrapy

I'm writing a Scrapy program that captures social-network profile URLs (Facebook, Twitter, etc.) from pages.

Some of the pages I crawl don't contain these links, so the program needs to be able to handle that.

This line finds the Twitter profile link when it's present on the page, but fails when it isn't:

item['twitterprofileurl'] = startupdetails.xpath("//a[contains(@href,'https://twitter.com') and not(contains(@href,'https://twitter.com/500startups'))]/@href").extract()[0]

How can I change it so the code doesn't fail when the link is missing?

Full code:

import scrapy
from scrapy import Spider
from scrapy.selector import Selector
import datetime
from saas.items import StartupItemTest


class StartupSpider(Spider):
    name = "500cotest"
    allowed_domains = ["500.co"]
    start_urls = [
        "http://500.co/startup/chouxbox/"
    ]

    def parse(self, response):
        startup = Selector(response).xpath('//div[contains(@id, "startup_detail")]')

        for startupdetails in startup:
            item = StartupItemTest()
            item['logo'] = startupdetails.xpath('//img[@class="logo"]/@src').extract()[0]
            item['startupurl'] = startupdetails.xpath('//a[@class="outline"]/@href').extract()[0]
            item['source'] = '500.co'
            item['datetime'] = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
            item['description'] = startupdetails.xpath("//p[@class='description']/text()").extract()[0]

            item['twitterprofileurl'] = startupdetails.xpath("//a[contains(@href,'https://twitter.com') and not(contains(@href,'https://twitter.com/500startups'))]/@href").extract()[0]
            yield item

1 Answer:

Answer 0 (score: 2)

Use the .extract_first() method instead of .extract()[0]. It returns None when there is nothing to extract.

So instead of:

item['twitterprofileurl'] = startupdetails.xpath("<your xpath>").extract()[0]

you would have:

item['twitterprofileurl'] = startupdetails.xpath("<your xpath>").extract_first()
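The difference between the two calls can be sketched in plain Python, with no Scrapy required: .extract() returns a list of all matches, and indexing an empty list with [0] raises IndexError, while .extract_first() returns the first match or a default (None) when there are no matches. The helper below is a hypothetical stand-in that mimics that behavior, not Scrapy's actual implementation:

```python
def extract_first(matches, default=None):
    """Mimics SelectorList.extract_first(): first item, or a default."""
    return matches[0] if matches else default

hits = []  # what .extract() yields when the XPath matches nothing

# hits[0] would raise IndexError: list index out of range
print(extract_first(hits))                           # None
print(extract_first(["https://twitter.com/acme"]))   # the first match
print(extract_first(hits, default=""))               # empty-string fallback
```

Note that Scrapy's real extract_first() also accepts a default= keyword argument, so if you'd rather store an empty string than None in the item, you can write .extract_first(default='').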