Scrapy / Python - how do I handle missing data?

Asked: 2016-06-24 19:43:09

Tags: python scrapy

I'm writing a Scrapy program that captures social-network profile URLs (Facebook, Twitter, etc.) from pages.

Some of the pages I crawl don't contain these links, so the program needs to be able to handle that.

This line finds the Twitter profile link when it's present on the page, but fails when it isn't:

item['twitterprofileurl'] = startupdetails.xpath("//a[contains(@href,'https://twitter.com') and not(contains(@href,'https://twitter.com/500startups'))]/@href").extract()[0]

How can I change it so the code doesn't fail when the link is missing?

Full code:

import scrapy
from scrapy import Spider
from scrapy.selector import Selector
import datetime
from saas.items import StartupItemTest


class StartupSpider(Spider):
    name = "500cotest"
    allowed_domains = ["500.co"]
    start_urls = [
        "http://500.co/startup/chouxbox/"
    ]

    def parse(self, response):
        startup = Selector(response).xpath('//div[contains(@id, "startup_detail")]')

        for startupdetails in startup:
            item = StartupItemTest()
            item['logo'] = startupdetails.xpath('//img[@class="logo"]/@src').extract()[0]
            item['startupurl'] = startupdetails.xpath('//a[@class="outline"]/@href').extract()[0]
            item['source'] = '500.co'
            item['datetime'] = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
            item['description'] = startupdetails.xpath("//p[@class='description']/text()").extract()[0]

            item['twitterprofileurl'] = startupdetails.xpath("//a[contains(@href,'https://twitter.com') and not(contains(@href,'https://twitter.com/500startups'))]/@href").extract()[0]
            yield item

1 Answer:

Answer 0 (score: 2)

Use the .extract_first() method instead of .extract()[0]. It returns None when there is nothing to extract.

So instead of:

item['twitterprofileurl'] = startupdetails.xpath("<your xpath>").extract()[0]

you would have:

item['twitterprofileurl'] = startupdetails.xpath("<your xpath>").extract_first()
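The difference between the two calls can be sketched in plain Python, with no Scrapy required: .extract() returns a list of all matches, and indexing an empty list with [0] raises IndexError, while .extract_first() returns the first match or a default (None) when there are no matches. The helper below is a hypothetical stand-in that mimics that behavior, not Scrapy's actual implementation:

```python
def extract_first(matches, default=None):
    """Mimics SelectorList.extract_first(): first item, or a default."""
    return matches[0] if matches else default

hits = []  # what .extract() yields when the XPath matches nothing

# hits[0] would raise IndexError: list index out of range
print(extract_first(hits))                           # None
print(extract_first(["https://twitter.com/acme"]))   # the first match
print(extract_first(hits, default=""))               # empty-string fallback
```

Note that Scrapy's real extract_first() also accepts a default= keyword argument, so if you'd rather store an empty string than None in the item, you can write .extract_first(default='').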