I'm writing a Scrapy program to capture social network profile URLs (e.g. Facebook, Twitter, etc.) from pages.
Some of the pages I crawl don't have these links, so the program needs to be able to handle that.
I have this line of code, which finds the Twitter profile link when it is on the page, but fails when it isn't:
item['twitterprofileurl'] = startupdetails.xpath("//a[contains(@href,'https://twitter.com') and not(contains(@href,'https://twitter.com/500startups'))]/@href").extract()[0]
How can I change it so that the code doesn't fail when the link isn't there?
Full code:
import scrapy
from scrapy import Spider
from scrapy.selector import Selector
import datetime

from saas.items import StartupItemTest


class StartupSpider(Spider):
    name = "500cotest"
    allowed_domains = ["500.co"]
    start_urls = [
        "http://500.co/startup/chouxbox/"
    ]

    def parse(self, response):
        startup = Selector(response).xpath('//div[contains(@id, "startup_detail")]')
        for startupdetails in startup:
            item = StartupItemTest()
            item['logo'] = startupdetails.xpath('//img[@class="logo"]/@src').extract()[0]
            item['startupurl'] = startupdetails.xpath('//a[@class="outline"]/@href').extract()[0]
            item['source'] = '500.co'
            item['datetime'] = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
            item['description'] = startupdetails.xpath("//p[@class='description']/text()").extract()[0]
            item['twitterprofileurl'] = startupdetails.xpath("//a[contains(@href,'https://twitter.com') and not(contains(@href,'https://twitter.com/500startups'))]/@href").extract()[0]
            yield item
Answer 0 (score: 2)
Use the .extract_first() method instead of .extract()[0]. It returns None when there is nothing to extract.
So, instead of:
item['twitterprofileurl'] = startupdetails.xpath("<your xpath>").extract()[0]
you would have:
item['twitterprofileurl'] = startupdetails.xpath("<your xpath>").extract_first()
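The difference between the two calls can be sketched in plain Python (a simplified model for illustration, not Scrapy's actual implementation; note that Scrapy's .extract_first() also accepts a default keyword argument for an explicit fallback value):

```python
def extract_first(results, default=None):
    """Return the first extracted result, or `default` when the list is empty,
    mimicking how SelectorList.extract_first() avoids an IndexError."""
    return results[0] if results else default


# A matching link was found: behaves like .extract()[0]
print(extract_first(["https://twitter.com/chouxbox"]))

# No matching link on the page: returns None instead of raising IndexError
print(extract_first([]))

# An explicit fallback, like .extract_first(default='')
print(extract_first([], default=""))
```

With .extract()[0], an empty result list raises IndexError and aborts the callback; .extract_first() simply stores None in the item field, which is usually what you want for optional fields.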