Suppose my HTML page looks like this:
...
<a class="hehe"><span>joke23</span></a>
<a class="hrtojoke" href="link/to/joke23"></a>
<a class="hehe"><span>joke24</span></a>
<a class="hehe"><span>joke25</span></a>
<a class="hrtojoke" href="link/to/joke25"></a>
...
As you can see, there is no link for joke24 ;)
I want each joke paired with its link. If a joke has no link, I want its link set to None.
My code:
...
def parse(self, response):
    for joke, link in itertools.zip_longest(response.css('a.hehe'), response.css('a.hrtojoke')):
        yield {
            'name_joke': joke.xpath('span/text()').extract_first(),
            'link_joke': link.css('::attr(href)').extract_first(),
        }
...
As you might guess, this code runs but does not produce the right result.
Current output:
...
{'name_joke': 'joke23', 'link_joke': 'link/to/joke23'}
{'name_joke': 'joke25', 'link_joke': 'link/to/joke25'}
error..
...
Expected output:
{'name_joke': 'joke23', 'link_joke': 'link/to/joke23'}
{'name_joke': 'joke24', 'link_joke': None}
{'name_joke': 'joke25', 'link_joke': 'link/to/joke25'}
How can I achieve this?
Answer 0: (score: 3)
Try this:
def parse(self, response):
    for item in response.xpath('//*[@class="hehe"]'):
        joke = item.xpath('./span/text()').extract_first()
        # Take only the element right after this joke, and only if it is a link element
        link = item.xpath('./following-sibling::*[1][@class="hrtojoke"]/@href').extract_first()
        yield {'name_joke': joke, 'link_joke': link}
Output:
{'name_joke': 'joke23', 'link_joke': 'link/to/joke23'}
{'name_joke': 'joke24', 'link_joke': None}
{'name_joke': 'joke25', 'link_joke': 'link/to/joke25'}
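The XPath above looks, for each joke element, only at the element immediately following it (`[1]`) and keeps it only if its class is `hrtojoke`. As a rough illustration of the same pairing logic outside Scrapy, here is a minimal sketch using only the standard library's `xml.etree.ElementTree`. The wrapping `<div>`, the `SAMPLE` constant, and the function name `pair_jokes` are assumptions made for this example, and the approach requires well-formed markup, which real pages often are not.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup; the <div> wrapper is added so it parses as XML.
SAMPLE = """
<div>
<a class="hehe"><span>joke23</span></a>
<a class="hrtojoke" href="link/to/joke23"></a>
<a class="hehe"><span>joke24</span></a>
<a class="hehe"><span>joke25</span></a>
<a class="hrtojoke" href="link/to/joke25"></a>
</div>
"""


def pair_jokes(markup):
    """Pair each a.hehe with the element right after it, if that is a.hrtojoke."""
    children = list(ET.fromstring(markup.strip()))
    items = []
    for i, node in enumerate(children):
        if node.get("class") != "hehe":
            continue
        # The immediately following sibling, or None at the end of the list.
        nxt = children[i + 1] if i + 1 < len(children) else None
        # Compare with None explicitly: an Element without children is falsy.
        if nxt is not None and nxt.get("class") == "hrtojoke":
            link = nxt.get("href")
        else:
            link = None
        items.append({"name_joke": node.findtext("span"), "link_joke": link})
    return items


for item in pair_jokes(SAMPLE):
    print(item)
```

Because the lookup is relative to each joke rather than positional across two separate lists, a missing link for one joke does not shift the links of the jokes after it.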
Answer 1: (score: 0)
Just use try-except to catch the exception. *Remember to catch the specific exception.
import itertools  # at module level

def parse(self, response):
    for joke, link in itertools.zip_longest(response.css('a.hehe'), response.css('a.hrtojoke')):
        name_joke = joke.xpath('span/text()').extract_first()
        try:
            link_joke = link.css('::attr(href)').extract_first()
        except AttributeError:  # link is None when zip_longest pads the shorter list
            link_joke = None
        yield {
            'name_joke': name_joke,
            'link_joke': link_joke
        }
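To see what the try-except is guarding against, it may help to look at what `itertools.zip_longest` produces on its own. The sketch below uses plain strings in place of Scrapy selectors (the `jokes` and `links` lists are stand-ins invented for this example). It shows that the shorter list is padded with None, that calling `.css` on that None raises AttributeError, and that the pairing is purely positional.

```python
import itertools

# Stand-ins for the two selector lists: three jokes, but only two links.
jokes = ["joke23", "joke24", "joke25"]
links = ["link/to/joke23", "link/to/joke25"]

# zip_longest pairs items by position and pads the shorter list with None.
pairs = list(itertools.zip_longest(jokes, links))
for joke, link in pairs:
    print(joke, link)

# The padded value is None, so calling a selector method on it fails.
padded = pairs[-1][1]
try:
    padded.css("::attr(href)")
    caught = False
except AttributeError:
    caught = True
print("AttributeError raised:", caught)
```

Note that in this sketch joke24 ends up paired with joke25's link, so catching the exception avoids the crash but does not by itself produce the expected output from the question.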