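I have this code, which scrapes a few hundred pages for me. Sometimes, though, the XPath for a matches nothing at all, and extract()[0] then raises an IndexError. How can I edit the code so that the script does not stop, but keeps running and gives me just b for that particular page?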
`a = response.xpath("//div[@class='headerDiv']/a/@title").extract()[0]
b = response.xpath("//div[@class='headerDiv']/text()").extract()[0].strip()
items['title'] = a + " " + b
yield items`
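Answer 0 (score: 1)
Just check the result of extract() before indexing into it. It returns a list, which is empty when the XPath matches nothing.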
nodes = response.xpath("//div[@class='headerDiv']/a/@title").extract()
a = nodes[0] if nodes else ""
nodes = response.xpath("//div[@class='headerDiv']/text()").extract()
b = nodes[0].strip() if nodes else ""
items['title'] = a + " " + b
yield items
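The guard above can also be factored into a small helper, which is convenient when many fields need the same treatment. This is only a sketch: `first_or_default` and `build_title` are hypothetical names, not part of Scrapy, and plain Python lists stand in for the lists that `.extract()` returns.

```python
def first_or_default(nodes, default=""):
    """Return the first element of `nodes`, or `default` when the list is empty."""
    return nodes[0] if nodes else default


def build_title(a_nodes, b_nodes):
    """Join the optional link title and the div text, skipping whichever is missing."""
    a = first_or_default(a_nodes).strip()
    b = first_or_default(b_nodes).strip()
    return " ".join(part for part in (a, b) if part)


# Plain lists stand in for the results of .extract() on each XPath.
print(build_title(["Hello"], ["  World "]))  # link title present
print(build_title([], ["  World "]))         # link title missing
```

Joining only the non-empty parts avoids the stray leading space that `a + " " + b` would leave when a is missing.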
Following a good suggestion from Padraic Cunningham, extract_first() with a default makes this shorter:
a = response.xpath("//div[@class='headerDiv']/a/@title").extract_first(default='')
b = response.xpath("//div[@class='headerDiv']/text()").extract_first(default='').strip()
items['title'] = (a + " " + b).strip()
yield items
Answer 1 (score: 0)
You can do it as follows:
import lxml.etree as etree

# `data` is assumed to hold the page source, e.g. response.body in a Scrapy callback.
# XMLParser expects well-formed markup; for real-world HTML, etree.HTMLParser may be needed instead.
parser = etree.XMLParser(strip_cdata=False, remove_comments=True)
root = etree.fromstring(data, parser)
# xpath() returns a list of matches, so take index 0 only when the list is non-empty.
a = root.xpath("//div[@class='headerDiv']/a/@title")
b = root.xpath("//div[@class='headerDiv']/text()")
b_text = b[0].strip() if b else ""
if a:
    items['title'] = a[0].strip() + " " + b_text
else:
    items['title'] = b_text
yield items
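To see the "check before indexing" idea end to end without Scrapy or lxml installed, here is a minimal sketch using the standard library's `xml.etree.ElementTree` as a stand-in. It supports only a subset of XPath, so the link and the text are read from the element rather than through `/@title` and `text()`. The two sample pages and the name `extract_title` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Two hypothetical pages: one has the <a title=...> inside the div, the other does not.
PAGE_WITH_LINK = (
    "<html><body><div class='headerDiv'>Some text "
    "<a title='Link title'>x</a></div></body></html>"
)
PAGE_WITHOUT_LINK = "<html><body><div class='headerDiv'> Only text </div></body></html>"


def extract_title(page_source):
    """Build the title from the link's title attribute (if any) and the div's leading text."""
    root = ET.fromstring(page_source)
    div = root.find(".//div[@class='headerDiv']")
    if div is None:
        return ""  # no header div at all on this page
    link = div.find("a")
    a = link.get("title", "") if link is not None else ""
    b = (div.text or "").strip()  # div.text is the text before the first child element
    return (a + " " + b).strip()


for page in (PAGE_WITH_LINK, PAGE_WITHOUT_LINK):
    print(repr(extract_title(page)))
```

Neither page raises an exception; the page without the link simply yields the div text alone.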