I am scraping data from a list of pages on a website using Scrapy, with CSS selectors.
The data looks like this:
Name : John Doe
Address : Earth
Age : 30
and the HTML structure is:
<li class='title'>
    <span class='q'>Name</span>
    <span class='ans'>John Doe</span>
    <br>
    <span class='q'>Address</span>
    <span class='ans'>Earth</span>
    <br>
    <span class='q'>Age</span>
    <span class='ans'>30</span>
    <br>
</li>
The problem is that some of the addresses are empty, i.e. there is nothing between <span class='ans'></span>.
How do I handle this? For entries with that structure, the scraped address should also be empty.
Here is my code:
import scrapy

class NmcSpider(scrapy.Spider):
    name = 'nmc'
    allowed_domains = ['nmc.org.np']
    start_urls = ['http://nmc.org.np/registered-practitioner.html']

    def parse(self, response):
        self.log('hello ' + response.url)
        for title in response.css('li.title'):
            try:
                item = {
                    'name': title.css('span.Ans::text').extract()[0],
                    'address': title.css('span.Ans::text').extract()[1],
                    'gender': title.css('span.Ans::text').extract()[2],
                    'degree': title.css('span.Ans::text').extract()[3],
                    'nmc_no': title.css('span.Ans::text').extract()[4]
                }
            except:
                print("No data")
            yield item
Answer (score: 0):
If you use XPath expressions instead, there is no need for try/except:
for title in response.css('li.title'):
    item = {
        'name': title.xpath('.//span[.="Name"]/following-sibling::span[1]/text()').extract_first(),
        'address': title.xpath('.//span[.="Address"]/following-sibling::span[1]/text()').extract_first(),
    }
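Since the question asks for the scraped value to be empty when the span has no text, note that extract_first() also accepts a default value, so missing or empty fields can come back as an empty string instead of None. A minimal sketch of that variant (only name and address shown; other fields would follow the same pattern):

for title in response.css('li.title'):
    item = {
        # default='' returns an empty string when the matching span has no text node
        'name': title.xpath('.//span[.="Name"]/following-sibling::span[1]/text()').extract_first(default=''),
        'address': title.xpath('.//span[.="Address"]/following-sibling::span[1]/text()').extract_first(default=''),
    }
    yield item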
UPDATE: here is the complete code (tested):
def parse(self, response):
    self.log('hello ' + response.url)
    for section in response.xpath('//ul[@id="ImgcategoryCardiologist"]/li'):
        item = {
            'name': section.xpath('.//span[.="Name"]/following-sibling::span[1]/text()').extract_first(),
            'address': section.xpath('.//span[.="Address"]/following-sibling::span[1]/text()').extract_first(),
        }
        yield item
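For completeness, a sketch of how that parse method slots back into the spider from the question (spider name, allowed_domains and start_urls are taken from the question; the ul id is the one used in the answer above, and the remaining fields such as gender, degree and nmc_no can be added by matching their label text the same way):

import scrapy

class NmcSpider(scrapy.Spider):
    name = 'nmc'
    allowed_domains = ['nmc.org.np']
    start_urls = ['http://nmc.org.np/registered-practitioner.html']

    def parse(self, response):
        # each <li> under the target <ul> holds one practitioner's label/value spans
        for section in response.xpath('//ul[@id="ImgcategoryCardiologist"]/li'):
            yield {
                'name': section.xpath('.//span[.="Name"]/following-sibling::span[1]/text()').extract_first(),
                'address': section.xpath('.//span[.="Address"]/following-sibling::span[1]/text()').extract_first(),
            }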