Scrapy: how do I handle missing data between tags when using CSS selectors?

Time: 2018-05-19 06:28:49

Tags: python web-scraping scrapy css-selectors

I am using Scrapy to scrape data from a list on a website, and I am using CSS selectors.

The data looks like this:

Name : John Doe
Address : Earth
Age : 30

And the HTML structure is:

<li class='title'>
   <span class='q'>Name</span>
   <span class='ans'>John Doe</span>
   <br>
   <span class='q'>Address</span>
   <span class='ans'>Earth</span>
   <br>
   <span class='q'>Age</span>
   <span class='ans'>30</span>
   <br>
</li>

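The problem is that some of the addresses are empty: there is nothing between <span class='ans'></span>. How do I handle this? For an entry whose structure is otherwise correct but whose address is empty, the scraped address should also be empty.

Here is my code: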

class NmcSpider(scrapy.Spider):
    name = 'nmc'
    allowed_domains = ['nmc.org.np']
    start_urls = ['http://nmc.org.np/registered-practitioner.html']

    def parse(self, response):
        self.log('hello' + response.url)
        for title in response.css('li.title'):
            try:
                item = {
                    'name': title.css('span.Ans::text').extract()[0],
                    'address': title.css('span.Ans::text').extract()[1],
                    'gender': title.css('span.Ans::text').extract()[2],
                    'degree': title.css('span.Ans::text').extract()[3],
                    'nmc_no': title.css('span.Ans::text').extract()[4]
                }
            except:
                print("No data")
            yield item

1 Answer:

Answer 0 (score: 0)

If you use XPath expressions, you don't need try/except:

for title in response.css('li.title'):
    item = {
        'name': title.xpath('.//span[.="Name"]/following-sibling::span[1]/text()').extract_first(),
        'address': title.xpath('.//span[.="Address"]/following-sibling::span[1]/text()').extract_first(),
    }

UPDATE: Here is the complete code (tested):

def parse(self, response):
    self.log('hello' + response.url)
    for section in response.xpath('//ul[@id="ImgcategoryCardiologist"]/li'):
        item = {
            'name': section.xpath('.//span[.="Name"]/following-sibling::span[1]/text()').extract_first(),
            'address': section.xpath('.//span[.="Address"]/following-sibling::span[1]/text()').extract_first(),
        }
        yield item