Question

I am using Scrapy to scrape data from a list on a website, and I am using CSS selectors.

The data looks like this:

Name : John Doe
Address : Earth
Age : 30

And the HTML structure is:

<li class='title'>
   <span class='q'>Name</span>
   <span class='ans'>John Doe</span>
   <br>
   <span class='q'>Address</span>
   <span class='ans'>Earth</span>
   <br>
   <span class='q'>Age</span>
   <span class='ans'>30</span>
   <br>
</li>

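The problem is that some of the addresses are empty: there is nothing between <span class='ans'></span>. How do I handle this? For an entry that has the proper structure but an empty address, the scraped address should also be empty.

Here is my code: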

import scrapy


class NmcSpider(scrapy.Spider):
    name = 'nmc'
    allowed_domains = ['nmc.org.np']
    start_urls = ['http://nmc.org.np/registered-practitioner.html']

    def parse(self, response):
        self.log('hello' + response.url)
        for title in response.css('li.title'):
            try:
                item = {
                    'name': title.css('span.Ans::text').extract()[0],
                    'address': title.css('span.Ans::text').extract()[1],
                    'gender': title.css('span.Ans::text').extract()[2],
                    'degree': title.css('span.Ans::text').extract()[3],
                    'nmc_no': title.css('span.Ans::text').extract()[4]
                }
            except:
                print("No data")
            yield item

Answer 1

If you use XPath expressions, you don't need try/except:

for title in response.css('li.title'):
    item = {
        'name': title.xpath('.//span[.="Name"]/following-sibling::span[1]/text()').extract_first(),
        'address': title.xpath('.//span[.="Address"]/following-sibling::span[1]/text()').extract_first(),
    }

Update: here is the complete code (tested):

def parse(self, response):
    self.log('hello' + response.url)
    for section in response.xpath('//ul[@id="ImgcategoryCardiologist"]/li'):
        item = {
            'name': section.xpath('.//span[.="Name"]/following-sibling::span[1]/text()').extract_first(),
            'address': section.xpath('.//span[.="Address"]/following-sibling::span[1]/text()').extract_first(),
        }
        yield item

Scrapy: how do I handle bad data between tags using CSS selectors?