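I want to scrape this website:
https://doctor.webmd.com/find-a-doctor/specialty/psychiatry/arizona/phoenix
From that list, I load each doctor's page and scrape its data. The values in the results look fine, but the formatting is off: there is an empty row after every record. This is the code I am using: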
import scrapy


class MainSpiderSpider(scrapy.Spider):
    name = 'main_spider'
    #allowed_domains = [''link'']
    start_urls = ['https://doctor.webmd.com/find-a-doctor/specialty/psychiatry/arizona/phoenix?pagenumber=1']

    def parse(self, response):
        doctors_urls = response.xpath('//*[@class="doctorName"]//@href').extract()
        for doctor in doctors_urls:
            doctor = response.urljoin(doctor)
            print(doctor)
            yield scrapy.Request(url=doctor, callback=self.parse_doctor)

        next_page = response.xpath('//*[@id="next-onRight"]//@href').extract_first()
        if next_page:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(url=next_page, callback=self.parse)

    def parse_doctor(self, response):
        yield {
            "Name": response.xpath('//*[@class="header"]//*[@itemprop="name"]//text()').extract_first(),
            "Speciality": response.xpath('//*[@itemprop="medicalSpecialty"]//*[@itemprop="name"]//text()').extract_first(),
            "Years of experience": response.xpath('//*[@class="profile-content"]//*[@class="subheader content-body years"]//text()').extract_first(),
            "Employer": response.xpath('//*[@class="address"]//*[@class="practice"]//text()').extract_first(),
            "Address": response.xpath('//*[@itemprop="address"]//*[@itemprop="streetaddress"]//text()').extract(),
            "City": response.xpath('//*[@itemprop="address"]//*[@itemprop="addressLocality"]//text()').extract(),
            "Url": response.url,
        }
This is the output I get:
These are the text editor's options:
Besides that, the header appears twice in the first row:
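To make the two symptoms concrete, here is a minimal post-processing sketch using only the standard library. It assumes the output is CSV text, that the empty rows are fully blank lines, and that the duplicated header shows up as a repeated row; none of that is confirmed by the post. The helper name `clean_csv` is made up for illustration, and this only cleans up the file after the fact rather than addressing why the spider's export looks this way.

```python
import csv
import io


def clean_csv(text):
    """Return CSV text with blank rows and repeated header rows removed.

    Hypothetical helper: assumes the first non-blank row is the header.
    """
    # newline="" lets the csv module handle line endings itself.
    reader = csv.reader(io.StringIO(text, newline=""))
    out = io.StringIO(newline="")
    writer = csv.writer(out, lineterminator="\n")

    header = None
    for row in reader:
        # Skip rows that are empty or contain only whitespace cells.
        if not any(cell.strip() for cell in row):
            continue
        if header is None:
            header = row          # first real row is treated as the header
        elif row == header:
            continue              # drop any later copy of the header
        writer.writerow(row)
    return out.getvalue()
```

It could be applied to an exported file by reading it with `open(path, newline="")`, passing the contents to `clean_csv`, and writing the result back out.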