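I have a website where each person's details are stored in a separate .html file. In total, the details of 100 people are stored in 100 different .html files, all with the same HTML structure.
This is the website link: http://www.coimbatore.com/doctors/home.htm
As you can see, the site has many categories, and all the doctors' .html files are in the same directory. For example,
http://www.coimbatore.com/doctors/cardiology.htm
has links to 5 doctors. If I click any doctor's name, it takes me to http://www.coimbatore.com/doctors/thatdoctorname.htm. So all the files are in the same directory, /doctors/, if I am not mistaken. How do I scrape each doctor's details?
I plan to wget all the files from the http://www.coimbatore.com/doctors/ URL, save them locally, and merge them into a single whole.html file using the join function in Linux. Is there a better way?
Update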
    letters = ['doctor1','doctor2'...]
    for i in range(30):
        try:
            page = urllib2.urlopen("http://www.coimbatore.com/doctors/{}.htm".format(letters[i]))
        except urllib2.HTTPError:
            continue
        else:
Answer 0 (score: 3)
This code should get you started.
    import urllib2
    from bs4 import BeautifulSoup

    doctors = ['thomas']
    for doctor in doctors:
        try:
            page = urllib2.urlopen("http://www.coimbatore.com/doctors/{}.htm".format(doctor))
            soup = BeautifulSoup(page)
        except urllib2.HTTPError:
            # Skip doctors whose page cannot be fetched.
            continue
        # Each row of the details table holds a label cell and a value cell.
        rows = soup.find("table", cellspacing=0).find_all('tr')
        for row in rows:
            cols = row.find_all('td')
            print "%s: %s" % (cols[0].get_text().replace('\n', ' '), cols[1].get_text().replace('\n', ' '))
The output is:
    Name of Doctor: Dr.Thomas Alexander
    Qualification: M.D (Internal Medicine), D.M. (Cardiology)
    Fellowship & Membership: Fellow of Indian College of Cardiology Associate Fellow of American College of Cardiology
    Address of Clinic / Visiting Hospitals: Kovai Medical Center and Hospital, P.B.No.3209, Avanashi Road, Coimbatore-641 014
    Telephone Number: +91-422-827784
    Consulting Hours: 8am - 5pm
    Specialist in: Senior Consultant and Interventional Cardiologist
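A few caveats that you may want to handle differently. I replaced all newlines (\n) with spaces because the page's HTML has odd line breaks like this:
    <td><b><font face="Arial,Helvetica"><font color="#0000FF"><font size=-1>Name
    of Doctor</font></font></font></b></td>
Note that it forces a line break between Name and of.
If you are trying to create a CSV from this, you can easily modify the script to pull only the second cell of each row.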
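As a rough sketch of that CSV idea: once the two cells of each row have been read as text, only the second one is kept and written out with the standard csv module. This is written for Python 3 and uses hand-made (label, value) pairs in place of the cols[0] / cols[1] text from the script, so the helper names and the sample data are assumptions, not part of the original answer:

```python
import csv
import io


def second_cells(rows):
    """Keep only the value cell of each (label, value) pair.

    Runs of whitespace, including the odd newlines, are collapsed to one space.
    """
    return [" ".join(value.split()) for _label, value in rows]


def to_csv(records):
    """Write one CSV line per doctor, where each record is a list of pairs."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for rows in records:
        writer.writerow(second_cells(rows))
    return buf.getvalue()


# Hypothetical stand-in for the text of the two <td> cells in each table row.
sample = [
    ("Name\nof Doctor", "Dr.Thomas Alexander"),
    ("Consulting Hours", "8am -\n5pm"),
]
csv_text = to_csv([sample])
print(csv_text)
```

Using csv.writer instead of joining with commas by hand matters here because several values, such as the clinic address, contain commas themselves and need quoting.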
Answer 1 (score: 3)
One approach is to use Scrapy:
Create the project:
    scrapy startproject doctors && cd doctors
Define the data to load (items.py):
    from scrapy.item import Item, Field

    class DoctorsItem(Item):
        doctor_name = Field()
        qualification = Field()
        membership = Field()
        visiting_hospitals = Field()
        phone = Field()
        consulting_hours = Field()
        specialist_in = Field()
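Create the spider. The basic template seems like it should do the job:
    scrapy genspider -t basic doctors_spider 'coimbatore.com'
Change it to return Request objects until it reaches each page that contains a doctor's information: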
    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from doctors.items import DoctorsItem
    from scrapy.http import Request
    from urlparse import urljoin

    class DoctorsSpiderSpider(BaseSpider):
        name = "doctors_spider"
        allowed_domains = ["coimbatore.com"]
        start_urls = [
            'http://www.coimbatore.com/doctors/home.htm'
        ]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            # Doctor pages: one table row per field, with the value in the second cell.
            for row in hxs.select('/html/body/center[1]/table[@cellpadding = 0]'):
                i = DoctorsItem()
                i['doctor_name'] = '|'.join(row.select('./tr[1]/td[2]//font[@size = -1]/text()').extract()).replace('\n', ' ')
                i['qualification'] = '|'.join(row.select('./tr[2]/td[2]//font[@size = -1]/text()').extract()).replace('\n', ' ')
                i['membership'] = '|'.join(row.select('./tr[3]/td[2]//font[@size = -1]/text()').extract()).replace('\n', ' ')
                i['visiting_hospitals'] = '|'.join(row.select('./tr[4]/td[2]//font[@size = -1]/text()').extract()).replace('\n', ' ')
                i['phone'] = '|'.join(row.select('./tr[5]/td[2]//font[@size = -1]/text()').extract()).replace('\n', ' ')
                i['consulting_hours'] = '|'.join(row.select('./tr[6]/td[2]//font[@size = -1]/text()').extract()).replace('\n', ' ')
                i['specialist_in'] = '|'.join(row.select('./tr[7]/td[2]//font[@size = -1]/text()').extract()).replace('\n', ' ')
                yield i

            # Follow the links found on the page and parse each target the same way.
            for url in hxs.select('/html/body/center[3]//a/@href').extract():
                yield Request(urljoin(response.url, url), callback=self.parse)

            for url in hxs.select('/html/body//a/@href').extract():
                yield Request(urljoin(response.url, url), callback=self.parse)
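The spider resolves every href against response.url with urljoin before requesting it. A minimal sketch of that resolution step, written for Python 3 (where urljoin lives in urllib.parse instead of the Python 2 urlparse module); the href values are assumed examples, since the actual links on the category pages are not shown here:

```python
from urllib.parse import urljoin

# URL of the page the links were found on.
base = "http://www.coimbatore.com/doctors/cardiology.htm"

# Hypothetical href values, as they might appear in a category page.
relative_link = "thomas.htm"
absolute_link = "http://www.coimbatore.com/doctors/home.htm"

# A relative link replaces the last path segment of the base URL.
resolved_relative = urljoin(base, relative_link)
# An absolute link is returned unchanged.
resolved_absolute = urljoin(base, absolute_link)

print(resolved_relative)
print(resolved_absolute)
```

Because both kinds of link come out as full URLs, the same Request(...) call can be used whether the page author wrote relative or absolute links.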
Run it like this:
    scrapy crawl doctors_spider -o doctors.csv -t csv
This will create a csv file like:
phone,membership,visiting_hospitals,qualification,specialist_in,consulting_hours,doctor_name
(H)00966 4 6222245|(R)00966 4 6230143 ,,Domat Al Jandal Hospital|Al Jouf |Kingdom Of Saudi Arabia ,"MBBS, MS, MCh ( Cardio-Thoracic)",Cardio Thoracic Surgery,,Dr. N. Rajaratnam
210075,FRCS(Edinburgh) FIACS,"SRI RAMAKRISHNA HOSPITAL|CHEST CLINIC,COWLEY BROWN ROAD,R.S.PURAM,CBE-2","MD.,DPPR.,FACP",PULMONOLOGY/ RESPIRATORY MEDICINE,"9-1, 5-8",DR.T.MOHAN KUMAR
+91-422-827784-827790,Member -IAPMR,"Kovai Medical Center & Hospital, Avanashi Road,|Coimbatore-641 014","M.B.B.S., Dip.in. Physical Medicine & Rehabilitation","Neck and Back pain, Joint pain, Amputee Rehabilitation,|Spinal cord Injuries & Stroke",9.00am to 5.00pm (Except Sundays),Dr.Edmund M.D'Couto
+91-422-303352,*********,"206, Puliakulam Road, Coimbatore-641 045","M.B.B.S., M.D., D.V.",Sexually Transonitted Diseases.,5.00pm - 7.00pm,Dr.M.Govindaswamy
...
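The columns in the exported file do not follow the order in which the fields were declared in items.py, so reading the file by column name is safer than reading it by position. A minimal Python 3 sketch with csv.DictReader, using the header and one row copied from the sample output; splitting on | simply undoes the '|'.join(...) in the spider, and the variable names are illustrative only:

```python
import csv
import io

# Header and one data row copied from the sample doctors.csv output.
sample = (
    "phone,membership,visiting_hospitals,qualification,specialist_in,consulting_hours,doctor_name\n"
    '210075,FRCS(Edinburgh) FIACS,"SRI RAMAKRISHNA HOSPITAL|CHEST CLINIC,COWLEY BROWN ROAD,R.S.PURAM,CBE-2",'
    '"MD.,DPPR.,FACP",PULMONOLOGY/ RESPIRATORY MEDICINE,"9-1, 5-8",DR.T.MOHAN KUMAR\n'
)

# DictReader uses the first line as the field names.
records = list(csv.DictReader(io.StringIO(sample)))
first = records[0]

name = first["doctor_name"]
# The spider joined multiple text nodes with "|", so split them apart again.
address_parts = [part.strip() for part in first["visiting_hospitals"].split("|")]

print(name)
print(address_parts)
```

For the real file, io.StringIO(sample) would be replaced by an open file handle on doctors.csv.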