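I'm not very good at Python. I want to scrape data from a website, specifically the data in a table, and save it as txt/xls.
I wrote a script, and it works fine until it reaches an entry that has no data.
Website: bizearch.com
My Python script stops at this entry: www.bizearch.com/company/Russell_Metal_Products_Inc_125558.htm
I am using CentOS, Python, and BeautifulSoup.
My script: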
#!/usr/bin/env python
#
from bs4 import BeautifulSoup
import urllib

# Labels to pull from each company profile, in output column order.
getInfo = ['Company Name', 'Contact Person', 'Company Address', 'Postal Code', 'Telephone Number', 'Mobile Number', 'Fax Number', 'Website', 'Business Type', 'Business Role']
flushData = {}

# Header row, pipe-separated.
print "|".join(getInfo)

for Page in range(1, 900):
    # Fetch and parse one listing page.
    pageData = urllib.urlopen("http://www.bizearch.com/company/Electrical_Equipment~Supplies.8-%d.htm" % (Page))
    html = pageData.read()
    parsed_html = BeautifulSoup(html)
    for Row in parsed_html.body.findAll('div', attrs={'class': 'ls'}):
        # Follow the link to the company's profile page.
        profileURL = Row.find('a').get('href')
        profileURLHTML = urllib.urlopen(profileURL)
        profileURLHTML = BeautifulSoup(profileURLHTML)
        finalData = []
        # Read the label/value rows of the profile table.
        for Details in profileURLHTML.body.find('div', attrs={'id': 'yellowpage'}).findAll('tr'):
            if Details.find('th').text in getInfo:
                flushData[Details.find('th').text] = Details.find('td').text
        # One pipe-separated output line per company.
        flushDataPrint = "|".join(flushData[name] for name in getInfo)
        print flushDataPrint
I'm new to this site, so apologies if I've missed anything.
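
For reference, here is a minimal sketch of how an entry with missing data might be handled. It is not a confirmed fix: it assumes the script stops either because a table row lacks a `th` or `td` cell (BeautifulSoup's `find` returns `None` when nothing matches) or because a field never shows up for a company. The names `build_record` and `format_row` and the sample data are hypothetical, and stripping whitespace from labels and values is an extra assumption. The sketch works on plain (label, value) pairs so it can be run without the network.

```python
# Hypothetical helpers; the field list mirrors getInfo from the script.
FIELDS = ['Company Name', 'Contact Person', 'Company Address', 'Postal Code',
          'Telephone Number', 'Mobile Number', 'Fax Number', 'Website',
          'Business Type', 'Business Role']


def build_record(pairs, fields=FIELDS, default=''):
    """Return the values for one profile, in the order given by `fields`.

    `pairs` is an iterable of (label, value) tuples. A pair whose label or
    value is None (for example, a row with no <th> or <td> cell) is skipped.
    """
    record = {}  # fresh dict per profile, so nothing leaks between companies
    for label, value in pairs:
        if label is None or value is None:
            continue
        label = label.strip()
        if label in fields:
            record[label] = value.strip()
    # Fields that never appeared fall back to `default` instead of raising.
    return [record.get(name, default) for name in fields]


def format_row(values, sep='|'):
    """Join the values into one pipe-separated output line."""
    return sep.join(values)


# Made-up sample: one value is missing and one row has no label.
sample = [
    ('Company Name', 'Example Co'),
    ('Telephone Number', None),
    (None, 'stray cell'),
    ('Website', ' http://example.com '),
]
line = format_row(build_record(sample))
print(line)
```

In the script, the pairs could be collected inside the `for Details in ...` loop by storing the results of `Details.find('th')` and `Details.find('td')` and passing `None` whenever either lookup returns `None`, otherwise the `.text` of each.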