I am trying to scrape data from a particular website, but it fails. The reason is that the data is wrapped in a complex HTML structure.
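Pharmacy Name: Albert County Pharmacy
Pharmacy Manager: Chelsea Steeves
Certificate of Operation Number: P107
Address: 5883 King Street Riverside-Albert NB E4H 4B5
Phone: (506) 882-2226
Fax: (506) 882-2101
Website: albertcountypharmacy.ca
My code does not produce this result. Please suggest the best approach.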
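Answer 0 (score: 1)
If you explore the page hierarchy, you should be able to find the data you need, especially by looking at the ids, divs, and tables. See one option below.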
import bs4
import requests

myUrl = "https://www.nbpharmacists.ca/site/findpharmacy"
data = requests.get(myUrl)
soup = bs4.BeautifulSoup(data.text, 'html.parser')
# Every pharmacy is a table inside the div with id "rosterRecords"
roster = soup.find('div', attrs={'id': 'rosterRecords'})
tables = roster.find_all('table')
result = []  # initialize a list for all results
for table in tables:
    # First cell: manager and certificate number in one paragraph
    info = table.find('td').find('p').text.strip()
    certificate = info.split('Certificate of Operation Number:')[-1].strip()
    manager = info.split('Pharmacy Manager:')[1]\
        .split('Certificate of Operation Number:')[0].strip()
    # Last cell: address, followed by phone and fax
    addr = table.find_all('td')[-1].text.strip()
    phone = addr.split('Phone:')[-1].split('Fax:')[0].strip()
    fax = addr.split('Fax:')[1].strip().split('\n')[0].strip()
    address = addr.split('Phone:')[0].strip()
    res = {
        'Pharmacy Name': table.find('h2').find('span').text.strip(),
        'Certificate of Operation Number': certificate,
        'Pharmacy Manager': manager,
        'Phone Number': phone,
        'Fax Number': fax,
        'Address': address,
    }
    try:
        res['website'] = table.find_all('td')[-1].find('a').get('href')
    except AttributeError:
        # No link in the last cell, so the pharmacy lists no website
        res['website'] = None
    result.append(res)  # append pharmacy info
print(result[0])
Out[25]:
{'Pharmacy Name': 'Albert County Pharmacy',
'Certificate of Operation Number': 'P107',
'Pharmacy Manager': 'Chelsea Steeves',
'Phone Number': '(506) 882-2226',
'Fax Number': '(506) 882-2101',
'Address': '5883 King Street \nRiverside-Albert NB E4H 4B5',
'website': 'http://albertcountypharmacy.ca'}
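The chained split calls above raise an IndexError when a label such as "Fax:" is missing from a record. As a rough sketch only, the same label-based extraction could be wrapped in a small helper that returns None instead. The helper name extract_field, the LABELS tuple, and the sample strings are all hypothetical; the strings are only shaped like the text the answer splits, not copied from the live page.

```python
def extract_field(text, label, next_labels=()):
    """Return the text after `label`, cut at the first following label.

    Returns None when `label` does not occur in `text`.
    """
    _, found, rest = text.partition(label)
    if not found:
        return None
    # Cut at the earliest position where any other label starts
    cut = len(rest)
    for nxt in next_labels:
        pos = rest.find(nxt)
        if pos != -1:
            cut = min(cut, pos)
    return rest[:cut].strip()


# Hypothetical sample text (assumption: mirrors the cell text used in the answer)
info = "Pharmacy Manager: Chelsea Steeves Certificate of Operation Number: P107"
addr = ("5883 King Street \nRiverside-Albert NB E4H 4B5 "
        "Phone: (506) 882-2226 Fax: (506) 882-2101")

LABELS = ("Pharmacy Manager:", "Certificate of Operation Number:",
          "Phone:", "Fax:", "Website:")

manager = extract_field(info, "Pharmacy Manager:", LABELS)
certificate = extract_field(info, "Certificate of Operation Number:", LABELS)
phone = extract_field(addr, "Phone:", LABELS)
fax = extract_field(addr, "Fax:", LABELS)
website = extract_field(addr, "Website:", LABELS)  # label absent in this sample

print(manager, certificate, phone, fax, website)
```

Inside the loop, the helper would be called on info and addr for each table, so a record without a fax number yields None rather than stopping the whole scrape.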
Answer 1 (score: 0)
One possible version of the scraping script:
import bs4
import requests

myUrl = "https://www.nbpharmacists.ca/site/findpharmacy"
data = requests.get(myUrl)
soup = bs4.BeautifulSoup(data.text, 'html.parser')
rows = []
# Note: newer soupsieve releases deprecate ':contains' in favor of
# ':-soup-contains', and newer bs4 releases prefer string=True over text=True.
for i, tr in enumerate(soup.select('.roster_tbl tr'), 1):
    title = tr.h2.strong.text.strip()
    # The value is the text node that follows each bold label
    manager = tr.select_one('strong:contains("Pharmacy Manager:")').find_next_sibling(text=True).strip()
    certificate = tr.select_one('strong:contains("Certificate of Operation Number:")').find_next_sibling(text=True).strip()
    address = ' '.join(div.text.strip() for div in tr.select('td:last-child div'))
    # Phone, fax and website are optional; use '-' when one is missing
    phone = tr.select_one('span:contains("Phone:")')
    if phone:
        phone = phone.find_next_sibling(text=True).strip()
    else:
        phone = '-'
    fax = tr.select_one('span:contains("Fax:")')
    if fax:
        fax = fax.find_next_sibling(text=True).strip()
    else:
        fax = '-'
    website = tr.select_one('strong:contains("Website:") + a[href]')
    if website:
        website = website['href']
    else:
        website = '-'
    print('** Pharmacy no.{} **'.format(i))
    print('Title:', title)
    print('Pharmacy Manager:', manager)
    print('Certificate of Operation Number:', certificate)
    print('Address:', address)
    print('Phone:', phone)
    print('Fax:', fax)
    print('Website:', website)
    print('*' * 80)
Prints:
** Pharmacy no.1 **
Title: Albert County Pharmacy
Pharmacy Manager: Chelsea Steeves
Certificate of Operation Number: P107
Address: 5883 King Street Riverside-Albert NB E4H 4B5
Phone: (506) 882-2226
Fax: (506) 882-2101
Website: http://albertcountypharmacy.ca
********************************************************************************
** Pharmacy no.2 **
Title: Bay Pharmacy
Pharmacy Manager: Mark Barry
Certificate of Operation Number: P157
Address: 5447 Route 117 Baie Ste Anne NB E9A 1E5
Phone: (506) 228-3880
Fax: (506) 228-3716
Website: -
********************************************************************************
** Pharmacy no.3 **
Title: Bayshore Pharmacy
Pharmacy Manager: Curtis Saunders
Certificate of Operation Number: P295
Address: 600 Main Street Suite C 150 Saint John NB E2K 1J5
Phone: (506) 799-4920
Fax: (855) 328-4736
Website: http://Bayshore Specialty Pharmacy
********************************************************************************
...and so on.
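Note that the Website value for pharmacy no.3 in the output above ("http://Bayshore Specialty Pharmacy") is not a usable URL, because the href on the page apparently holds a name rather than an address. A minimal sketch of a post-processing check, assuming you would rather store None than a broken link: the function name clean_website and its rules (http or https scheme, no whitespace, a dotted host name) are my own assumptions, not something the site or the answer defines.

```python
from urllib.parse import urlsplit


def clean_website(value):
    """Return `value` if it looks like an http(s) URL, otherwise None."""
    # '-' is the placeholder the script above uses for a missing website
    if not value or value == '-':
        return None
    # A real URL cannot contain raw whitespace
    if any(ch.isspace() for ch in value):
        return None
    try:
        parts = urlsplit(value)
    except ValueError:
        return None
    if parts.scheme not in ('http', 'https'):
        return None
    # Require a host name with at least one dot, e.g. example.ca
    if not parts.hostname or '.' not in parts.hostname:
        return None
    return value


# Values taken from the printed output above
for raw in ('http://albertcountypharmacy.ca', '-',
            'http://Bayshore Specialty Pharmacy'):
    print(repr(raw), '->', clean_website(raw))
```

In the scraping loop this could be applied as website = clean_website(website) just before printing or storing the record.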