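How can I extract data from a website using Beautiful Soup?

Time: 2019-07-23 17:05:43

Tags: python beautifulsoup

I am trying to scrape data from a particular website, but so far without success. The reason is that the data is wrapped in a complex HTML structure.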

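Expected result:

Pharmacy Name: Albert County Pharmacy

Pharmacy Manager: Chelsea Steeves

Certificate of Operation Number: P107

Address: 5883 King Street Riverside-Albert NB E4H 4B5

Phone: (506) 882-2226

Fax: (506) 882-2101

Website: albertcountypharmacy.ca

Conclusion

My code does not give me the result I want. Please suggest the best solution.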

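2 answers:

Answer 0 (score: 1)

You should be able to find the answer just by exploring the page hierarchy, paying particular attention to ids, divs, and tables. One option is shown below.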


import bs4
import requests

myUrl = "https://www.nbpharmacists.ca/site/findpharmacy"
data = requests.get(myUrl)
soup = bs4.BeautifulSoup(data.text, 'html.parser')

# The pharmacy records are tables inside <div id="rosterRecords">
roster = soup.find('div', attrs={'id': 'rosterRecords'})
tables = roster.find_all('table')

result = []  # one dict per pharmacy

for table in tables:
    # First cell: manager and certificate number as labelled text
    info = table.find('td').find('p').text.strip()
    certificate = info.split('Certificate of Operation Number:')[-1].strip()
    manager = info.split('Pharmacy Manager:')[1]\
                    .split('Certificate of Operation Number:')[0].strip()

    # Last cell: address, followed by the phone and fax numbers
    addr = table.find_all('td')[-1].text.strip()
    phone = addr.split('Phone:')[-1].split('Fax:')[0].strip()
    fax = addr.split('Fax:')[1].strip().split('\n')[0].strip()
    address = addr.split('Phone:')[0].strip()

    res = {
        'Pharmacy Name': table.find('h2').find('span').text.strip(),
        'Certificate of Operation Number': certificate,
        'Pharmacy Manager': manager,
        'Phone Number': phone,
        'Fax Number': fax,
        'Address': address,
    }

    # The website link is optional; find('a') returns None when it is absent
    try:
        res['website'] = table.find_all('td')[-1].find('a').get('href')
    except AttributeError:
        res['website'] = None
    result.append(res)  # add this pharmacy's record

print(result[0])

Out[25]: 
{'Pharmacy Name': 'Albert County Pharmacy',
 'Certificate of Operation Number': 'P107',
 'Pharmacy Manager': 'Chelsea Steeves',
 'Phone Number': '(506) 882-2226',
 'Fax Number': '(506) 882-2101',
 'Address': '5883 King Street \nRiverside-Albert NB E4H 4B5',
 'website': 'http://albertcountypharmacy.ca'}
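
The chained `split` calls in the loop above can also be written as one small helper that cuts the text at each known label. This is only a sketch: `split_labelled_text`, `LABELS`, and the sample string are made up for illustration, and it assumes each label appears at most once in the form `Label: value`.

```python
import re

# Labels as they appear in the scraped text (taken from the answer above).
LABELS = ("Pharmacy Manager:", "Certificate of Operation Number:")


def split_labelled_text(text, labels):
    """Map each label (without its trailing colon) to the text that follows it."""
    pattern = "|".join(re.escape(label) for label in labels)
    # A capturing group makes re.split keep the labels in the result:
    # [prefix, label1, value1, label2, value2, ...]
    parts = re.split(f"({pattern})", text)
    fields = {}
    for label, value in zip(parts[1::2], parts[2::2]):
        # Collapse line breaks and repeated spaces inside the value.
        fields[label.rstrip(":")] = " ".join(value.split())
    return fields


# Hypothetical sample, shaped like the text of the first table cell.
sample = "Pharmacy Manager: Chelsea Steeves\nCertificate of Operation Number: P107"
print(split_labelled_text(sample, LABELS))
# {'Pharmacy Manager': 'Chelsea Steeves', 'Certificate of Operation Number': 'P107'}
```

The same helper could be called with `("Phone:", "Fax:")` on the text of the last cell; anything before the first label (the address here) is ignored by the helper and would need to be taken separately.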

Answer 1 (score: 0)

One possible version of the scraping script:

import bs4
import requests

myUrl = "https://www.nbpharmacists.ca/site/findpharmacy"
data=requests.get(myUrl)
soup=bs4.BeautifulSoup(data.text,'html.parser')

rows = []
for i, tr in enumerate(soup.select('.roster_tbl tr'), 1):
    title = tr.h2.strong.text.strip()
    manager = tr.select_one('strong:contains("Pharmacy Manager:")').find_next_sibling(text=True).strip()
    certificate = tr.select_one('strong:contains("Certificate of Operation Number:")').find_next_sibling(text=True).strip()
    address = ' '.join(div.text.strip() for div in tr.select('td:last-child div'))

    phone = tr.select_one('span:contains("Phone:")')
    if phone:
        phone = phone.find_next_sibling(text=True).strip()
    else:
        phone = '-'

    fax = tr.select_one('span:contains("Fax:")')
    if fax:
        fax = fax.find_next_sibling(text=True).strip()
    else:
        fax = '-'

    website = tr.select_one('strong:contains("Website:") + a[href]')
    if website:
        website = website['href']
    else:
        website = '-'

    print('** Pharmacy no.{} **'.format(i))
    print('Title:', title)
    print('Pharmacy Manager:', manager)
    print('Certificate of Operation Number:', certificate)
    print('Address:', address)
    print('Phone:', phone)
    print('Fax:', fax)
    print('Website:', website)
    print('*' * 80)

Prints:

** Pharmacy no.1 **
Title: Albert County Pharmacy
Pharmacy Manager: Chelsea Steeves
Certificate of Operation Number: P107
Address: 5883 King Street Riverside-Albert NB E4H 4B5
Phone: (506) 882-2226
Fax: (506) 882-2101
Website: http://albertcountypharmacy.ca
********************************************************************************
** Pharmacy no.2 **
Title: Bay Pharmacy
Pharmacy Manager: Mark Barry
Certificate of Operation Number: P157
Address: 5447 Route 117 Baie Ste Anne NB E9A 1E5
Phone: (506) 228-3880
Fax: (506) 228-3716
Website: -
********************************************************************************
** Pharmacy no.3 **
Title: Bayshore Pharmacy
Pharmacy Manager: Curtis Saunders
Certificate of Operation Number: P295
Address: 600 Main Street Suite C 150 Saint John NB E2K 1J5
Phone: (506) 799-4920
Fax: (855) 328-4736
Website: http://Bayshore Specialty Pharmacy
********************************************************************************

...and so on.
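
Recent releases of Beautiful Soup's selector engine (soupsieve) deprecate `:contains` in favour of `:-soup-contains`, and `string=` is now preferred over `text=`. The label lookup from this answer can be sketched with the newer spellings as follows. The HTML snippet and the `text_after` helper are hypothetical, modelled on the selectors used in the script above; the real page may be laid out differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics one row of the roster table;
# it is NOT copied from the real site.
SAMPLE_HTML = """
<table class="roster_tbl">
  <tr>
    <td>
      <h2><strong>Albert County Pharmacy</strong></h2>
      <p><strong>Pharmacy Manager:</strong> Chelsea Steeves<br>
         <strong>Certificate of Operation Number:</strong> P107</p>
    </td>
    <td>
      <div>5883 King Street</div>
      <div>Riverside-Albert NB E4H 4B5</div>
      <span>Phone:</span> (506) 882-2226<br>
    </td>
  </tr>
</table>
"""


def text_after(row, css):
    """Return the text node following the element matched by `css`, or None."""
    label = row.select_one(css)
    if label is None:
        return None
    sibling = label.find_next_sibling(string=True)
    return sibling.strip() if sibling else None


row = BeautifulSoup(SAMPLE_HTML, "html.parser").select_one(".roster_tbl tr")

manager = text_after(row, 'strong:-soup-contains("Pharmacy Manager:")')
phone = text_after(row, 'span:-soup-contains("Phone:")')
# The sample row has no fax entry, so fall back to the '-' placeholder.
fax = text_after(row, 'span:-soup-contains("Fax:")') or "-"

print(manager, phone, fax, sep=" | ")
```

Wrapping the lookup in one helper also removes the repeated `if ... else` blocks: a missing label simply yields `None`, which the caller can replace with a placeholder.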