Filtering results from a web-scraped table

Time: 2020-08-11 22:01:42

Tags: python web-scraping

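I am using Python to scrape the data table found here. Specifically, I want to extract the company name, URL, owner name, street, city, and phone number. After running the page through Beautiful Soup and splitting the result, the list I need to filter looks like this: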

['a href="?listingid=9758&profileid=217Y3Q544Y&action=uweb&url=http%3a%2f%2f**www.jpspa.com**" target="_BLANK"', '**Johnson Price Sprinkle PA**', '/a', '', '/b', '', '/td', '', '/tr', '', '/table', '', '/td', '', '/tr', '', 'tr class="GeneralBody"', '', 'td bgcolor="#808080" height="1"', '', 'img border="0" height="1" src="images/dot_clear.gif" width="1" /', '', '/td', '', '/tr', '', '/table', '', '/td', '', '/tr', '', 'tr class="GeneralBody"', '', 'td align="left" valign="top" width="90%"', '**Maria Pilos**', '', '', '**79 Woodfin Place, Suite 300**', '', '', '**Asheville, NC 28801**', '', '', '', 'b', 'Phone:', '/b', '**(828) 254-2374**', '', '', 'b', 'Fax:', '/b', '(828) 252-9994', '', '', 'a href="DirectoryEmailForm.aspx?listingid=9758"', 'Send Email', '/a', '', '/td', '', 'td align="right" rowspan="3" valign="top" width="10%"', '', 'span style="font-size:8pt"', 'a href="?', '!-- End Listing --', '', '/td']

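I have bolded the values I want returned and worked out their positions in the list. The code I use to filter them is below. temp_array is the list shown above, temp_count is the current position in that list, and business_listings is the list I append values to as I find them. Basically, when temp_count equals the position of a wanted value, that value should be appended to business_listings.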

temp_count = 0
for i in temp_array:
    if temp_count == 0:
        business_listings.append(i)
        temp_count += 1
    elif temp_count == 2:
        business_listings.append(i)
        temp_count += 1
    elif temp_count == 19:
        business_listings.append(i)
        temp_count += 1
    elif temp_count == 19:
        business_listings.append(i)
        temp_count += 1
    elif temp_count == 20:
        business_listings.append(i)
        temp_count += 1
    elif temp_count == 23:
        business_listings.append(i)
        temp_count += 1
    elif temp_count == 27:
        business_listings.append(i)
        temp_count += 1
    elif temp_count == 42:
        business_listings.append(i)
        temp_count += 1
else:
    count += 1

The output is: ['a href="?listingid=9758&profileid=2B713K5Z48&action=uweb&url=http%3a%2f%2fwww.jpspa.com" target="_BLANK"'], so it filters only the first two values, or nothing at all.

1 answer:

Answer 0 (score: 0)

This script prints the name, URL, owner, phone number, and address of each business in the directory:

import requests
from bs4 import BeautifulSoup


url = 'https://web.ashevillechamber.org/cwt/external/wcpages/wcdirectory/Directory.aspx?CategoryID=1242&Title=Accounting++and++Bookkeeping&AdKeyword=Accounting++and++Bookkeeping'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')


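# Each business name sits in a <b> tag inside a header cell with this background colour.
for b in soup.select('td[bgcolor="#E6E6E6"] b'):
    business_name = b.text
    # Not every listing links to a website.
    business_url = b.a['href'] if b.a else '-'
    # The first text node of the next 90%-wide cell is the owner's name.
    owner = b.find_next('td', width="90%").contents[0]

    # Walk the following text nodes until one sits inside a <b> tag
    # (the "Phone:" label in the listings shown); those are the address lines.
    addr, current = [], owner.find_next(text=True)
    while not current.find_parent('b'):
        addr.append(current.strip())
        current = current.find_next(text=True)

    addr = '\n'.join(addr)
    # The text node right after that label is the phone number.
    phone = current.find_next(text=True).strip()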

    print('Business Name :', business_name)
    print('Business URL  :', business_url)
    print('Owner         :', owner)
    print('Phone         :', phone)
    print('Address:')
    print(addr)
    print('-' * 80)

Prints:

Business Name : Johnson Price Sprinkle PA
Business URL  : ?listingid=9758&profileid=2D7R3B5E4N&action=uweb&url=http%3a%2f%2fwww.jpspa.com
Owner         : Maria Pilos
Phone         : (828) 254-2374
Address:
79 Woodfin Place, Suite 300
Asheville, NC  28801
--------------------------------------------------------------------------------
Business Name : Leah B. Noel, CPA, PC
Business URL  : ?listingid=9656&profileid=549S620J3J&action=uweb&url=http%3a%2f%2fwww.lbnoelcpa.com%2f
Owner         : Ms. Leah Noel
Phone         : 828-333-4529
Address:
14 S. Pack Square #503
Asheville, NC  28801
--------------------------------------------------------------------------------
Business Name : Worley, Woodbery, & Associates, PA
Business URL  : ?listingid=9661&profileid=3L7R304J8X&action=uweb&url=http%3a%2f%2fwww.worleycpa.com%2f
Owner         : Mr. David Worley
Phone         : (828) 271-7997
Address:
7 Orchard Street, Ste. 202
Asheville, NC  28801
--------------------------------------------------------------------------------
Business Name : Peridot Consulting, Inc.
Business URL  : ?listingid=14005&profileid=7L724E5W7E&action=uweb&url=http%3a%2f%2fwww.PeridotConsultingInc.com
Owner         : John Michael  Kledis
Phone         : (828) 242-6971
Address:
PO Box 8904
Asheville, NC  28804
--------------------------------------------------------------------------------
Business Name : DHG
Business URL  : ?listingid=9579&profileid=25711D625I&action=uweb&url=http%3a%2f%2fwww.dhgllp.com%2f
Owner         : Adrienne Bernardi
Phone         : (828) 254-2254
Address:
PO Box 3049
Asheville, NC  28802
--------------------------------------------------------------------------------
Business Name : Gould Killian CPA Group, P.A.
Business URL  : ?listingid=9659&profileid=2P7X216Y66&action=uweb&url=http%3a%2f%2fwww.gk-cpa.com
Owner         : Ed Towson
Phone         : (828) 258-0363
Address:
100 Coxe Avenue
Asheville, NC  28801
--------------------------------------------------------------------------------
Business Name : Michelle Tracz CPA, CFE, PLLC
Business URL  : ?listingid=12921&profileid=610C8H3I7N&action=uweb&url=http%3a%2f%2fwww.michelletraczcpa.com
Owner         : Michelle Tracz
Phone         : (828) 280-2530
Address:
1238 Hendersonville Rd.
Asheville, NC  28803
--------------------------------------------------------------------------------
Business Name : Burleson & Earley, P.A.
Business URL  : ?listingid=10436&profileid=57132N5P9C&action=uweb&url=http%3a%2f%2fwww.burlesonearley.com%2f
Owner         : Bronwyn Burleson, CPA
Phone         : (828) 251-2846
Address:
902 Sand Hill Road
Asheville, NC  28806
--------------------------------------------------------------------------------
Business Name : Carol L. King & Associates, P.A.
Business URL  : ?listingid=10439&profileid=2Z8C7I0B4X&action=uweb&url=http%3a%2f%2fwww.clkcpa.com
Owner         : Carol King
Phone         : (828) 258-2323
Address:
40 North French Broad Avenue
Asheville, NC  28801
--------------------------------------------------------------------------------
Business Name : Goldsmith Molis & Gray
Business URL  : ?listingid=12638&profileid=6C8D2C7F55&action=uweb&url=http%3a%2f%2fwww.gmg-cpa.com
Owner         : Allen Gray
Phone         : (828) 281-3161
Address:
32 Orange St.
Asheville, NC  28801
--------------------------------------------------------------------------------
Business Name : Corliss & Solomon, PLLC
Business URL  : ?listingid=12407&profileid=6T7Y798S1R&action=uweb&url=http%3a%2f%2fwww.candspllc.com
Owner         : Slater Solomon
Phone         : (828) 236-0206
Address:
242 Charlotte St., Suite 1
Asheville, NC  28801
--------------------------------------------------------------------------------
Business Name : Mountain BizWorks
Business URL  : ?listingid=12733&profileid=2L9E9G6A1S&action=uweb&url=http%3a%2f%2fwww.mountainbizworks.org
Owner         : Matthew Raker
Phone         : (828) 253-2834
Address:
153 South Lexington Ave.
Asheville, NC  28801
--------------------------------------------------------------------------------
Business Name : LeBlanc CPA Limited
Business URL  : -
Owner         : Leslie LeBlanc
Phone         : (828) 225-4940
Address:
218 Broadway
Asheville, NC  28801-2347
--------------------------------------------------------------------------------
Business Name : Bolick & Associates, PA, CPA's
Business URL  : -
Owner         : Alan E Bolick, CPA
Phone         : (828) 253-4692
Address:
Central Office Park   Suite 104
56 Central Avenue
Asheville, NC  28801
--------------------------------------------------------------------------------
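
A note on the positional filter from the question: as posted, `temp_count` is advanced only inside the branches that match, so once it reaches 1 no branch fires and it never changes again. The trailing `else` also appears to belong to the `for` loop and increments a different variable, `count`. That would explain why only the first element comes back. A minimal sketch of position-based filtering with `enumerate` follows; the shortened `temp_array` and the positions in `wanted_positions` are hypothetical stand-ins, not the real list or its real indices.

```python
# Hypothetical, shortened stand-in for temp_array; the real list is much longer.
temp_array = [
    'a href="?url=http%3a%2f%2fwww.jpspa.com"',
    'Johnson Price Sprinkle PA',
    '/a',
    '',
    'Maria Pilos',
    'br',
    ' (828) 254-2374 ',
]

# Positions of the wanted values in this stand-in list (assumed for illustration).
wanted_positions = {0, 1, 4, 6}

business_listings = []
for position, value in enumerate(temp_array):
    # enumerate advances the position on every iteration,
    # whether or not the current value is kept.
    if position in wanted_positions:
        business_listings.append(value.strip())

print(business_listings)
```

Selecting by fixed position is fragile here, since a listing with an extra address line or without a website shifts every later index. The script above avoids that by navigating the tags instead.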

Edit: parsing the URL:

import requests
from bs4 import BeautifulSoup
from urllib.parse import unquote


url = 'https://web.ashevillechamber.org/cwt/external/wcpages/wcdirectory/Directory.aspx?CategoryID=1242&Title=Accounting++and++Bookkeeping&AdKeyword=Accounting++and++Bookkeeping'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')


for b in soup.select('td[bgcolor="#E6E6E6"] b'):
    business_name = b.text
    business_url = b.a['href'] if b.a else '-'
    owner = b.find_next('td', width="90%").contents[0]

    addr, current = [], owner.find_next(text=True)
    while not current.find_parent('b'):
        addr.append(current.strip())
        current = current.find_next(text=True)

    addr = '\n'.join(addr)
    phone = current.find_next(text=True).strip()

    print('Business Name :', business_name)
    # Decode the percent-encoded href and keep what follows the last '=' (the target site).
    print('Business URL  :', unquote(business_url).rsplit('=', maxsplit=1)[-1])
    print('Owner         :', owner)
    print('Phone         :', phone)
    print('Address:')
    print(addr)
    print('-' * 80)

Prints:

Business Name : Johnson Price Sprinkle PA
Business URL  : http://www.jpspa.com
Owner         : Maria Pilos
Phone         : (828) 254-2374
Address:
79 Woodfin Place, Suite 300
Asheville, NC  28801
--------------------------------------------------------------------------------
Business Name : Leah B. Noel, CPA, PC
Business URL  : http://www.lbnoelcpa.com/
Owner         : Ms. Leah Noel
Phone         : 828-333-4529
Address:
14 S. Pack Square #503
Asheville, NC  28801
--------------------------------------------------------------------------------

...and so on.
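
The `rsplit('=', maxsplit=1)` call above keeps whatever follows the last `=`, which would cut a target URL short if that URL carried its own query string. As a hedged alternative, the sketch below reads the `url` parameter with `urllib.parse.parse_qs`. `extract_target_url` is a hypothetical helper name, and the third sample link is made up for illustration.

```python
from urllib.parse import urlsplit, parse_qs


def extract_target_url(href, default='-'):
    """Return the decoded `url` query parameter of a directory link, or `default`."""
    query = urlsplit(href).query
    # parse_qs percent-decodes the values and maps each key to a list of values.
    values = parse_qs(query).get('url')
    return values[0] if values else default


# Link taken from the output above.
href = '?listingid=9758&profileid=2D7R3B5E4N&action=uweb&url=http%3a%2f%2fwww.jpspa.com'
print(extract_target_url(href))

# Listings without a website use '-' as the placeholder in the script above.
print(extract_target_url('-'))

# Made-up link whose target URL contains its own '='.
tricky = '?action=uweb&url=http%3a%2f%2fexample.com%2f%3fid%3d5'
print(extract_target_url(tricky))
```

In the loop above, this would replace the `unquote(...).rsplit(...)` expression with `extract_target_url(business_url)`.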