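I am using Python to web-scrape the data table found here. Specifically, I want to extract the company name, URL, owner name, street, city, and phone number. After running the page through Beautiful Soup and splitting it, the list I need to filter looks like this: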
['\\\',\\\'href="?listingid=9758&profileid=217Y3Q544Y&action=uweb&url=http%3a%2f%2f**www.jpspa.com**" target="_BLANK"', '**Johnson Price Sprinkle PA**', '/a', '', '/b', '', '/td', '', '/tr', '', '/table', '', '/td', '', '/tr', '', 'tr class="GeneralBody"', '', 'td bgcolor="#808080" height="1"', '', 'img border="0" height="1" src="images/dot_clear.gif" width="1"/', '', '/td', '', '/tr', '', '/table', '', '/td', '', '/tr', '', 'tr class="GeneralBody"', '', 'td align="left" valign="top" width="90%"', '**Maria Pilos**', '', '', '**79 Woodfin Place, Suite 300**', '', '', '**Asheville, NC 28801**', '', '', '', 'b', 'Phone:', '/b', '**(828) 254-2374**', '', '', '', 'b', 'Fax:', '/b', '(828) 252-9994', '\',\'', '\\\',\\\'href="DirectoryEmailForm.aspx?listingid=9758"', 'Send Email', '/a', '', '/td', '', 'td align="right" rowspan="3" valign="top" width="10%"', '', 'span style="font-size:8pt"', '\\\',\\\'href="?', '!-- End Listing --', '', '/td']
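I have bolded the values I want returned and worked out their positions in the list. The code that filters them is below. temp_array is the list shown above, temp_count is the current position in that list, and business_listings is the list I append each value to once it is found. Basically, when temp_count equals the position of a wanted value, that value is appended to business_listings.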
temp_count=0
for i in temp_array:
    if temp_count ==0:
        business_listings.append(i)
        temp_count+=1
    elif temp_count ==2:
        business_listings.append(i)
        temp_count+=1
    elif temp_count ==19:
        business_listings.append(i)
        temp_count+=1
    elif temp_count ==19:
        business_listings.append(i)
        temp_count+=1
    elif temp_count ==20:
        business_listings.append(i)
        temp_count+=1
    elif temp_count ==23:
        business_listings.append(i)
        temp_count+=1
    elif temp_count ==27:
        business_listings.append(i)
        temp_count+=1
    elif temp_count ==42:
        business_listings.append(i)
        temp_count+=1
    else:
        count+=1
The output is as follows: ['\\\',\\\'href="?listingid=9758&profileid=2B713K5Z48&action=uweb&url=http%3a%2f%2fwww.jpspa.com" target="_BLANK"'] and it only filters the first two values, or nothing at all.
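Reading the loop as posted, one likely cause is that the else branch increments count rather than temp_count. After the first item is appended, temp_count becomes 1, no branch matches 1, and temp_count never advances again, so nothing after the first item can match. The following is a minimal sketch of position-based filtering that avoids a hand-maintained counter; the helper name pick_positions and the short sample list are hypothetical stand-ins, not part of the original code.

```python
def pick_positions(items, wanted_positions):
    """Return the items whose index is in wanted_positions, in list order."""
    wanted = set(wanted_positions)
    # enumerate advances the index on every iteration, so it cannot stall
    return [item for index, item in enumerate(items) if index in wanted]


# Hypothetical stand-in for temp_array; the real list is much longer.
sample = ['href="?listingid=9758"', '', 'Johnson Price Sprinkle PA', '/a', 'Maria Pilos']
business_listings = pick_positions(sample, [0, 2, 4])
print(business_listings)
```

With the real temp_array, the second argument would be the positions listed in the loop above. Fixed positions remain fragile, since a listing with an extra address line shifts every later index.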
Answer 0 (score: 0)
This script prints the information for each business:
import requests
from bs4 import BeautifulSoup

url = 'https://web.ashevillechamber.org/cwt/external/wcpages/wcdirectory/Directory.aspx?CategoryID=1242&Title=Accounting++and++Bookkeeping&AdKeyword=Accounting++and++Bookkeeping'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# Each business name sits in a <b> inside a grey header cell
for b in soup.select('td[bgcolor="#E6E6E6"] b'):
    business_name = b.text
    business_url = b.a['href'] if b.a else '-'
    # The owner name is the first node of the next 90%-wide cell
    owner = b.find_next('td', width="90%").contents[0]
    # Collect address text nodes until reaching one inside a <b> element
    addr, current = [], owner.find_next(text=True)
    while not current.find_parent('b'):
        addr.append(current.strip())
        current = current.find_next(text=True)
    addr = '\n'.join(addr)
    # The phone number is the text node right after that bold label
    phone = current.find_next(text=True).strip()
    print('Business Name :', business_name)
    print('Business URL :', business_url)
    print('Owner :', owner)
    print('Phone :', phone)
    print('Address:')
    print(addr)
    print('-' * 80)
Prints:
Business Name : Johnson Price Sprinkle PA
Business URL : ?listingid=9758&profileid=2D7R3B5E4N&action=uweb&url=http%3a%2f%2fwww.jpspa.com
Owner : Maria Pilos
Phone : (828) 254-2374
Address:
79 Woodfin Place, Suite 300
Asheville, NC 28801
--------------------------------------------------------------------------------
Business Name : Leah B. Noel, CPA, PC
Business URL : ?listingid=9656&profileid=549S620J3J&action=uweb&url=http%3a%2f%2fwww.lbnoelcpa.com%2f
Owner : Ms. Leah Noel
Phone : 828-333-4529
Address:
14 S. Pack Square #503
Asheville, NC 28801
--------------------------------------------------------------------------------
Business Name : Worley, Woodbery, & Associates, PA
Business URL : ?listingid=9661&profileid=3L7R304J8X&action=uweb&url=http%3a%2f%2fwww.worleycpa.com%2f
Owner : Mr. David Worley
Phone : (828) 271-7997
Address:
7 Orchard Street, Ste. 202
Asheville, NC 28801
--------------------------------------------------------------------------------
Business Name : Peridot Consulting, Inc.
Business URL : ?listingid=14005&profileid=7L724E5W7E&action=uweb&url=http%3a%2f%2fwww.PeridotConsultingInc.com
Owner : John Michael Kledis
Phone : (828) 242-6971
Address:
PO Box 8904
Asheville, NC 28804
--------------------------------------------------------------------------------
Business Name : DHG
Business URL : ?listingid=9579&profileid=25711D625I&action=uweb&url=http%3a%2f%2fwww.dhgllp.com%2f
Owner : Adrienne Bernardi
Phone : (828) 254-2254
Address:
PO Box 3049
Asheville, NC 28802
--------------------------------------------------------------------------------
Business Name : Gould Killian CPA Group, P.A.
Business URL : ?listingid=9659&profileid=2P7X216Y66&action=uweb&url=http%3a%2f%2fwww.gk-cpa.com
Owner : Ed Towson
Phone : (828) 258-0363
Address:
100 Coxe Avenue
Asheville, NC 28801
--------------------------------------------------------------------------------
Business Name : Michelle Tracz CPA, CFE, PLLC
Business URL : ?listingid=12921&profileid=610C8H3I7N&action=uweb&url=http%3a%2f%2fwww.michelletraczcpa.com
Owner : Michelle Tracz
Phone : (828) 280-2530
Address:
1238 Hendersonville Rd.
Asheville, NC 28803
--------------------------------------------------------------------------------
Business Name : Burleson & Earley, P.A.
Business URL : ?listingid=10436&profileid=57132N5P9C&action=uweb&url=http%3a%2f%2fwww.burlesonearley.com%2f
Owner : Bronwyn Burleson, CPA
Phone : (828) 251-2846
Address:
902 Sand Hill Road
Asheville, NC 28806
--------------------------------------------------------------------------------
Business Name : Carol L. King & Associates, P.A.
Business URL : ?listingid=10439&profileid=2Z8C7I0B4X&action=uweb&url=http%3a%2f%2fwww.clkcpa.com
Owner : Carol King
Phone : (828) 258-2323
Address:
40 North French Broad Avenue
Asheville, NC 28801
--------------------------------------------------------------------------------
Business Name : Goldsmith Molis & Gray
Business URL : ?listingid=12638&profileid=6C8D2C7F55&action=uweb&url=http%3a%2f%2fwww.gmg-cpa.com
Owner : Allen Gray
Phone : (828) 281-3161
Address:
32 Orange St.
Asheville, NC 28801
--------------------------------------------------------------------------------
Business Name : Corliss & Solomon, PLLC
Business URL : ?listingid=12407&profileid=6T7Y798S1R&action=uweb&url=http%3a%2f%2fwww.candspllc.com
Owner : Slater Solomon
Phone : (828) 236-0206
Address:
242 Charlotte St., Suite 1
Asheville, NC 28801
--------------------------------------------------------------------------------
Business Name : Mountain BizWorks
Business URL : ?listingid=12733&profileid=2L9E9G6A1S&action=uweb&url=http%3a%2f%2fwww.mountainbizworks.org
Owner : Matthew Raker
Phone : (828) 253-2834
Address:
153 South Lexington Ave.
Asheville, NC 28801
--------------------------------------------------------------------------------
Business Name : LeBlanc CPA Limited
Business URL : -
Owner : Leslie LeBlanc
Phone : (828) 225-4940
Address:
218 Broadway
Asheville, NC 28801-2347
--------------------------------------------------------------------------------
Business Name : Bolick & Associates, PA, CPA's
Business URL : -
Owner : Alan E Bolick, CPA
Phone : (828) 253-4692
Address:
Central Office Park Suite 104
56 Central Avenue
Asheville, NC 28801
--------------------------------------------------------------------------------
Edit: parsing the URL:
import requests
from bs4 import BeautifulSoup
from urllib.parse import unquote

url = 'https://web.ashevillechamber.org/cwt/external/wcpages/wcdirectory/Directory.aspx?CategoryID=1242&Title=Accounting++and++Bookkeeping&AdKeyword=Accounting++and++Bookkeeping'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

for b in soup.select('td[bgcolor="#E6E6E6"] b'):
    business_name = b.text
    business_url = b.a['href'] if b.a else '-'
    owner = b.find_next('td', width="90%").contents[0]
    addr, current = [], owner.find_next(text=True)
    while not current.find_parent('b'):
        addr.append(current.strip())
        current = current.find_next(text=True)
    addr = '\n'.join(addr)
    phone = current.find_next(text=True).strip()
    print('Business Name :', business_name)
    # Percent-decode the href and keep what follows the last '='
    print('Business URL :', unquote(business_url).rsplit('=', maxsplit=1)[-1])
    print('Owner :', owner)
    print('Phone :', phone)
    print('Address:')
    print(addr)
    print('-' * 80)
Prints:
Business Name : Johnson Price Sprinkle PA
Business URL : http://www.jpspa.com
Owner : Maria Pilos
Phone : (828) 254-2374
Address:
79 Woodfin Place, Suite 300
Asheville, NC 28801
--------------------------------------------------------------------------------
Business Name : Leah B. Noel, CPA, PC
Business URL : http://www.lbnoelcpa.com/
Owner : Ms. Leah Noel
Phone : 828-333-4529
Address:
14 S. Pack Square #503
Asheville, NC 28801
--------------------------------------------------------------------------------
...and so on.
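The rsplit('=') step assumes the target address itself contains no '=' character. As an alternative, here is a minimal sketch that reads the url query parameter by name with the standard library; the helper name extract_target_url and its default argument are hypothetical, and the sample href is taken from the first listing's output.

```python
from urllib.parse import parse_qs, urlsplit


def extract_target_url(href, default='-'):
    """Return the decoded `url` query parameter of href, or default if absent."""
    query = urlsplit(href).query
    # parse_qs percent-decodes values and maps each key to a list of values
    values = parse_qs(query).get('url')
    return values[0] if values else default


href = '?listingid=9758&profileid=2D7R3B5E4N&action=uweb&url=http%3a%2f%2fwww.jpspa.com'
print(extract_target_url(href))
print(extract_target_url('-'))
```

In the script above, this could stand in for the unquote(...).rsplit(...) expression; listings without a link still fall back to '-'.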