I am using Python and BeautifulSoup to get listings from the BBB website.
My code works fine for Yelp and Yellow Pages, but when I switched to the BBB site link I started getting an error.
from bs4 import BeautifulSoup
import requests
import sys
import csv

## Get the min and max page numbers
pagenum = 0
maxpage = 0

## Loop through the pages
while pagenum <= maxpage:
    page = 'https://www.bbb.org/search?find_country=USA&find_entity=60980-000&find_id=396_60980-000_alias&find_latlng=40.762801%2C-73.977818&find_loc=New%20York%2C%20NY&find_text=web%20development&find_type=Category&page=2'
    source = requests.get(page).text
    soup = BeautifulSoup(source, 'lxml')
    pagenum = pagenum + 10
    for PParentDiv in soup.find_all('div', class_="fbHYdT MuiPaper-rounded"):
        try:
            PName = PParentDiv.find('a', class_='Name-sc-1srnbh5-0').get_text()
            print(PName)
        except Exception as e:
            g = ''
            print('notworking')
Here is part of the error:
Traceback (most recent call last):
File "E:\Python\Python36\lib\site-packages\urllib3\connectionpool.py", line 600, in urlopen
chunked=chunked)
File "E:\Python\Python36\lib\site-packages\urllib3\connectionpool.py", line 384, in _make_request
six.raise_from(e, None)
File "<string>", line 2, in raise_from
File "E:\Python\Python36\lib\site-packages\urllib3\connectionpool.py", line 380, in _make_request
httplib_response = conn.getresponse()
File "E:\Python\Python36\lib\http\client.py", line 1331, in getresponse
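The truncated traceback shows the connection being dropped mid-request. One common cause (an assumption here, not confirmed by the traceback alone) is that the site rejects the default python-requests User-Agent string; sending a browser-like header, as the first answer below does, often avoids this. A minimal sketch that prepares (but does not send) such a request, using the URL from the question:

```python
import requests

# Browser-like User-Agent; the default 'python-requests/x.y' is often blocked.
headers = {'User-Agent': 'Mozilla/5.0'}
url = ('https://www.bbb.org/search?find_country=USA&find_entity=60980-000'
       '&find_id=396_60980-000_alias&find_latlng=40.762801%2C-73.977818'
       '&find_loc=New%20York%2C%20NY&find_text=web%20development'
       '&find_type=Category&page=2')

# Prepare the request locally to show the header is attached; sending it
# would be requests.get(url, headers=headers).
req = requests.Request('GET', url, headers=headers).prepare()
print(req.headers['User-Agent'])  # Mozilla/5.0
```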
Answer 0 (score: 0)
You can easily regex the JSON out of the script tag that contains this information, then parse it with the json library. The advantage here is that the data variable actually holds everything. I show extracting the name, address, and phone.
import requests, re, json
headers = {'User-Agent':'Mozilla/5.0'}
r = requests.get('https://www.bbb.org/search?find_country=USA&find_entity=60980-000&find_id=396_60980-000_alias&find_latlng=40.762801%2C-73.977818&find_loc=New%20York%2C%20NY&find_text=web%20development&find_type=Category&page=2', headers = headers)
p = re.compile(r'PRELOADED_STATE__ = (.*?);')
data = json.loads(p.findall(r.text)[0])
results = [(item['businessName'], ' '.join([item['address'],item['city'], item['state'], item['postalcode']]), item['phone']) for item in data['searchResult']['results']]
print(results)
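As a self-contained illustration of the technique above, here is a minimal sketch that applies the same non-greedy regex to a synthetic script-tag payload (the HTML snippet and field values are invented for the demo):

```python
import re, json

# Invented HTML standing in for the page source; the variable name mirrors
# the __PRELOADED_STATE__ assignment the answer's regex targets.
html = ('<script>window.__PRELOADED_STATE__ = {"searchResult": {"results": '
        '[{"businessName": "Acme Web Co", "phone": "555-0100"}]}};</script>')

# Non-greedy capture: everything between the assignment and the first ';'.
p = re.compile(r'PRELOADED_STATE__ = (.*?);')
data = json.loads(p.findall(html)[0])

names = [item['businessName'] for item in data['searchResult']['results']]
print(names)  # ['Acme Web Co']
```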
Answer 1 (score: 0)
Try this, without using regex:
import json

# soup comes from the question's code: BeautifulSoup(source, 'lxml')
scr = soup.find_all('script', id="BbbDtmData")
scr2 = soup.find_all('div', class_="Details-sc-1vh1927-0 hHqWfJ")

companies = []
ids = []
for co in range(len(scr2)):
    companies.append(scr2[co].find('a').text)
    companies.append(scr2[co].find('strong').text)

# Everything after the assignment in the script tag is plain JSON.
id_dat = scr[0].text
target = id_dat.split('var bbbDtmData = ')
data = json.loads(target[1])

final = data['search']['results']
for i in final:
    ids.append(i['businessId'])

for co, id in zip(companies, ids):
    print(co, id)
Output for the linked page:
Template Studios/Jinx Studios 94645
115 East 57th St, New York, NY 10022 144428
Roark Tech Services 120257
New York, NY 10017-2452 85275
etc.
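The split-and-parse step in this answer can be exercised on its own. Here is a minimal sketch with an invented bbbDtmData-style payload (the business IDs are made up for the demo):

```python
import json

# Invented script-tag body mimicking the 'var bbbDtmData = ...' assignment.
id_dat = ('var bbbDtmData = {"search": {"results": '
          '[{"businessId": "94645"}, {"businessId": "120257"}]}}')

# Split on the assignment; the remainder is plain JSON.
target = id_dat.split('var bbbDtmData = ')
data = json.loads(target[1])

ids = [item['businessId'] for item in data['search']['results']]
print(ids)  # ['94645', '120257']
```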