I have written a script in Python using the requests module and the BeautifulSoup library to scrape the names of the different people listed under the title Browse Our Offices on a website. The problem is that when I run the script it grabs names at random, populated automatically without any tab being selected.
When you visit the page, you can see the tabs as in the image below:
I would like to make a selection like in the following picture. To be clearer - I want to select the United States tab and then each of the states in order to parse the names attached to them. That's it.
What I've tried:
import requests
from bs4 import BeautifulSoup
link = "https://www.schooleymitchell.com/offices/"
res = requests.get(link)
soup = BeautifulSoup(res.text,"lxml")
for item in soup.select("#consultant_info > strong"):
    print(item.text)
The above script produces random names, but I wish to get the names attached to the United States tab.
How can I grab all the names populated upon selecting the United States tab and the individual state tabs, without using selenium?
Answer 0 (score: 2)
The important data is located in the <div> tag with id="office_box". You are only interested in the consultants whose <div> id ends in -usa. The first column printed below contains the name, the second the city and state:
import re
import requests
from bs4 import BeautifulSoup

url = 'https://www.schooleymitchell.com/offices/'
soup = BeautifulSoup(requests.get(url).text, 'lxml')

for div in soup.select('#office_box div[id*="-usa"] div.consultant_info_container'):
    # remove links so only the plain text of each consultant box remains
    for a in div.select('a'):
        a.extract()
    info = div.get_text(separator=" ").strip()
    # fields are separated by runs of at least two spaces
    info = re.split(r'\s{2}', info)
    for data in info:
        print('{: ^45}'.format(data), end='|')
    print()
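The key part of that selector is div[id*="-usa"], a CSS attribute-substring match: it keeps only <div> elements whose id contains -usa. A minimal offline sketch with made-up HTML (the real page's markup may differ) shows the effect:

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the page structure described above
html = """
<div id="office_box">
  <div id="gadsden-alabama-usa">
    <div class="consultant_info_container">Steven Bremer</div>
  </div>
  <div id="toronto-ontario-canada">
    <div class="consultant_info_container">Someone Else</div>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
# [id*="-usa"] filters out the non-US office
matches = soup.select('#office_box div[id*="-usa"] div.consultant_info_container')
print([m.get_text(strip=True) for m in matches])  # ['Steven Bremer']
```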
Answer 1 (score: 1)
First scrape all the people, then filter them using the id, which has the format {city}-{state}-{country}. One catch is that spaces in multi-word state/city names are replaced with dashes (-), but we can handle that easily by building a lookup table from the state list in the left column.
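To illustrate the lookup-table idea in isolation, here is a small offline sketch (the slugs and the split_location helper are hypothetical, not part of the answer's code): a slug like albany-new-york cannot be split on dashes alone, so we match known state slugs as suffixes instead.

```python
def split_location(location: str, states: dict) -> tuple:
    """Resolve (city, state) from a dash-joined slug using a state lookup table."""
    for slug, name in states.items():
        if location.endswith(slug):
            # everything before the matched state slug is the city
            city = location[: -len(slug)].strip("-").replace("-", " ")
            return city.title(), name
    return location.replace("-", " ").title(), None

states = {"new-york": "New York", "alabama": "Alabama"}
print(split_location("albany-new-york", states))  # ('Albany', 'New York')
print(split_location("gadsden-alabama", states))  # ('Gadsden', 'Alabama')
```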
Here is how:
import requests
from bs4 import BeautifulSoup


def make_soup(url: str) -> BeautifulSoup:
    res = requests.get(url, headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0'
    })
    res.raise_for_status()
    return BeautifulSoup(res.text, 'html.parser')


def extract_people(soup: BeautifulSoup) -> list:
    people = []
    # lookup table: state element id -> human-readable state name
    state_ids = {s['id']: s.text.strip()
                 for s in soup.select('#state-usa .state')}
    for person in soup.select('#office_box .office'):
        # id format: {city}-{state}-{country}
        location, country = person['id'].rsplit('-', 1)
        if country != 'usa':
            continue
        state, city = None, None
        for k in state_ids:
            if k in location:
                state = state_ids[k]
                city = location.replace(k, '').replace('-', ' ').strip()
                break
        name = person.select_one('#consultant_info > strong').text.strip()
        contact_url = person.select_one('.contact-button')['href']
        p = {
            'name': name,
            'state': state,
            'city': city,
            'contact_url': contact_url,
        }
        people.append(p)
    return people


if __name__ == "__main__":
    url = 'https://www.schooleymitchell.com/offices/'
    soup = make_soup(url)
    people = extract_people(soup)
    print(people)
Output:
[
{'name': 'Steven Bremer', 'state': 'Alabama', 'city': 'Gadsden', 'contact_url': 'https://www.schooleymitchell.com/sbremer/contact'},
{'name': 'David George', 'state': 'Alabama', 'city': 'Montgomery', 'contact_url': 'https://www.schooleymitchell.com/dgeorge/contact'},
{'name': 'Zachary G. Madrigal, MBA', 'state': 'Arizona', 'city': 'Phoenix', 'contact_url': 'https://www.schooleymitchell.com/zmadrigal/contact'},
...
]
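Since extract_people() returns a flat list of dicts, a natural follow-up is grouping the names by state - which is effectively what the tabs on the page do. A small sketch using the record shape shown above (the sample entries are copied from the output; the full list would come from the scraper):

```python
from collections import defaultdict

# Sample records in the same shape as extract_people()'s output
people = [
    {'name': 'Steven Bremer', 'state': 'Alabama', 'city': 'Gadsden'},
    {'name': 'David George', 'state': 'Alabama', 'city': 'Montgomery'},
    {'name': 'Zachary G. Madrigal, MBA', 'state': 'Arizona', 'city': 'Phoenix'},
]

# Group names under their state, mirroring the state tabs on the page
by_state = defaultdict(list)
for p in people:
    by_state[p['state']].append(p['name'])

print(dict(by_state))
# {'Alabama': ['Steven Bremer', 'David George'], 'Arizona': ['Zachary G. Madrigal, MBA']}
```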