Unable to connect selective names to specific tabs within a webpage

Date: 2019-07-05 21:11:44

Tags: python python-3.x web-scraping

I've written a script in Python using the requests module and the BeautifulSoup library to get the names of the different individuals listed under the heading Browse Our Offices on a website. The problem is that when I run the script it fetches names at random, the ones that get populated automatically without any tab being selected.

Website Link: https://www.schooleymitchell.com/offices/

When you visit that page, you can see the tabs shown in the picture below:

[screenshot of the location tabs]

I want to make the selection as in the picture below. To be clearer: I want to select the United States tab and then each of the states in order to parse the names connected to them. That's it.

[screenshot of the desired selection]

This is what I've tried:

import requests
from bs4 import BeautifulSoup

link = "https://www.schooleymitchell.com/offices/"

res = requests.get(link)
soup = BeautifulSoup(res.text,"lxml")
for item in soup.select("#consultant_info > strong"):
    print(item.text)

The script above produces names at random, but I want the names connected to the United States tab.

How can I get all the names that are populated after selecting the United States tab and the individual state tabs, without using Selenium?

2 Answers:

Answer 0 (score: 2)

The important data sit inside the <div> tag with id="office_box". You are only interested in the consultants whose <div> id ends with -usa. The script below prints the name in the first column and the city and state in the second:

import re
import requests
from bs4 import BeautifulSoup

url = 'https://www.schooleymitchell.com/offices/'

soup = BeautifulSoup(requests.get(url).text, 'lxml')


# only offices whose <div> id contains "-usa" (United States offices)
for div in soup.select('#office_box div[id*="-usa"] div.consultant_info_container'):
    # remove the link elements (e.g. the contact buttons) so only plain text remains
    for a in div.select('a'):
        a.extract()
    info = div.get_text(separator=" ").strip()
    # split into fields wherever two consecutive whitespace characters occur
    info = re.split(r'\s{2}', info)
    for data in info:
        # centre each field in a 45-character column, separated by |
        print('{: ^45}'.format(data), end='|')
    print()
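
If you only need the offices under a single state tab, the same selector can be narrowed by the id suffix. A minimal sketch, assuming the office ids follow the {city}-{state}-usa pattern described in the other answer (the -arizona-usa suffix is an assumption based on that pattern):

# continuing from the snippet above; assumed id pattern: {city}-{state}-usa
for div in soup.select('#office_box div[id$="-arizona-usa"] div.consultant_info_container'):
    for a in div.select('a'):
        a.extract()
    print(div.get_text(separator=" ").strip())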

Answer 1 (score: 1)

First scrape all of the people, then filter them by their id, which has the format {city}-{state}-{country}. One catch is that the spaces in multi-word state and city names are replaced with dashes (-), but that is easy to handle by building a lookup table from the list of states in the left-hand sidebar.
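
To see why a plain split cannot separate the city from the state on its own, consider a made-up id (the value below is purely illustrative, not taken from the page):

office_id = 'new-york-city-new-york-usa'      # hypothetical id, for illustration only
location, country = office_id.rsplit('-', 1)  # 'new-york-city-new-york', 'usa'
# the dashes alone do not tell you where the city ends and the state begins,
# hence the lookup table built from the known state names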

Here's the approach:

import requests
from bs4 import BeautifulSoup


def make_soup(url: str) -> BeautifulSoup:
    res = requests.get(url, headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0'
    })
    res.raise_for_status()
    return BeautifulSoup(res.text, 'html.parser')


def extract_people(soup: BeautifulSoup) -> list:
    people = []
    # lookup table: sidebar state id -> state display name (used to split city from state)
    state_ids = {s['id']: s.text.strip()
                 for s in soup.select('#state-usa .state')}
    for person in soup.select('#office_box .office'):
        person_id = person['id']
        # the office id has the form {city}-{state}-{country}
        location, country = person_id.rsplit('-', 1)
        if country != 'usa':
            continue

        state, city = None, None
        # find which sidebar state id appears in the location; the remainder is the city
        for k in state_ids.keys():
            if k in location:
                state = state_ids[k]
                city = location.replace(k, '').replace('-', ' ').strip()
                break

        name = person.select_one('#consultant_info > strong').text.strip()
        contact_url = person.select_one('.contact-button')['href']
        p = {
            'name': name,
            'state': state,
            'city': city,
            'contact_url': contact_url,
        }
        people.append(p)
    return people


if __name__ == "__main__":
    url = 'https://www.schooleymitchell.com/offices/'
    soup = make_soup(url)
    people = extract_people(soup)

    print(people)

Output:

[
    {'name': 'Steven Bremer', 'state': 'Alabama', 'city': 'Gadsden', 'contact_url': 'https://www.schooleymitchell.com/sbremer/contact'}, 
    {'name': 'David George', 'state': 'Alabama', 'city': 'Montgomery', 'contact_url': 'https://www.schooleymitchell.com/dgeorge/contact'}, 
    {'name': 'Zachary G. Madrigal, MBA', 'state': 'Arizona', 'city': 'Phoenix', 'contact_url': 'https://www.schooleymitchell.com/zmadrigal/contact'}, 
    ...
]
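
Once the people list has been built, restricting it to a single state tab is just a filter on the state field, for example:

# keep only the consultants listed under the Arizona tab
arizona_people = [p for p in people if p['state'] == 'Arizona']
for p in arizona_people:
    print(p['name'], '-', p['city'])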