因此,我尝试抓取网站of its staff roster,并且希望最终产品成为{staff:position}格式的字典。我目前坚持使用它,将每个职员的姓名和职位作为单独的字符串返回。很难清楚地发布输出,但是从本质上讲,它沿着名称列表,然后是位置。因此,例如,列表上的名字将与第一个位置配对,依此类推。我确定每个名称和职位都是class 'bs4.element.Tag
。我相信我需要获取名称和位置,并从中列出一个列表,然后使用zip
将元素放入字典中。我已尝试实现此功能,但到目前为止没有任何效果。使用class_
参数可以得到的文字最低要求是包含div
的单个p
。我对python还是很陌生,对Web抓取还是陌生的,但是我相对而言非常熟悉html和CSS,因此非常感谢您的帮助。
# Simple script attempting to scrape
# the staff roster off of the
# Greenville Drive website
import requests
from bs4 import BeautifulSoup
URL = 'https://www.milb.com/greenville/ballpark/frontoffice'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
staff = soup.find_all('div', class_='l-grid__col l-grid__col--xs-12 l-grid__col--sm-4 l-grid__col--md-3 l-grid__col--lg-3 l-grid__col--xl-3')
for staff in staff:
data = staff.find('p')
if data:
print(data.text.strip())
position = soup.find_all('div', class_='l-grid__col l-grid__col--xs-12 l-grid__col--sm-4 l-grid__col--md-6 l-grid__col--lg-6 l-grid__col--xl-6')
for position in position:
data = position.find('p')
if data:
print(data.text.strip())
# This code so far provides the needed data, but need it in a dict()
答案 0 :(得分:3)
BeautifulSoup具有find_next()
,可用于获取具有指定匹配过滤器的下一个标签。找到“职员” div
,然后使用 find_next()获取相邻的“位置” div
。
import requests
from bs4 import BeautifulSoup
URL = 'https://www.milb.com/greenville/ballpark/frontoffice'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
staff_class = 'l-grid__col l-grid__col--xs-12 l-grid__col--sm-4 l-grid__col--md-3 l-grid__col--lg-3 l-grid__col--xl-3'
position_class = 'l-grid__col l-grid__col--xs-12 l-grid__col--sm-4 l-grid__col--md-6 l-grid__col--lg-6 l-grid__col--xl-6'
result = {}
for staff in soup.find_all('div', class_=staff_class):
data = staff.find('p')
if data:
staff_name = data.text.strip()
postion_div = staff.find_next('div', class_=position_class)
postion_name = postion_div.text.strip()
result[staff_name] = postion_name
print(result)
输出
{'Craig Brown': 'Owner/Team President', 'Eric Jarinko': 'General Manager', 'Nate Lipscomb': 'Special Advisor to the President', 'Phil Bargardi': 'Vice President of Sales', 'Jeff Brown': 'Vice President of Marketing', 'Greg Burgess, CSFM': 'Vice President of Operations/Grounds', 'Jordan Smith': 'Vice President of Finance', 'Ned Kennedy': 'Director of Inside Sales', 'Patrick Innes': 'Director of Ticket Operations', 'Micah Gold': 'Senior Account Executive', 'Molly Mains': 'Senior Account Executive', 'Houghton Flanagan': 'Account Executive', 'Jeb Maloney': 'Account Executive', 'Olivia Adams': 'Inside Sales Representative', 'Tyler Melson': 'Inside Sales Representative', 'Toby Sandblom': 'Inside Sales Representative', 'Katie Batista': 'Director of Sponsorships and Community Engagement', 'Matthew Tezza': 'Sponsor Services and Activations Manager', 'Melissa Welch': 'Sponsorship and Community Events Manager', 'Beth Rusch': 'Director of West End Events', 'Kristin Kipper': 'Events Manager', 'Grant Witham': 'Events Manager', 'Alex Guest': 'Director of Game Entertainment & Production', 'Lance Fowler': 'Director of Video Production', 'Davis Simpson': 'Director of Media and Creative Services', 'Cameron White': 'Media Relations Manager', 'Ed Jenson': 'Broadcaster', 'Adam Baird': 'Accountant', 'Mike Agostino': 'Director of Food and Beverage', 'Roger Campana': 'Assistant Director of Food and Beverage', 'Wilbert Sauceda': 'Executive Chef', 'Elise Parish': 'Premium Services Manager', 'Timmy Hinds': 'Director of Facility Operations', 'Zack Pagans': 'Assistant Groundskeeper', 'Amanda Medlin': 'Business and Team Operations Manager', 'Allison Roedell': 'Office Manager'}
答案 1 :(得分:1)
使用CSS选择器和zip()
的解决方案:
import requests
from bs4 import BeautifulSoup
url = 'https://www.milb.com/greenville/ballpark/frontoffice'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
out = {}
for name, position in zip( soup.select('div:has(+ div p) b'),
soup.select('div:has(> div b) + div p')):
out[name.text] = position.text
from pprint import pprint
pprint(out)
打印:
{'Adam Baird': 'Accountant',
'Alex Guest': 'Director of Game Entertainment & Production',
'Allison Roedell': 'Office Manager',
'Amanda Medlin': 'Business and Team Operations Manager',
'Beth Rusch': 'Director of West End Events',
'Brady Andrews': 'Assistant Director of Facility Operations',
'Brooks Henderson': 'Merchandise Manager',
'Bryan Jones': 'Facilities Cleanliness Manager',
'Cameron White': 'Media Relations Manager',
'Craig Brown': 'Owner/Team President',
'Davis Simpson': 'Director of Media and Creative Services',
'Ed Jenson': 'Broadcaster',
'Elise Parish': 'Premium Services Manager',
'Eric Jarinko': 'General Manager',
'Grant Witham': 'Events Manager',
'Greg Burgess, CSFM': 'Vice President of Operations/Grounds',
'Houghton Flanagan': 'Account Executive',
'Jeb Maloney': 'Account Executive',
'Jeff Brown': 'Vice President of Marketing',
'Jenny Burgdorfer': 'Director of Merchandise',
'Jordan Smith ': 'Vice President of Finance',
'Katie Batista': 'Director of Sponsorships and Community Engagement',
'Kristin Kipper': 'Events Manager',
'Lance Fowler': 'Director of Video Production',
'Matthew Tezza': 'Sponsor Services and Activations Manager',
'Melissa Welch': 'Sponsorship and Community Events Manager',
'Micah Gold': 'Senior Account Executive',
'Mike Agostino': 'Director of Food and Beverage',
'Molly Mains': 'Senior Account Executive',
'Nate Lipscomb': 'Special Advisor to the President',
'Ned Kennedy': 'Director of Inside Sales',
'Olivia Adams': 'Inside Sales Representative',
'Patrick Innes': 'Director of Ticket Operations',
'Phil Bargardi': 'Vice President of Sales',
'Roger Campana': 'Assistant Director of Food and Beverage',
'Steve Seman': 'Merchandise / Ticketing Advisor',
'Timmy Hinds': 'Director of Facility Operations',
'Toby Sandblom': 'Inside Sales Representative',
'Tyler Melson': 'Inside Sales Representative',
'Wilbert Sauceda': 'Executive Chef',
'Zack Pagans': 'Assistant Groundskeeper'}