Web scraping Sports Reference with Python Beautiful Soup

Time: 2021-04-14 01:29:51

Tags: python pandas beautifulsoup python-requests

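I'm trying to scrape data from Nick Saban's Sports Reference page so that I can pull the list of All-Americans he has coached, and then his bowl win-loss percentage.

I'm new to Python, so this has been a huge struggle. When I inspect the page, I see div id = #leaderboard_all-americans class = "data_grid_box"

When I run the code below, I get the Coaching Record table, which is the first table on the site. I tried using different indexes, thinking that might give me a different result, but that didn't work either.

Ultimately, I want to take the All-Americans data and turn it into a DataFrame.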

import requests
import bs4
import pandas as pd

saban2 = requests.get("https://www.sports-reference.com/cfb/coaches/nick-saban-1.html")
saban_soup2 = bs4.BeautifulSoup(saban2.text,"lxml")
saban_select = saban_soup2.select('div',{"id":"leaderboard_all-americans"})
saban_df2 = pd.read_html(str(saban_select))


1 answer:

Answer 0 (score: 1)

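sports-reference.com stores its HTML tables as comments in the base request response. You must first grab the comment block that contains the All-Americans and Bowl results, and then parse that result: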

import bs4
from bs4 import BeautifulSoup as soup
import requests, pandas as pd
# Fetch and parse the coach page
d = soup(requests.get('https://www.sports-reference.com/cfb/coaches/nick-saban-1.html').text, 'html.parser')
# Find the HTML comment that holds the All-Americans leaderboard
block = [i for i in d.find_all(string=lambda text: isinstance(text, bs4.Comment)) if 'id="leaderboard_all-americans"' in i][0]
# Re-parse the comment text as HTML so its tables can be queried
b = soup(str(block), 'html.parser')
players = b.select('#leaderboard_all-americans table.no_columns tr')
# Each row holds a linked name followed by ' (year)'; the slice drops ' (' and ')'
p_results = [{'name':i.td.a.text, 'year':i.td.contents[-1][2:-1]} for i in players]
all_americans = pd.DataFrame(p_results)
# Bowl win-loss percentage, read from the same comment block
bowl_win_loss = b.select_one('#leaderboard_win_loss_pct_post td.single').contents[-2]
print(all_americans)
print(bowl_win_loss)

Output:

all_americans

                  name       year
0       Jonathan Allen       2016
1        Javier Arenas       2009
2          Mark Barron       2011
3     Antoine Caldwell       2008
4    Ha Ha Clinton-Dix       2013
5        Terrence Cody  2008-2009
6       Landon Collins       2014
7         Amari Cooper       2014
8     Landon Dickerson       2020
9   Minkah Fitzpatrick  2016-2017
10       Reuben Foster       2016
11        Najee Harris       2020
12       Derrick Henry       2015
13    Dont'a Hightower       2011
14         Mark Ingram       2009
15         Jerry Jeudy       2018
16        Mike Johnson       2009
17       Barrett Jones  2011-2012
18           Mac Jones       2020
19          Ryan Kelly       2015
20     Cyrus Kouandjio       2013
21       Chad Lavalais       2003
22    Alex Leatherwood       2020
23     Rolando McClain       2009
24   Demarcus Milliner       2012
25         C.J. Mosley  2012-2013
26      Reggie Ragland       2015
27           Josh Reed       2001
28    Trent Richardson       2011
29    A'Shawn Robinson       2015
30        Cam Robinson       2016
31         Andre Smith       2008
32       DeVonta Smith       2020
33       Marcus Spears       2004
34  Patrick Surtain II       2020
35      Tua Tagovailoa       2018
36    Deionte Thompson       2018
37      Chance Warmack       2012
38       Ben Wilkerson       2004
39      Jonah Williams       2018
40    Quinnen Williams       2018

bowl_win_loss

' .63 (#23)'
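
To see why a normal lookup misses these tables, here is a minimal offline sketch of the same comment-parsing idea. The `SAMPLE_HTML` string, the player names, and the helper `find_commented_block` are made up for illustration. The sample only imitates the layout described above and is not the real page markup, so treat it as a way to test the technique rather than as a drop-in scraper.

```python
import bs4
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: one visible table and one
# table wrapped in an HTML comment.
SAMPLE_HTML = """
<div id="visible"><table><tr><td>Coaching Record</td></tr></table></div>
<!--
<div id="leaderboard_all-americans" class="data_grid_box">
  <table class="no_columns">
    <tr><td><a href="#">Player One</a> (2016)</td></tr>
    <tr><td><a href="#">Player Two</a> (2008-2009)</td></tr>
  </table>
</div>
-->
"""

def find_commented_block(html, element_id):
    """Return a soup of the first HTML comment mentioning element_id, or None."""
    page = BeautifulSoup(html, "html.parser")
    marker = f'id="{element_id}"'
    for comment in page.find_all(string=lambda t: isinstance(t, bs4.Comment)):
        if marker in comment:
            # Re-parse the comment text so its tags become searchable elements.
            return BeautifulSoup(str(comment), "html.parser")
    return None

page = BeautifulSoup(SAMPLE_HTML, "html.parser")
# The commented-out div is not an element in the parsed tree, so this finds nothing.
print(page.select_one("#leaderboard_all-americans"))

block = find_commented_block(SAMPLE_HTML, "leaderboard_all-americans")
rows = block.select("#leaderboard_all-americans table.no_columns tr")
players = [
    # The last child of each cell is text like ' (2016)'; strip spaces and parentheses.
    {"name": row.td.a.text, "year": row.td.contents[-1].strip().strip("()")}
    for row in rows
]
print(players)
```

Returning `None` when no comment matches, instead of indexing with `[0]`, lets the caller handle a changed page layout without an `IndexError`.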