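How do I scrape a specific table from a web page that has multiple tables?

Time: 2020-05-26 00:44:40

Tags: python pandas web-scraping beautifulsoup

I am trying to scrape some NFL data from the following website: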

url = 'https://www.pro-football-reference.com/years/2019/opp.htm'

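I first tried to scrape the tables with pandas. I have done this before and it has always been straightforward. I expected pandas to return a list of all the tables on the page. However, when I ran dfs = pd.read_html(url), I only got the first two tables from the page, Team Defense and Team Advanced Defense.

I then tried to scrape the other tables with bs4 and requests. As a test, I started by scraping just one table: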

import requests
from bs4 import BeautifulSoup

page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')

table = soup.find('table', id = 'advanced_defense')

rows = table.find_all('tr')

for tr in rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    print(row)

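By simply changing the id, I was able to get both Team Defense and Team Advanced Defense, the same two tables that pandas returned.

However, when I use the same approach for the other tables on the page, I get an error. I found their ids by inspecting the page, exactly as I did for the first two tables, but I get no results.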

page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')

table = soup.find('table', id = 'passing')

rows = table.find_all('tr')

for tr in rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    print(row)

For any of the other tables on the page, nothing is found for table, and I get the following error:

AttributeError: 'NoneType' object has no attribute 'find_all'

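I find it odd that both pandas and bs4 only return the Team Defense and Team Advanced Defense tables.

I only intend to scrape the Team Defense, Passing Defense, and Rushing Defense tables.

How can I successfully scrape the Passing Defense and Rushing Defense tables?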

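1 Answer:

Answer 0: (score: 1)

The Sports Reference sites are tricky because only the first table (or the first few tables) actually appears as regular markup in the HTML source. The other tables are rendered dynamically. They are still present in the HTML, though, inside comments. So to get the other tables, you first have to pull out the comments, and then you can use pandas or BeautifulSoup to get at the table tags inside them.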
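To make the point about comments concrete, here is a minimal sketch that runs without touching the network. The HTML string is made up for illustration and only mimics the layout described above (one regular table, one table wrapped in a comment); the team names and numbers in it are placeholders, not a copy of the real page.

```python
from bs4 import BeautifulSoup, Comment

# Made-up HTML that mimics the page layout: one live table, one commented out.
html = """
<div><table id="team_stats"><tr><td>Patriots</td><td>225</td></tr></table></div>
<div><!--
<table id="passing"><tr><td>49ers</td><td>318</td></tr></table>
--></div>
"""

soup = BeautifulSoup(html, "html.parser")

# The live table is found; the commented-out one is not, so find() returns None.
live_table = soup.find("table", id="team_stats")
hidden_table = soup.find("table", id="passing")
print(live_table is not None)  # True
print(hidden_table)            # None

# The missing markup is still there, as the text of a Comment node.
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
print(len(comments))                  # 1
print('id="passing"' in comments[0])  # True
```

That None is what leads to the AttributeError in the question: calling find_all on it fails because no table element was ever found.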

So you can grab the team stats as usual, then extract the comments and parse the other tables from them.

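import pandas as pd
import requests
from io import StringIO
from bs4 import BeautifulSoup, Comment

url = 'https://www.pro-football-reference.com/years/2019/opp.htm'

# Download the page and collect every HTML comment in it.
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
comments = soup.find_all(string=lambda text: isinstance(text, Comment))

# The team stats table is in the regular HTML, so read_html can fetch it directly.
dfs = [pd.read_html(url, header=0, attrs={'id':'team_stats'})[0]]
# Its real column names end up in the first data row; promote that row to the header.
dfs[0].columns = dfs[0].iloc[0,:]
dfs[0] = dfs[0].iloc[1:,:].reset_index(drop=True)

# The passing and rushing tables sit inside comments, so parse the comment text itself.
for each in comments:
    if 'table' in each and ('id="passing"' in each or 'id="rushing"' in each):
        # Wrapping in StringIO avoids the warning newer pandas emits for literal HTML strings.
        dfs.append(pd.read_html(StringIO(str(each)))[0])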
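If you would rather stay with BeautifulSoup, as in the question, the same idea can be sketched without pandas. The helper names find_table and table_rows below are hypothetical, and the sample HTML is made up so the snippet runs offline; for the real page you would build the soup from the requests response instead.

```python
from bs4 import BeautifulSoup, Comment


def find_table(soup, table_id):
    """Return the <table> with this id, searching HTML comments as a fallback."""
    table = soup.find("table", id=table_id)
    if table is not None:
        return table
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        # Parse the comment text as its own small document.
        inner = BeautifulSoup(str(comment), "html.parser")
        table = inner.find("table", id=table_id)
        if table is not None:
            return table
    return None


def table_rows(table):
    """Collect the text of the <td> cells, one list per row that has any."""
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:
            rows.append(cells)
    return rows


# Made-up HTML standing in for the real page (assumption, not the site's markup).
sample_html = """
<table id="team_stats"><tr><th>Tm</th><th>PF</th></tr><tr><td>Patriots</td><td>225</td></tr></table>
<!-- <table id="rushing"><tr><th>Tm</th><th>Att</th></tr><tr><td>Buccaneers</td><td>362</td></tr></table> -->
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
for table_id in ("team_stats", "rushing", "no_such_table"):
    table = find_table(sample_soup, table_id)
    print(table_id, table_rows(table) if table is not None else None)
```

Checking for None before calling find_all is what prevents the AttributeError from the question when an id does not exist anywhere on the page.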

Output:

for df in dfs:
    print(df)


0    Rk                    Tm    G     PF  ... 1stPy   Sc%   TO%      EXP
0     1  New England Patriots   16    225  ...    39  19.4  17.3   165.75
1     2         Buffalo Bills   16    259  ...    33  23.6  12.4    39.85
2     3      Baltimore Ravens   16    282  ...    39  32.9  14.6    16.61
3     4         Chicago Bears   16    298  ...    30  31.5  10.7    -4.15
4     5     Minnesota Vikings   16    303  ...    31  34.5  17.0    -7.88
5     6   Pittsburgh Steelers   16    303  ...    30  29.9  19.0    85.78
6     7    Kansas City Chiefs   16    308  ...    39  34.6  13.6   -65.69
7     8   San Francisco 49ers   16    310  ...    30  29.0  14.2    77.41
8     9     Green Bay Packers   16    313  ...    20  34.5  14.1   -63.65
9    10        Denver Broncos   16    316  ...    34  37.3   8.4   -35.98
10   11        Dallas Cowboys   16    321  ...    38  35.5   9.9   -36.81
11   12      Tennessee Titans   16    331  ...    27  32.1  11.8   -54.20
12   13    New Orleans Saints   16    341  ...    43  34.7  12.7   -41.89
13   14  Los Angeles Chargers   16    345  ...    28  37.3   8.2   -86.11
14   15   Philadelphia Eagles   16    354  ...    28  33.9  10.2   -29.57
15   16         New York Jets   16    359  ...    40  34.4  10.1    -0.06
16   17      Los Angeles Rams   16    364  ...    30  33.7  12.7   -11.53
17   18    Indianapolis Colts   16    373  ...    23  39.3  13.1   -58.37
18   19        Houston Texans   16    385  ...    28  39.3  13.1  -160.87
19   20      Cleveland Browns   16    393  ...    37  36.9  11.2   -91.15
20   21  Jacksonville Jaguars   16    397  ...    33  37.4   9.2  -120.09
21   22      Seattle Seahawks   16    398  ...    25  37.1  16.3   -92.02
22   23       Atlanta Falcons   16    399  ...    30  42.8   9.0  -105.34
23   24       Oakland Raiders   16    419  ...    52  41.2   8.5  -159.71
24   25    Cincinnati Bengals   16    420  ...    21  39.8   8.8  -132.66
25   26         Detroit Lions   16    423  ...    39  40.1   9.0  -142.55
26   27   Washington Redskins   16    435  ...    34  41.9  12.2  -135.83
27   28     Arizona Cardinals   16    442  ...    38  42.6   9.5  -174.55
28   29  Tampa Bay Buccaneers   16    449  ...    39  39.6  13.5    12.23
29   30       New York Giants   16    451  ...    32  39.7   8.7  -105.11
30   31     Carolina Panthers   16    470  ...    30  41.4   9.4  -116.88
31   32        Miami Dolphins   16    494  ...    34  45.6   8.8  -175.02
32  NaN              Avg Team  NaN  365.0  ...  32.9  36.0  11.8    -56.6
33  NaN          League Total  NaN  11680  ...  1054  36.0  11.8      NaN
34  NaN              Avg Tm/G  NaN   22.8  ...   2.1  36.0  11.8      NaN

[35 rows x 28 columns]
      Rk                    Tm     G      Cmp  ...  NY/A  ANY/A  Sk%     EXP
0    1.0   San Francisco 49ers  16.0    318.0  ...  4.80    4.6  8.5   58.30
1    2.0  New England Patriots  16.0    303.0  ...  5.00    3.5  8.1  117.74
2    3.0   Pittsburgh Steelers  16.0    314.0  ...  5.50    4.7  9.5   20.19
3    4.0         Buffalo Bills  16.0    348.0  ...  5.20    4.7  7.4   30.01
4    5.0  Los Angeles Chargers  16.0    328.0  ...  6.50    6.3  6.1  -92.16
5    6.0      Baltimore Ravens  16.0    318.0  ...  5.70    5.2  6.4   15.40
6    7.0      Cleveland Browns  16.0    318.0  ...  6.30    6.1  6.9  -64.09
7    8.0    Kansas City Chiefs  16.0    352.0  ...  5.70    5.2  7.2  -36.78
8    9.0         Chicago Bears  16.0    362.0  ...  5.90    5.7  5.3  -47.04
9   10.0        Dallas Cowboys  16.0    370.0  ...  5.90    6.1  6.4  -67.46
10  11.0        Denver Broncos  16.0    348.0  ...  6.30    6.1  6.9  -61.45
11  12.0      Los Angeles Rams  16.0    348.0  ...  5.90    5.7  8.2  -42.76
12  13.0     Carolina Panthers  16.0    347.0  ...  6.20    5.8  8.9  -63.03
13  14.0     Green Bay Packers  16.0    326.0  ...  6.30    5.7  7.0  -27.30
14  15.0     Minnesota Vikings  16.0    394.0  ...  5.80    5.3  7.4  -34.01
15  16.0  Jacksonville Jaguars  16.0    327.0  ...  6.70    6.7  8.3  -98.77
16  17.0         New York Jets  16.0    363.0  ...  6.10    6.0  5.6  -79.16
17  18.0   Washington Redskins  16.0    371.0  ...  6.50    6.7  7.8 -135.17
18  19.0   Philadelphia Eagles  16.0    348.0  ...  6.30    6.4  7.0  -88.15
19  20.0    New Orleans Saints  16.0    371.0  ...  5.90    5.8  7.8  -94.59
20  21.0    Cincinnati Bengals  16.0    308.0  ...  7.40    7.4  5.8 -126.81
21  22.0       Atlanta Falcons  16.0    351.0  ...  6.90    7.0  5.0 -128.75
22  23.0    Indianapolis Colts  16.0    394.0  ...  6.60    6.4  6.8  -86.44
23  24.0      Tennessee Titans  16.0    386.0  ...  6.40    6.2  6.7  -92.39
24  25.0       Oakland Raiders  16.0    337.0  ...  7.40    7.8  5.7 -177.69
25  26.0        Miami Dolphins  16.0    344.0  ...  7.40    7.7  4.0 -172.01
26  27.0      Seattle Seahawks  16.0    383.0  ...  6.70    6.2  4.5  -77.18
27  28.0       New York Giants  16.0    369.0  ...  7.10    7.4  6.1 -152.48
28  29.0        Houston Texans  16.0    375.0  ...  6.90    7.1  5.0 -160.60
29  30.0  Tampa Bay Buccaneers  16.0    408.0  ...  6.10    6.2  6.6  -38.17
30  31.0     Arizona Cardinals  16.0    421.0  ...  7.00    7.7  6.2 -190.81
31  32.0         Detroit Lions  16.0    381.0  ...  7.10    7.7  4.4 -162.94
32   NaN              Avg Team   NaN    354.1  ...  6.29    6.2  6.7  -73.60
33   NaN          League Total   NaN  11331.0  ...  6.29    6.2  6.7     NaN
34   NaN              Avg Tm/G   NaN     22.1  ...  6.29    6.2  6.7     NaN

[35 rows x 25 columns]
      Rk                    Tm     G      Att  ...     TD  Y/A    Y/G    EXP
0    1.0  Tampa Bay Buccaneers  16.0    362.0  ...   11.0  3.3   73.8  56.23
1    2.0         New York Jets  16.0    417.0  ...   12.0  3.3   86.9  72.34
2    3.0   Philadelphia Eagles  16.0    353.0  ...   13.0  4.1   90.1  47.64
3    4.0    New Orleans Saints  16.0    345.0  ...   12.0  4.2   91.3  39.45
4    5.0      Baltimore Ravens  16.0    340.0  ...   12.0  4.4   93.4  -1.25
5    6.0  New England Patriots  16.0    365.0  ...    7.0  4.2   95.5  33.13
6    7.0    Indianapolis Colts  16.0    383.0  ...    8.0  4.1   97.9  21.54
7    8.0       Oakland Raiders  16.0    405.0  ...   15.0  3.9   98.1  17.69
8    9.0         Chicago Bears  16.0    414.0  ...   16.0  3.9  102.0  38.83
9   10.0         Buffalo Bills  16.0    388.0  ...   12.0  4.3  103.1  10.92
10  11.0        Dallas Cowboys  16.0    407.0  ...   14.0  4.1  103.5  25.11
11  12.0      Tennessee Titans  16.0    415.0  ...   14.0  4.0  104.5  28.27
12  13.0     Minnesota Vikings  16.0    404.0  ...    8.0  4.3  108.0  21.01
13  14.0   Pittsburgh Steelers  16.0    462.0  ...    7.0  3.8  109.6  63.09
14  15.0       Atlanta Falcons  16.0    421.0  ...   13.0  4.2  110.9  17.98
15  16.0        Denver Broncos  16.0    426.0  ...    9.0  4.2  111.4  12.72
16  17.0   San Francisco 49ers  16.0    401.0  ...   11.0  4.5  112.6   9.91
17  18.0  Los Angeles Chargers  16.0    429.0  ...   15.0  4.2  112.8   1.08
18  19.0      Los Angeles Rams  16.0    444.0  ...   15.0  4.1  113.1  21.49
19  20.0       New York Giants  16.0    469.0  ...   19.0  3.9  113.3  40.51
20  21.0         Detroit Lions  16.0    455.0  ...   13.0  4.1  115.9  17.32
21  22.0      Seattle Seahawks  16.0    388.0  ...   22.0  4.9  117.7 -17.45
22  23.0     Green Bay Packers  16.0    411.0  ...   15.0  4.7  120.1 -42.18
23  24.0     Arizona Cardinals  16.0    439.0  ...    9.0  4.4  120.1  15.13
24  25.0        Houston Texans  16.0    403.0  ...   12.0  4.8  121.1  -6.34
25  26.0    Kansas City Chiefs  16.0    416.0  ...   14.0  4.9  128.2 -41.35
26  27.0        Miami Dolphins  16.0    485.0  ...   15.0  4.5  135.4  -6.14
27  28.0  Jacksonville Jaguars  16.0    435.0  ...   23.0  5.1  139.3 -21.95
28  29.0     Carolina Panthers  16.0    445.0  ...   31.0  5.2  143.5 -62.69
29  30.0      Cleveland Browns  16.0    463.0  ...   19.0  5.0  144.7 -37.50
30  31.0   Washington Redskins  16.0    493.0  ...   14.0  4.7  146.2  -6.89
31  32.0    Cincinnati Bengals  16.0    504.0  ...   17.0  4.7  148.9 -12.07
32   NaN              Avg Team   NaN    418.3  ...   14.0  4.3  112.9  11.10
33   NaN          League Total   NaN  13387.0  ...  447.0  4.3  112.9    NaN
34   NaN              Avg Tm/G   NaN     26.1  ...    0.9  4.3  112.9    NaN

[35 rows x 9 columns]