Scraping a specific table from a web page with Python 3 (the page contains multiple tables)

Time: 2020-08-03 20:51:08

Tags: python web-scraping

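I am trying to extract data from one specific table on a web page. The page contains several tables, so I tried to use the table's ID to pull out only the one I need.

URL: https://basketball.realgm.com/player/Luke-Nelson/Summary/50483

My code so far is below.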

from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
import ssl


# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

#URL input
url = 'https://basketball.realgm.com/player/Luke-Nelson/Summary/50483'
html = urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, "html.parser")

table = soup.find('table', id='table-1696')
print(table)

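I assumed the print statement would print the table's HTML (I have only worked with a single-table page before), but when I run the program I get the following output: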

Terminal Output

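Ultimately, my goal is to recreate the table in Python and export it to Excel, but I can't get past this first hurdle!

Here is the HTML of the table on the web page: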

<table class="tablesaw compact tablesaw-swipe tablesaw-sortable" data-tablesaw-mode="swipe" data-tablesaw-mode-switch="" data-tablesaw-mode-exclude="columntoggle" data-tablesaw-sortable="" data-tablesaw-sortable-switch="" id="table-1696" style="">
<thead><tr class="per_game per_48 per_40 per_36 per_minute minute_per total">
<th data-tablesaw-sortable-col="" data-tablesaw-priority="persist" data-tablesaw-sortable-default-col="" class="tablesaw-cell-persist tablesaw-sortable-head tablesaw-sortable-ascending"><button class="tablesaw-sortable-btn">Season</button></th>
<th data-tablesaw-sortable-col="" class="tablesaw-sortable-head"><button class="tablesaw-sortable-btn">Team</button></th>
<th data-tablesaw-sortable-col="" class="tablesaw-sortable-head"><button class="tablesaw-sortable-btn">League</button></th>
<th data-tablesaw-sortable-col="" class="tablesaw-sortable-head"><button class="tablesaw-sortable-btn">GP</button></th>
<th data-tablesaw-sortable-col="" class="tablesaw-sortable-head"><button class="tablesaw-sortable-btn">GS</button></th>
<th data-tablesaw-sortable-col="" class="tablesaw-sortable-head"><button class="tablesaw-sortable-btn">MIN</button></th>
<th data-tablesaw-sortable-col="" class="tablesaw-sortable-head"><button class="tablesaw-sortable-btn">FGM</button></th>
<th data-tablesaw-sortable-col="" class="tablesaw-sortable-head"><button class="tablesaw-sortable-btn">FGA</button></th>
<th data-tablesaw-sortable-col="" class="tablesaw-sortable-head"><button class="tablesaw-sortable-btn">FG%</button></th>
<th data-tablesaw-sortable-col="" class="tablesaw-sortable-head"><button class="tablesaw-sortable-btn">3PM</button></th>
<th data-tablesaw-sortable-col="" class="tablesaw-sortable-head"><button class="tablesaw-sortable-btn">3PA</button></th>
<th data-tablesaw-sortable-col="" class="tablesaw-sortable-head"><button class="tablesaw-sortable-btn">3P%</button></th>
<th data-tablesaw-sortable-col="" class="tablesaw-sortable-head"><button class="tablesaw-sortable-btn">FTM</button></th>
<th data-tablesaw-sortable-col="" class="tablesaw-sortable-head"><button class="tablesaw-sortable-btn">FTA</button></th>
<th data-tablesaw-sortable-col="" class="tablesaw-sortable-head"><button class="tablesaw-sortable-btn">FT%</button></th>
<th data-tablesaw-sortable-col="" class="tablesaw-sortable-head"><button class="tablesaw-sortable-btn">OFF</button></th>
<th data-tablesaw-sortable-col="" class="tablesaw-sortable-head"><button class="tablesaw-sortable-btn">DEF</button></th>
<th data-tablesaw-sortable-col="" class="tablesaw-sortable-head"><button class="tablesaw-sortable-btn">TRB</button></th>
<th data-tablesaw-sortable-col="" class="tablesaw-sortable-head"><button class="tablesaw-sortable-btn">AST</button></th>
<th data-tablesaw-sortable-col="" class="tablesaw-sortable-head"><button class="tablesaw-sortable-btn">STL</button></th>
<th data-tablesaw-sortable-col="" class="tablesaw-sortable-head tablesaw-cell-hidden"><button class="tablesaw-sortable-btn">BLK</button></th>
<th data-tablesaw-sortable-col="" class="tablesaw-sortable-head tablesaw-cell-hidden"><button class="tablesaw-sortable-btn">PF</button></th>
<th data-tablesaw-sortable-col="" class="tablesaw-sortable-head tablesaw-cell-hidden"><button class="tablesaw-sortable-btn">TOV</button></th>
<th data-tablesaw-sortable-col="" class="tablesaw-sortable-head tablesaw-cell-hidden"><button class="tablesaw-sortable-btn">PTS</button></th>
</tr></thead><tbody><tr class="per_game">
<td class="tablesaw-cell-persist">2012-13</td>
<td id="teamLineinternational_reg_Per_Game_1"><a href="/international/league/47/adidas-Next-Generation-Tournament/team/1304/Team-England-U18-Men">Team England U18 Men</a></td>
<td><a href="/international/league/47/adidas-Next-Generation-Tournament">ANGT</a></td>
<td>3</td>
<td>3</td>
<td>33.3</td>
<td>6.00</td>
<td>16.33</td>
<td>.367</td>
<td>1.33</td>
<td>4.33</td>
<td>.308</td>
<td>2.33</td>
<td>2.67</td>
<td>.875</td>
<td>0.00</td>
<td>3.33</td>
<td>3.33</td>
<td>5.67</td>
<td>2.00</td>
<td class="tablesaw-cell-hidden">0.33</td>
<td class="tablesaw-cell-hidden">3.00</td>
<td class="tablesaw-cell-hidden">3.67</td>
<td class="tablesaw-cell-hidden">15.67</td>
</tr>
<tr class="per_game">
<td class="tablesaw-cell-persist">2017-18</td>
<td id="teamLineinternational_reg_Per_Game_2"><a href="/international/league/4/Spanish-ACB/team/212/Coosur-Real-Betis">Coosur Real Betis</a></td>
<td><a href="/international/league/4/Spanish-ACB">ACB</a></td>
<td>34</td>
<td>28</td>
<td>23.2</td>
<td>2.97</td>
<td>6.74</td>
<td>.441</td>
<td>1.47</td>
<td>3.59</td>
<td>.410</td>
<td>0.79</td>
<td>1.03</td>
<td>.771</td>
<td>0.24</td>
<td>1.91</td>
<td>2.15</td>
<td>1.68</td>
<td>1.06</td>
<td class="tablesaw-cell-hidden">0.03</td>
<td class="tablesaw-cell-hidden">3.00</td>
<td class="tablesaw-cell-hidden">1.82</td>
<td class="tablesaw-cell-hidden">8.21</td>
</tr>
<tr class="per_game">
<td class="tablesaw-cell-persist">2019-20 *</td>
<td id="teamLineinternational_reg_Per_Game_3">All Teams</td>
<td>All Leagues</td>
<td>17</td>
<td>5</td>
<td>16.7</td>
<td>2.82</td>
<td>7.29</td>
<td>.387</td>
<td>1.35</td>
<td>3.88</td>
<td>.348</td>
<td>1.35</td>
<td>1.59</td>
<td>.852</td>
<td>0.24</td>
<td>0.94</td>
<td>1.18</td>
<td>2.47</td>
<td>0.71</td>
<td class="tablesaw-cell-hidden">0.18</td>
<td class="tablesaw-cell-hidden">2.24</td>
<td class="tablesaw-cell-hidden">1.59</td>
<td class="tablesaw-cell-hidden">8.35</td>
</tr>
<tr class="per_game multiple-teams-highlight">
<td class="tablesaw-cell-persist">2019-20 *</td>
<td id="teamLineinternational_reg_Per_Game_4"><a href="/international/league/4/Spanish-ACB/team/473/ICL-Manresa">ICL Manresa</a></td>
<td><a href="/international/league/4/Spanish-ACB">ACB</a></td>
<td>9</td>
<td>1</td>
<td>13.6</td>
<td>1.78</td>
<td>5.56</td>
<td>.320</td>
<td>0.56</td>
<td>2.89</td>
<td>.192</td>
<td>1.56</td>
<td>1.67</td>
<td>.933</td>
<td>0.33</td>
<td>0.78</td>
<td>1.11</td>
<td>1.89</td>
<td>0.22</td>
<td class="tablesaw-cell-hidden">0.00</td>
<td class="tablesaw-cell-hidden">1.89</td>
<td class="tablesaw-cell-hidden">1.56</td>
<td class="tablesaw-cell-hidden">5.67</td>
</tr>
<tr class="per_game multiple-teams-highlight">
<td class="tablesaw-cell-persist">2019-20 *</td>
<td id="teamLineinternational_reg_Per_Game_5"><a href="/international/league/106/Basketball-Champions-League-Europe/team/473/ICL-Manresa">ICL Manresa</a></td>
<td><a href="/international/league/106/Basketball-Champions-League-Europe">BCL-Eu</a></td>
<td>8</td>
<td>4</td>
<td>20.3</td>
<td>4.00</td>
<td>9.25</td>
<td>.432</td>
<td>2.25</td>
<td>5.00</td>
<td>.450</td>
<td>1.12</td>
<td>1.50</td>
<td>.750</td>
<td>0.12</td>
<td>1.12</td>
<td>1.25</td>
<td>3.12</td>
<td>1.25</td>
<td class="tablesaw-cell-hidden">0.38</td>
<td class="tablesaw-cell-hidden">2.62</td>
<td class="tablesaw-cell-hidden">1.62</td>
<td class="tablesaw-cell-hidden">11.38</td>
</tr>
</tbody>
<tfoot></tfoot>
</table>

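Thank you for taking the time to read my question; I hope I have explained it clearly enough. I am very new to coding/programming (I started a few weeks ago), so please keep that in mind when answering. Thanks again!

3 answers:

Answer 0 (score: 0)

You can use pandas: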

import pandas as pd

df = pd.read_html(url) # df -> list of tables

print(len(df)) # 29 

You can then pick the table you need from that list.
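One way to pick the right table from that list is to look at column names rather than list positions. The sketch below is only an illustration: `pick_table` is a hypothetical helper, and the two small DataFrames are hand-built stand-ins for what `pd.read_html(url)` would return (the first one is made up; the second reuses a few values from the table in the question).

```python
import pandas as pd


def pick_table(tables, required_columns):
    """Return the first DataFrame whose columns include every required name.

    Hypothetical helper, not part of pandas. Returns None if nothing matches.
    """
    wanted = set(required_columns)
    for frame in tables:
        if wanted.issubset(str(col) for col in frame.columns):
            return frame
    return None


# Stand-ins for the list returned by pd.read_html(url).
tables = [
    # Made-up table, only here to show that non-matching tables are skipped.
    pd.DataFrame({"Name": ["Example Player"], "Position": ["G"]}),
    # A few columns and values copied from the table in the question.
    pd.DataFrame(
        {
            "Season": ["2012-13", "2017-18"],
            "Team": ["Team England U18 Men", "Coosur Real Betis"],
            "GP": [3, 34],
            "PTS": [15.67, 8.21],
        }
    ),
]

stats = pick_table(tables, ["Season", "GP", "PTS"])
print(stats)
```

On the real page several tables may share the same column names (per game, totals, and so on), so this returns only the first match; extra checks would be needed to tell them apart.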

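Answer 1 (score: 0)

The table IDs are assigned dynamically, so I suggest locating your table another way. Assuming you want the "NBA Summer League Stats - Totals" table, try: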

import re

# Find the heading text, step up to its enclosing tag, then take the next sibling.
table_heading = 'NBA Summer League Stats - Totals'
table = (
    soup.find(string=re.compile(table_heading))
    .find_parent()
    .find_next_sibling()
)
print(table)

You can change table_heading to the heading of any other table on the page. Let me know if this helps.
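The heading-based lookup can be tried without touching the live site. The sketch below runs on a made-up miniature page (the headings follow the naming used above, but the ids and cell values are invented), and it uses `find_next("table")` instead of parent/sibling navigation, which is a swapped-in variation that assumes the table is the first `<table>` after the heading text. `table_after_heading` is a hypothetical helper name.

```python
import re

from bs4 import BeautifulSoup

# Made-up miniature page: two headed tables with opaque ids.
sample_html = """
<h2>NBA Summer League Stats - Per Game</h2>
<table id="table-1"><tr><th>GP</th></tr><tr><td>5</td></tr></table>
<h2>NBA Summer League Stats - Totals</h2>
<table id="table-2"><tr><th>GP</th></tr><tr><td>5</td></tr></table>
"""


def table_after_heading(soup, heading_text):
    """Return the first <table> that follows the given heading text, or None."""
    # re.escape keeps characters in the heading from being read as regex syntax.
    heading = soup.find(string=re.compile(re.escape(heading_text)))
    if heading is None:
        return None
    return heading.find_next("table")


soup = BeautifulSoup(sample_html, "html.parser")
table = table_after_heading(soup, "NBA Summer League Stats - Totals")
print(table["id"] if table is not None else "heading not found")
```

Returning None when the heading is missing avoids the AttributeError that a chained lookup raises when the first `find` comes back empty.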

Answer 2 (score: 0)

Use pandas to read the table tags, and select the one you want by its id attribute:

import pandas as pd

url = 'https://basketball.realgm.com/player/Luke-Nelson/Summary/50483'
df = pd.read_html(url, attrs={'id':'table-1696'})[0]
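`pd.read_html` needs an HTML parser such as lxml (or BeautifulSoup with html5lib) to be installed. If that is not available, a rough equivalent can be written by hand with BeautifulSoup and the built-in `html.parser` already used in the question. This is a minimal sketch, not a replacement for `read_html`: `table_to_frame` is a hypothetical helper, the sample HTML is a trimmed copy of the table from the question (three columns, two rows), and it assumes the table has a `<thead>` and a `<tbody>`.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Trimmed copy of the table from the question.
sample_html = """
<table id="table-1696">
<thead><tr>
<th><button>Season</button></th>
<th><button>GP</button></th>
<th><button>PTS</button></th>
</tr></thead>
<tbody>
<tr><td>2012-13</td><td>3</td><td>15.67</td></tr>
<tr><td>2017-18</td><td>34</td><td>8.21</td></tr>
</tbody>
</table>
"""


def table_to_frame(html, table_id):
    """Build a DataFrame from the <table> with the given id (cells stay as text)."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is None:
        raise ValueError(f"no table with id {table_id!r}")
    # Column names come from the header cells.
    headers = [th.get_text(strip=True) for th in table.find("thead").find_all("th")]
    # Each body row becomes one list of cell texts.
    rows = [
        [td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.find("tbody").find_all("tr")
    ]
    return pd.DataFrame(rows, columns=headers)


df = table_to_frame(sample_html, "table-1696")
print(df)
```

Every cell comes back as a string here, so numeric columns would still need something like `pd.to_numeric`. With either approach, `df.to_excel("stats.xlsx", index=False)` writes the result to Excel, assuming an Excel writer such as openpyxl is installed (the file name is just an example).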