How do I scrape tables from the web in a for loop and combine them after each iteration?

Asked: 2019-08-20 00:54:15

Tags: python pandas beautifulsoup

from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

age = 23
final = pd.DataFrame(columns =['BPM','MP'])
stats = []

headers  = ["Player", "Season", "Age", "Tm", "Lg", "BPM", "G", "GS", "MP", "FG", "FGA", "2P", "2PA", "3P", "3PA", "FT", "FTA", "ORB", "DRB", "TRB", "AST", "STL", "BLK", "TOV", "PF", "PTS", "FG%", "2P%", "3P%", "eFG%", "FT%", "TS%"]

for offset in [0,100]:

    url = "https://www.basketball-reference.com/play-index/psl_finder.cgi?request=1&match=single&type=totals&per_minute_base=36&per_poss_base=100&season_start=1&season_end=-1&lg_id=NBA&age_min={}&age_max={}&is_playoffs=N&height_min=0&height_max=99&year_min=2001&birth_country_is=Y&as_comp=gt&as_val=0&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&order_by=bpm&offset={}".format(age,age,offset)

    html = urlopen(url)
    soup = BeautifulSoup(html)

    soup.findAll('tr', limit=2)

    rows = soup.findAll('tr')[1:]
    player_stats = [[td.getText() for td in rows[i].findAll('td')]
        for i in range(len(rows))]

    stats = pd.DataFrame(player_stats, columns = headers)

    stats = stats.mask(stats.eq('None')).dropna()

    stats = stats.append(stats)

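So when the offset is 0, "stats" is one 100-row table (call it table A). When the offset is 100, "stats" is a different 100-row table (table B). I want the results of the two tables simply combined into one larger table.

After running this code, "stats" is a 200-row table, but it is just table B repeated twice. How do I get table A + table B?

If it matters, this will be extended to the offsets [0,100,200,300,400,500,600,700,800,900,100], but I think a solution that works here would work for that as well.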

2 answers:

Answer 0 (score: 1)

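A more concise approach here is to use pd.read_html, passing header=1 so the column names come from the correct header row. You can then pass the list of dataframes to pd.concat() and set the player rank (Rk) as the index: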

import pandas as pd

age = 23

my_url = "https://www.basketball-reference.com/play-index/psl_finder.cgi?request=1&match=single&type=totals&per_minute_base=36&per_poss_base=100&season_start=1&season_end=-1&lg_id=NBA&age_min={}&age_max={}&is_playoffs=N&height_min=0&height_max=99&year_min=2001&birth_country_is=Y&as_comp=gt&as_val=0&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&order_by=bpm&offset={}"

df = pd.concat([pd.read_html(my_url.format(age,age,offset), header=1)[0] for offset in [0,100]]).set_index('Rk')

Here is a snapshot of the output:

              Player   Season Age   Tm   Lg  ...    2P%   3P%   eFG%   FT%   TS%
Rk                                           ...                                
1     Jarnell Stokes  2016-17  23  DEN  NBA  ...  1.000   NaN  1.000  .500  .798
2       LeBron James  2007-08  23  CLE  NBA  ...   .531  .315   .518  .712  .568
3         Chris Paul  2008-09  23  NOH  NBA  ...   .525  .364   .528  .868  .599
4      Tracy McGrady  2002-03  23  ORL  NBA  ...   .481  .386   .505  .793  .564
5       Nikola Jokić  2018-19  23  DEN  NBA  ...   .569  .307   .545  .821  .589
..               ...      ...  ..  ...  ...  ...    ...   ...    ...   ...   ...
196    Tyler Johnson  2015-16  23  MIA  NBA  ...   .529  .380   .541  .797  .579
197     Luke Ridnour  2004-05  23  SEA  NBA  ...   .414  .376   .450  .883  .504
198     Cole Aldrich  2011-12  23  OKC  NBA  ...   .524   NaN   .524  .929  .592
199  Leandro Barbosa  2005-06  23  PHO  NBA  ...   .501  .444   .558  .755  .589
200      Eric Gordon  2011-12  23  NOH  NBA  ...   .530  .250   .486  .754  .549
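
To see the concat-then-set_index pattern in isolation, here is a minimal sketch that swaps the scraped pages for two tiny made-up dataframes (page_a, page_b and their contents are placeholders, not real data). It also shows one way to generate the offsets with range() instead of typing them out.

```python
import pandas as pd

# Toy stand-ins for two scraped pages (made-up rows, not real stats).
page_a = pd.DataFrame({"Rk": [1, 2, 3], "Player": ["A1", "A2", "A3"]})
page_b = pd.DataFrame({"Rk": [4, 5, 6], "Player": ["B1", "B2", "B3"]})

# Offsets can be generated rather than listed by hand.
offsets = list(range(0, 200, 100))  # [0, 100]

# Map each offset to its page; in the real code this is the read_html call.
pages = dict(zip(offsets, [page_a, page_b]))

# Build the list of per-page frames, stack them, and index by rank.
combined = pd.concat([pages[offset] for offset in offsets]).set_index("Rk")
print(combined)
```

Because each page carries its own Rk values, using Rk as the index also avoids the repeated 0..99 row labels that plain concatenation would otherwise leave behind.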

Answer 1 (score: 0)

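You just need to initialize stats once, outside the for loop:

stats = pd.DataFrame(columns = headers)

Inside the for loop, append each page's data to it, as you were doing. Right now, every time the loop runs stats = pd.DataFrame(player_stats, columns = headers) it creates a brand-new dataframe, which wipes out the data from the previous iteration.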
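
A minimal sketch of that idea follows, with two adjustments that are not part of the answer itself. First, each page is kept under a separate name (page) so it never overwrites the accumulated result; in the question, stats = stats.append(stats) appends table B to itself, which is why B shows up twice. Second, the pages are collected in a plain list and combined with pd.concat, because DataFrame.append was deprecated in pandas 1.4 and removed in 2.0. The fetch_page function and the shortened headers list are hypothetical stand-ins for the scraping step.

```python
import pandas as pd

headers = ["Player", "BPM", "MP"]  # shortened for illustration


def fetch_page(offset):
    """Hypothetical stand-in for the scraping step: rows for one page."""
    return [[f"player_{offset + i}", str(i), str(10 * i)] for i in range(3)]


frames = []  # created once, before the loop
for offset in [0, 100]:
    # Use a separate name for the current page so earlier pages survive.
    page = pd.DataFrame(fetch_page(offset), columns=headers)
    frames.append(page)  # list.append, not DataFrame.append

# One concat at the end; ignore_index renumbers the rows 0..n-1.
stats = pd.concat(frames, ignore_index=True)
print(stats)
```

Concatenating once after the loop is also cheaper than growing a dataframe on every iteration, since each append-style call copies all rows collected so far.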