from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
age = 23
final = pd.DataFrame(columns =['BPM','MP'])
stats = []
headers = ["Player", "Season", "Age", "Tm", "Lg", "BPM", "G", "GS", "MP", "FG", "FGA", "2P", "2PA", "3P", "3PA", "FT", "FTA", "ORB", "DRB", "TRB", "AST", "STL", "BLK", "TOV", "PF", "PTS", "FG%", "2P%", "3P%", "eFG%", "FT%", "TS%"]
for offset in [0,100]:
    url = "https://www.basketball-reference.com/play-index/psl_finder.cgi?request=1&match=single&type=totals&per_minute_base=36&per_poss_base=100&season_start=1&season_end=-1&lg_id=NBA&age_min={}&age_max={}&is_playoffs=N&height_min=0&height_max=99&year_min=2001&birth_country_is=Y&as_comp=gt&as_val=0&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&order_by=bpm&offset={}".format(age,age,offset)
    html = urlopen(url)
    soup = BeautifulSoup(html)
    soup.findAll('tr', limit=2)
    rows = soup.findAll('tr')[1:]
    player_stats = [[td.getText() for td in rows[i].findAll('td')]
                    for i in range(len(rows))]
    stats = pd.DataFrame(player_stats, columns = headers)
    stats = stats.mask(stats.eq('None')).dropna()
    stats = stats.append(stats)
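So when the offset is 0, "stats" is a 100-row table (call it table A). When the offset is 100, "stats" is a different 100-row table (table B). I want to combine the results of the two tables into one larger table.
After running this code, "stats" is a 200-row table, but it is just table B repeated twice. How do I get table A + table B?
In case it matters, this may be extended to offsets [0,100,200,300,400,500,600,700,800,900,100], but I assume a solution that works here will carry over.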
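Answer 0 (score: 1)
It is cleaner to use pd.read_html here and to specify the column names with header=1. You can then pass the list of DataFrames to pd.concat() and set the index to the player rank (Rk):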
import pandas as pd
age = 23
my_url = "https://www.basketball-reference.com/play-index/psl_finder.cgi?request=1&match=single&type=totals&per_minute_base=36&per_poss_base=100&season_start=1&season_end=-1&lg_id=NBA&age_min={}&age_max={}&is_playoffs=N&height_min=0&height_max=99&year_min=2001&birth_country_is=Y&as_comp=gt&as_val=0&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&order_by=bpm&offset={}"
df = pd.concat([pd.read_html(my_url.format(age,age,offset), header=1)[0] for offset in [0,100]]).set_index('Rk')
Here is a snapshot of the output:
              Player   Season  Age   Tm   Lg  ...    2P%   3P%   eFG%   FT%   TS%
Rk                                            ...
1     Jarnell Stokes  2016-17   23  DEN  NBA  ...  1.000   NaN  1.000  .500  .798
2       LeBron James  2007-08   23  CLE  NBA  ...   .531  .315   .518  .712  .568
3         Chris Paul  2008-09   23  NOH  NBA  ...   .525  .364   .528  .868  .599
4      Tracy McGrady  2002-03   23  ORL  NBA  ...   .481  .386   .505  .793  .564
5       Nikola Jokić  2018-19   23  DEN  NBA  ...   .569  .307   .545  .821  .589
..               ...      ...   ..  ...  ...  ...    ...   ...    ...   ...   ...
196    Tyler Johnson  2015-16   23  MIA  NBA  ...   .529  .380   .541  .797  .579
197     Luke Ridnour  2004-05   23  SEA  NBA  ...   .414  .376   .450  .883  .504
198     Cole Aldrich  2011-12   23  OKC  NBA  ...   .524   NaN   .524  .929  .592
199  Leandro Barbosa  2005-06   23  PHO  NBA  ...   .501  .444   .558  .755  .589
200      Eric Gordon  2011-12   23  NOH  NBA  ...   .530  .250   .486  .754  .549
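The same list-comprehension pattern should extend to the longer offset list mentioned in the question. A minimal sketch, assuming a hypothetical load_page(offset) helper that stands in for pd.read_html(my_url.format(age, age, offset), header=1)[0] and returns synthetic rows so the example runs without network access; the 100-row page size is taken from the question, not verified against the site:

```python
import pandas as pd

PAGE_SIZE = 100  # assumption: each results page holds 100 rows


def load_page(offset, page_size=PAGE_SIZE):
    """Hypothetical stand-in for the pd.read_html call; returns fake rows."""
    ranks = list(range(offset + 1, offset + page_size + 1))
    return pd.DataFrame({"Rk": ranks, "Player": [f"player_{r}" for r in ranks]})


# range(0, 1000, 100) yields 0, 100, ..., 900: one offset per page.
offsets = range(0, 1000, PAGE_SIZE)

# Build one frame per page, stack them, then index by player rank.
df = pd.concat([load_page(offset) for offset in offsets]).set_index("Rk")

print(df.shape)
print(df.index.is_unique)
```

Building the offsets with range avoids typing the list by hand, which is where an entry is easy to mistype.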
Answer 1 (score: 0)
You just need to initialize stats once, outside the for loop:
stats = pd.DataFrame(columns = headers)
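Inside the for loop, append the data as you are already doing. Right now, each pass through the loop runs stats = pd.DataFrame(player_stats, columns = headers), which creates a new DataFrame and wipes out the old data.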
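Note that in the question's loop the per-page frame and the accumulator share the name stats, so the per-page frame needs its own name as well. A minimal sketch of that fix on toy data, assuming a hypothetical fake_page helper in place of the scraping code. It collects the pages in a list and calls pd.concat once, because DataFrame.append was removed in pandas 2.0:

```python
import pandas as pd

headers = ["Player", "BPM"]


def fake_page(offset):
    """Hypothetical stand-in for the scraped rows of one results page."""
    return [[f"player_{offset + i}", str(i)] for i in range(3)]


pages = []  # created once, before the loop
for offset in [0, 100]:
    player_stats = fake_page(offset)
    # A separate name for the per-page frame, so earlier pages are not overwritten.
    page = pd.DataFrame(player_stats, columns=headers)
    pages.append(page)  # list.append, not DataFrame.append

# One concat at the end; ignore_index renumbers the rows from 0.
stats = pd.concat(pages, ignore_index=True)
print(stats)
```

Concatenating once after the loop also avoids copying the growing table on every iteration.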