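I am analyzing PGA Tour data over time. For machine learning purposes, I want the columns to represent several weeks of statistics. Below is an example of the original data structure.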
import pandas as pd
import numpy as np
data = {'Player Name':['Tiger','Tiger','Tiger','Tiger','Tiger','Tiger','Jack',
'Jack','Jack','Jack','Jack','Jack','Jack'],
'Date':[1, 2, 4, 6, 7, 9, 1, 3, 4, 6, 9, 10, 11],
'SG Total':[13, 2, 14, 6, 8, 1, 1, 3, 8, 4, 9, 2, 1]}
df_original = pd.DataFrame(data)
I would like to get the data into the following format.
data = {'Player Name':['Tiger','Tiger','Tiger','Jack','Jack',
'Jack','Jack'],
'Date':[6, 7, 9, 6, 9, 10, 11],
'SG Total (Date t-3)':[13, 2, 14, 1, 3, 8, 4],
'SG Total (Date t-2)':[2, 14, 6, 3, 8, 4, 9],
'SG Total (Date t-1)':[14, 6, 8, 8, 4, 9, 2],
'SG Total (Date y)': [6, 8, 1, 4, 9, 2, 1]}
df_correct = pd.DataFrame(data)
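In the real dataset I am working with, I have about 1,000 columns, so the new desired dataset could have about 4,000 columns. As you can see in the desired dataset, I dropped each player's first three weeks: I use an individual's first three weeks to fill in (t-3), (t-2), and (t-1), so the Date entries begin at that player's fourth week of data. Originally, I created a dataset for every week, regardless of whether a player played, and used this code to build the desired DataFrame.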
#%% Creates weekly dataframes & predictions dataframes
#Creates dataframes of each week
dict_of_weeks = {}
for i in range(1,df_numeric_combined['Date'].nunique()+1):
    dict_of_weeks['Week_{}_df'.format(i)] = df_numeric_combined[df_numeric_combined['Date'] == i]
    dict_of_weeks['Week_{}_df'.format(i)].columns += ' (Week ' + str(i) + ')'
    dict_of_weeks['Week_{}_df'.format(i)].rename(columns={'Player Name (Week ' + str(i) + ')' : 'Player Name' , 'Date (Week ' + str(i) + ')' : 'Date'},inplace=True)
#Creating dataframes for prediction of each week
import functools
dict_of_predictions = {}
df_weeks = []
for i in range(4,df_numeric_combined['Date'].nunique()+1):
    dfs = [dict_of_weeks['Week_'+str(i-3)+'_df'], dict_of_weeks['Week_'+str(i-2)+'_df'], dict_of_weeks['Week_'+str(i-1)+'_df'], dict_of_weeks['Week_'+str(i)+'_df']]
    dict_of_predictions['Week_{}_predictions'.format(i)] = functools.reduce(lambda left,right: pd.merge(left,right,on=['Player Name'], how='outer'), dfs)
    cols = []
    count = 1
    for column in dict_of_predictions['Week_{}_predictions'.format(i)].columns:
        if column == 'Date_y':
            cols.append('Date_y_'+ str(count))
            count+=1
            continue
        cols.append(column)
    dict_of_predictions['Week_{}_predictions'.format(i)].columns = cols
    dict_of_predictions['Week_{}_predictions'.format(i)].drop(columns = ['Date_x', 'Date_y_1'],inplace = True)
    dict_of_predictions['Week_{}_predictions'.format(i)].rename(columns={'Date_y_2':'Date'},inplace=True)
    dict_of_predictions['Week_{}_predictions'.format(i)].columns = dict_of_predictions['Week_{}_predictions'.format(i)].columns.str.replace('(Week ' + str(i-3)+ ')', 'Week t-3').str.replace('(Week ' + str(i-2)+ ')', 'Week t-2').str.replace('(Week ' + str(i-1)+ ')', 'Week t-1').str.replace('(Week ' + str(i)+ ')', 'Week y')
    df_weeks.append(dict_of_predictions['Week_{}_predictions'.format(i)])
#Combines predictions dataframes
df = pd.concat(dict_of_predictions.values(), axis=0, join='inner')
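However, the code I wrote only works if a player has played in consecutive weeks, because it relies on the week number and subtracts 3, 2, and 1.
The end goal is to get the data into the df_correct format.
Thanks!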
Answer 0 (score: 2)
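If I understand your requirement correctly, you can use shift on a groupby of the sorted dataframe to get each player's results from the previous weeks they played: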
## Sort first by player and date
df_corrected = df_original.sort_values(['Player Name','Date'])
your_columns = ['SG Total'] ## list the columns to lag here (your ~1000 original columns)
for col in your_columns:
    for s in [3,2,1,0]: ### time lapses
        df_corrected[f'{col} (Date t-{s})'] = df_corrected.groupby('Player Name')[col].shift(s)
df_corrected.drop(your_columns, axis=1, inplace=True)
which outputs:
Out[12]:
Player Name Date SG Total (Date t-3) SG Total (Date t-2) \
6 Jack 1 NaN NaN
7 Jack 3 NaN NaN
8 Jack 4 NaN 1.0
9 Jack 6 1.0 3.0
10 Jack 9 3.0 8.0
11 Jack 10 8.0 4.0
12 Jack 11 4.0 9.0
0 Tiger 1 NaN NaN
1 Tiger 2 NaN NaN
2 Tiger 4 NaN 13.0
3 Tiger 6 13.0 2.0
4 Tiger 7 2.0 14.0
5 Tiger 9 14.0 6.0
SG Total (Date t-1) SG Total (Date t-0)
6 NaN 1
7 1.0 3
8 3.0 8
9 8.0 4
10 4.0 9
11 9.0 2
12 2.0 1
0 NaN 13
1 13.0 2
2 2.0 14
3 14.0 6
4 6.0 8
5 8.0 1
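The output above still contains each player's first three appearances (the rows with NaN) and labels the current week t-0 rather than y. Here is a minimal sketch of one way to get to the df_correct layout, assuming that "previous week" means the player's previous appearance rather than the previous calendar week. The names stat_cols, n_lags, and df_result are hypothetical and not part of the original answer. It also shifts all listed columns in one call per lag and joins the pieces with pd.concat, which may be preferable to inserting columns one at a time when there are around 1,000 of them.

```python
import pandas as pd

# Sample data copied from the question.
data = {
    'Player Name': ['Tiger'] * 6 + ['Jack'] * 7,
    'Date': [1, 2, 4, 6, 7, 9, 1, 3, 4, 6, 9, 10, 11],
    'SG Total': [13, 2, 14, 6, 8, 1, 1, 3, 8, 4, 9, 2, 1],
}
df_original = pd.DataFrame(data)

stat_cols = ['SG Total']  # hypothetical name: the statistic columns to lag
n_lags = 3                # hypothetical name: number of previous appearances to keep

# Sort so that shift() walks through each player's rows in date order.
df_sorted = df_original.sort_values(['Player Name', 'Date'])
grouped = df_sorted.groupby('Player Name')

# Build one block of lagged columns per lag, then join all blocks at once.
parts = [df_sorted[['Player Name', 'Date']]]
for s in range(n_lags, 0, -1):
    lagged = grouped[stat_cols].shift(s)
    lagged.columns = [f'{c} (Date t-{s})' for c in stat_cols]
    parts.append(lagged)

# The current week's values, labelled "y" as in df_correct.
current = df_sorted[stat_cols].copy()
current.columns = [f'{c} (Date y)' for c in stat_cols]
parts.append(current)

df_wide = pd.concat(parts, axis=1)

# cumcount() numbers each player's rows 0, 1, 2, ...;
# keep only rows from the player's fourth appearance onward.
appearance = grouped.cumcount()
df_result = df_wide[appearance >= n_lags].reset_index(drop=True)
print(df_result)
```

Filtering on cumcount rather than calling dropna keeps rows whose statistics are genuinely missing in the source data. The rows come out sorted by player name (Jack before Tiger), and the lagged columns are floats because shift introduces NaN before the filter is applied.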