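I'm trying to create 3 new columns in a dataframe based on previous pairwise information.
You can think of the dataframe as the results ("xx" column) of competitions between different types ("type" column) on different dates ("date" column).
The idea is to create the following new columns:
(i) numb_comp_past: the total number of times each type has previously faced the competitors it meets on the current date.
(ii) win_comp_past: the sum of wins (+1), draws (+0), and losses (-1) over the previous competitions between the types that are competing against each other on the current date.
(iii) win_comp_past_difs: the sum of the result differences (own xx minus the opponent's xx) over those same previous competitions.
The original dataframe (df) is the following: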
import numpy as np
import pandas as pd

idx = [np.array(['Jan-18', 'Jan-18', 'Feb-18', 'Mar-18', 'Mar-18', 'Mar-18', 'Mar-18', 'Mar-18', 'May-18', 'Jun-18', 'Jun-18', 'Jun-18', 'Jul-18', 'Aug-18', 'Aug-18', 'Sep-18', 'Sep-18', 'Oct-18', 'Oct-18', 'Oct-18', 'Nov-18', 'Dec-18', 'Dec-18']),
       np.array(['A', 'B', 'B', 'A', 'B', 'C', 'D', 'E', 'B', 'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'A', 'B', 'C'])]
data = [{'xx': 1}, {'xx': 5}, {'xx': 3}, {'xx': 2}, {'xx': 7}, {'xx': 3}, {'xx': 1}, {'xx': 6}, {'xx': 3}, {'xx': 5}, {'xx': 2}, {'xx': 3}, {'xx': 1}, {'xx': 9}, {'xx': 3}, {'xx': 2}, {'xx': 7}, {'xx': 3}, {'xx': 6}, {'xx': 8}, {'xx': 2}, {'xx': 7}, {'xx': 9}]
df = pd.DataFrame(data, index=idx, columns=['xx'])
df.index.names = ['date', 'type']
df = df.reset_index()
df['date'] = pd.to_datetime(df['date'], format='%b-%y')  # 'Jan-18' -> 2018-01-01
df = df.set_index(['date', 'type'])
df['xx'] = df.xx.astype('float')
It looks like this:
                  xx
date       type
2018-01-01 A     1.0
           B     5.0
2018-02-01 B     3.0
2018-03-01 A     2.0
           B     7.0
           C     3.0
           D     1.0
           E     6.0
2018-05-01 B     3.0
2018-06-01 A     5.0
           B     2.0
           C     3.0
2018-07-01 A     1.0
2018-08-01 B     9.0
           C     3.0
2018-09-01 A     2.0
           B     7.0
2018-10-01 C     3.0
           A     6.0
           B     8.0
2018-11-01 A     2.0
2018-12-01 B     7.0
           C     9.0
The 3 new columns I need to add to the dataframe are shown below (the expected output of the pandas code):
                  xx  numb_comp_past  win_comp_past  win_comp_past_difs
date       type
2018-01-01 A     1.0             0.0            0.0                 0.0
           B     5.0             0.0            0.0                 0.0
2018-02-01 B     3.0             0.0            0.0                 0.0
2018-03-01 A     2.0             1.0           -1.0                -4.0
           B     7.0             1.0            1.0                 4.0
           C     3.0             0.0            0.0                 0.0
           D     1.0             0.0            0.0                 0.0
           E     6.0             0.0            0.0                 0.0
2018-05-01 B     3.0             0.0            0.0                 0.0
2018-06-01 A     5.0             3.0           -3.0               -10.0
           B     2.0             3.0            3.0                13.0
           C     3.0             2.0            0.0                -3.0
2018-07-01 A     1.0             0.0            0.0                 0.0
2018-08-01 B     9.0             2.0            0.0                 3.0
           C     3.0             2.0            0.0                -3.0
2018-09-01 A     2.0             3.0           -1.0                -6.0
           B     7.0             3.0            1.0                 6.0
2018-10-01 C     3.0             5.0           -1.0               -10.0
           A     6.0             6.0           -2.0               -10.0
           B     8.0             7.0            3.0                20.0
2018-11-01 A     2.0             0.0            0.0                 0.0
2018-12-01 B     7.0             4.0            2.0                14.0
           C     9.0             4.0           -2.0               -14.0
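Note that:
(i) For numb_comp_past, if there are no previous competitions I assign a value of 0. On 2018-06-01, for example, type A has a value of 3, given that it previously competed with type B on 2018-01-01 and 2018-03-01, and with type C on 2018-03-01. Only past meetings with the types present on the current date count, which is why A's meetings with D and E on 2018-03-01 are not included.
(ii) For win_comp_past, if there are no previous competitions I assign a value of 0. On 2018-06-01, for example, type A has a value of -3, given that it previously lost to type B on 2018-01-01 (-1) and on 2018-03-01 (-1), and to type C on 2018-03-01 (-1). Adding these gives -1-1-1 = -3.
(iii) For win_comp_past_difs, if there are no previous competitions I assign a value of 0. On 2018-06-01, for example, type A has a value of -10, given that it previously lost to type B on 2018-01-01 by a difference of -4 (= 1-5) and on 2018-03-01 by a difference of -5 (= 2-7), and to type C on 2018-03-01 by a difference of -1 (= 2-3). Adding these gives -4-5-1 = -10.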
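The arithmetic in these notes can be checked with a brute-force pass over the data. The sketch below is only a reference check under my reading of the three definitions, not an efficient solution; the helper name past_stats and the "YYYY-MM" date labels are assumptions made for illustration.

```python
from collections import defaultdict

# (date, type, xx) records copied from the question's data.
# Dates are written as "YYYY-MM" strings so that plain string
# comparison orders them chronologically (an illustrative choice).
records = [
    ("2018-01", "A", 1), ("2018-01", "B", 5),
    ("2018-02", "B", 3),
    ("2018-03", "A", 2), ("2018-03", "B", 7), ("2018-03", "C", 3),
    ("2018-03", "D", 1), ("2018-03", "E", 6),
    ("2018-05", "B", 3),
    ("2018-06", "A", 5), ("2018-06", "B", 2), ("2018-06", "C", 3),
    ("2018-07", "A", 1),
    ("2018-08", "B", 9), ("2018-08", "C", 3),
    ("2018-09", "A", 2), ("2018-09", "B", 7),
    ("2018-10", "C", 3), ("2018-10", "A", 6), ("2018-10", "B", 8),
    ("2018-11", "A", 2),
    ("2018-12", "B", 7), ("2018-12", "C", 9),
]

# date -> {type: xx}
by_date = defaultdict(dict)
for date, typ, xx in records:
    by_date[date][typ] = xx


def past_stats(date, typ):
    """Hypothetical helper: (count, win sum, difference sum) over earlier
    meetings between `typ` and the opponents it faces on `date`."""
    opponents = set(by_date[date]) - {typ}
    count = wins = difs = 0
    for earlier, scores in by_date.items():
        # only strictly earlier dates on which `typ` competed
        if earlier >= date or typ not in scores:
            continue
        for opp in opponents & set(scores):
            diff = scores[typ] - scores[opp]
            count += 1
            wins += (diff > 0) - (diff < 0)  # +1 win, 0 draw, -1 loss
            difs += diff
    return count, wins, difs


print(past_stats("2018-06", "A"))  # (3, -3, -10)
print(past_stats("2018-10", "B"))  # (7, 3, 20)
```

Both printed tuples agree with the corresponding rows of the expected output, and a date with a single competitor, such as ("2018-02", "B"), gives (0, 0, 0).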
I really don't know how to start solving this problem. Any ideas on how to build the new columns described in (i), (ii), and (iii) are very welcome.
Answer 0 (score: 1)
Here's my take:
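# get differences of pairs within one date, useful for win counts and win_difs
def get_diff(x):
    teams = x.index.get_level_values(1)
    vals = x.to_numpy()
    # entry [i, j] is xx of team i minus xx of team j
    tmp = pd.DataFrame(vals[:, None] - vals[None, :],
                       columns=teams.values,
                       index=teams.values).stack()
    # drop the diagonal (a type paired with itself)
    return tmp[tmp.index.get_level_values(0) != tmp.index.get_level_values(1)]

# index levels of new_df: (date, type, opponent)
new_df = df.groupby('date').xx.apply(get_diff).to_frame('xx')

# win matches: +1 win, 0 draw, -1 loss
new_df['win'] = new_df.xx.ge(0).astype(int) - new_df.xx.le(0).astype(int)

# group by players, i.e. by (type, opponent) pair
groups = new_df.groupby(level=[1, 2])

# sum function: running total over earlier matches only
def cumsum_shift(x):
    return x.cumsum().shift()

# assign new values, summing over opponents for each (date, type);
# Series.sum(level=...) was removed in pandas 2.0, so group on the levels instead
df['num_comp_past'] = groups.xx.cumcount().groupby(level=[0, 1]).sum()
df['win_comp_past'] = groups.win.transform(cumsum_shift).groupby(level=[0, 1]).sum()
df['win_comp_past_difs'] = groups.xx.transform(cumsum_shift).groupby(level=[0, 1]).sum()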
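To see what the pairwise-difference step produces, here is a small standalone sketch for a single date, using the 2018-06-01 scores from the question (A=5, B=2, C=3). The variable names scores, vals, and diff are only for illustration.

```python
import numpy as np
import pandas as pd

# Scores on 2018-06-01, taken from the question's data.
scores = pd.Series([5.0, 2.0, 3.0], index=["A", "B", "C"])
vals = scores.to_numpy()

# Broadcasting a column vector against a row vector gives every pair:
# entry [i, j] is the score of row type i minus the score of column type j.
diff = pd.DataFrame(vals[:, None] - vals[None, :],
                    index=scores.index, columns=scores.index)
print(diff)

# The sign of each difference is the match result: +1 win, 0 draw, -1 loss.
print(np.sign(diff))
```

Row A of diff reads 0, 3, 2: A beat B by 3 and C by 2 on that date. The diagonal is always 0, since it pairs a type with itself, which is why those entries are filtered out before any sums are taken.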
Output:
+------------+------+-----+---------------+---------------+--------------------+
| | | xx | num_comp_past | win_comp_past | win_comp_past_difs |
+------------+------+-----+---------------+---------------+--------------------+
| date | type | | | | |
+------------+------+-----+---------------+---------------+--------------------+
| 2018-01-01 | A | 1.0 | 0.0 | 0.0 | 0.0 |
| | B | 5.0 | 0.0 | 0.0 | 0.0 |
| 2018-02-01 | B | 3.0 | NaN | NaN | NaN |
| 2018-03-01 | A | 2.0 | 1.0 | -1.0 | -4.0 |
| | B | 7.0 | 1.0 | 1.0 | 4.0 |
| | C | 3.0 | 0.0 | 0.0 | 0.0 |
| | D | 1.0 | 0.0 | 0.0 | 0.0 |
| | E | 6.0 | 0.0 | 0.0 | 0.0 |
| 2018-05-01 | B | 3.0 | NaN | NaN | NaN |
| 2018-06-01 | A | 5.0 | 3.0 | -3.0 | -10.0 |
| | B | 2.0 | 3.0 | 3.0 | 13.0 |
| | C | 3.0 | 2.0 | 0.0 | -3.0 |
| 2018-07-01 | A | 1.0 | NaN | NaN | NaN |
| 2018-08-01 | B | 9.0 | 2.0 | 0.0 | 3.0 |
| | C | 3.0 | 2.0 | 0.0 | -3.0 |
| 2018-09-01 | A | 2.0 | 3.0 | -1.0 | -6.0 |
| | B | 7.0 | 3.0 | 1.0 | 6.0 |
| 2018-10-01 | C | 3.0 | 5.0 | -1.0 | -10.0 |
| | A | 6.0 | 6.0 | -2.0 | -10.0 |
| | B | 8.0 | 7.0 | 3.0 | 20.0 |
| 2018-11-01 | A | 2.0 | NaN | NaN | NaN |
| 2018-12-01 | B | 7.0 | 4.0 | 2.0 | 14.0 |
| | C | 9.0 | 4.0 | -2.0 | -14.0 |
+------------+------+-----+---------------+---------------+--------------------+
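The NaN rows are the dates with a single competitor: there are no pairs on those dates, so they never appear in new_df. Since the question asks for 0 there, one option is to fill those gaps afterwards. A minimal sketch on three rows copied from the output above (the names out and new_cols are hypothetical):

```python
import numpy as np
import pandas as pd

# Three rows copied from the output table; 2018-02-01 had a single competitor.
out = pd.DataFrame(
    {
        "xx": [1.0, 3.0, 2.0],
        "num_comp_past": [0.0, np.nan, 1.0],
        "win_comp_past": [0.0, np.nan, -1.0],
        "win_comp_past_difs": [0.0, np.nan, -4.0],
    },
    index=pd.MultiIndex.from_tuples(
        [("2018-01-01", "A"), ("2018-02-01", "B"), ("2018-03-01", "A")],
        names=["date", "type"],
    ),
)

# Fill only the derived columns, so a genuine gap in xx would stay visible.
new_cols = ["num_comp_past", "win_comp_past", "win_comp_past_difs"]
out[new_cols] = out[new_cols].fillna(0)
print(out)
```

Applying the same fillna(0) to those three columns of df should turn the NaN rows into the zeros shown in the question's expected output.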