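Using Pandas to construct new columns from all previous pairwise information

Time: 2019-05-14 16:47:26

Tags: python pandas dataframe

I am trying to create 3 new columns in a dataframe based on previous pairwise information.

You can think of the dataframe as the results ("xx" column) of competitions between different types ("type" column) on different dates ("date" column).

The idea is to create the following new columns:

(i) numb_comp_past: the total number of times each type has faced its current competitors in the past.

(ii) win_comp_past: the sum of wins (+1), draws (+0) and losses (-1) over all previous competitions between each type and its current competitors.

(iii) win_comp_past_difs: the sum of the result differences over all previous competitions between each type and its current competitors.

  • The original dataframe (df) is as follows: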

    import numpy as np
    import pandas as pd

    # Two index arrays: competition date and competitor type
    idx = [np.array(['Jan-18', 'Jan-18', 'Feb-18', 'Mar-18', 'Mar-18', 'Mar-18','Mar-18', 'Mar-18', 'May-18', 'Jun-18', 'Jun-18', 'Jun-18','Jul-18', 'Aug-18', 'Aug-18', 'Sep-18', 'Sep-18', 'Oct-18','Oct-18', 'Oct-18', 'Nov-18', 'Dec-18', 'Dec-18',]),
           np.array(['A', 'B', 'B', 'A', 'B', 'C', 'D', 'E', 'B', 'A', 'B', 'C','A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'A', 'B', 'C'])]
    data = [{'xx': 1}, {'xx': 5}, {'xx': 3}, {'xx': 2}, {'xx': 7}, {'xx': 3},{'xx': 1}, {'xx': 6}, {'xx': 3}, {'xx': 5}, {'xx': 2}, {'xx': 3},{'xx': 1}, {'xx': 9}, {'xx': 3}, {'xx': 2}, {'xx': 7}, {'xx': 3}, {'xx': 6}, {'xx': 8}, {'xx': 2}, {'xx': 7}, {'xx': 9}]
    df = pd.DataFrame(data, index=idx, columns=['xx'])
    df.index.names = ['date', 'type']
    # Parse the 'Mon-YY' labels into real dates, then restore the MultiIndex
    df = df.reset_index()
    df['date'] = pd.to_datetime(df['date'], format='%b-%y')
    df = df.set_index(['date', 'type'])
    df['xx'] = df.xx.astype('float')
    

It looks like this:

                  xx
date       type
2018-01-01 A     1.0
           B     5.0
2018-02-01 B     3.0
2018-03-01 A     2.0
           B     7.0
           C     3.0
           D     1.0
           E     6.0
2018-05-01 B     3.0
2018-06-01 A     5.0
           B     2.0
           C     3.0
2018-07-01 A     1.0
2018-08-01 B     9.0
           C     3.0
2018-09-01 A     2.0
           B     7.0
2018-10-01 C     3.0
           A     6.0
           B     8.0
2018-11-01 A     2.0
2018-12-01 B     7.0
           C     9.0
  • The 3 new columns I need to add to the dataframe are shown below (the expected output of the Pandas code):

                      xx  numb_comp_past  win_comp_past  win_comp_past_difs
    date       type
    2018-01-01 A     1.0             0.0            0.0                 0.0
               B     5.0             0.0            0.0                 0.0
    2018-02-01 B     3.0             0.0            0.0                 0.0
    2018-03-01 A     2.0             1.0           -1.0                -4.0
               B     7.0             1.0            1.0                 4.0
               C     3.0             0.0            0.0                 0.0
               D     1.0             0.0            0.0                 0.0
               E     6.0             0.0            0.0                 0.0
    2018-05-01 B     3.0             0.0            0.0                 0.0
    2018-06-01 A     5.0             3.0           -3.0               -10.0
               B     2.0             3.0            3.0                13.0
               C     3.0             2.0            0.0                -3.0
    2018-07-01 A     1.0             0.0            0.0                 0.0
    2018-08-01 B     9.0             2.0            0.0                 3.0
               C     3.0             2.0            0.0                -3.0
    2018-09-01 A     2.0             3.0           -1.0                -6.0
               B     7.0             3.0            1.0                 6.0
    2018-10-01 C     3.0             5.0           -1.0               -10.0
               A     6.0             6.0           -2.0               -10.0
               B     8.0             7.0            3.0                20.0
    2018-11-01 A     2.0             0.0            0.0                 0.0
    2018-12-01 B     7.0             4.0            2.0                14.0
               C     9.0             4.0           -2.0               -14.0
    

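Please note:

(i) For numb_comp_past, if there are no previous competitions, I assign a value of 0. For example, on 2018-06-01 type A has a value of 3, given that it previously competed against type B on 2018-01-01 and 2018-03-01, and against type C on 2018-03-01.

(ii) For win_comp_past, if there are no previous competitions, I assign a value of 0. For example, on 2018-06-01 type A has a value of -3, given that it previously lost to type B on 2018-01-01 (-1) and 2018-03-01 (-1), and to type C on 2018-03-01 (-1). Adding these gives -1-1-1 = -3.

(iii) For win_comp_past_difs, if there are no previous competitions, I assign a value of 0. For example, on 2018-06-01 type A has a value of -10, given that it previously lost to type B on 2018-01-01 by a difference of -4 (= 1-5) and on 2018-03-01 by a difference of -5 (= 2-7), and to type C on 2018-03-01 by a difference of -1 (= 2-3). Adding these gives -4-5-1 = -10.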
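The three examples for type A on 2018-06-01 can be checked by hand with a few lines of plain Python. This is only an arithmetic check of the definitions, not a general solution; the names `past_meetings` and `sign` are introduced here for illustration, and the numbers are copied from the dataframe shown earlier.

```python
# Earlier meetings of type A with the opponents it faces on 2018-06-01 (B and C),
# as (date, opponent, xx of A, xx of opponent), copied from the dataframe.
past_meetings = [
    ("2018-01-01", "B", 1.0, 5.0),
    ("2018-03-01", "B", 2.0, 7.0),
    ("2018-03-01", "C", 2.0, 3.0),
]

def sign(v):
    """Return +1 for a win, 0 for a draw and -1 for a loss."""
    return (v > 0) - (v < 0)

# Result difference of each earlier meeting, from A's point of view
diffs = [own - other for _, _, own, other in past_meetings]

numb_comp_past = len(past_meetings)          # number of earlier meetings
win_comp_past = sum(sign(d) for d in diffs)  # wins minus losses
win_comp_past_difs = sum(diffs)              # summed result differences

print(numb_comp_past, win_comp_past, win_comp_past_difs)  # 3 -3 -10.0
```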

I really don't know how to start solving this. Any ideas on how to build the new columns described in (i), (ii) and (iii) are very welcome.

1 Answer:

Answer 0: (score: 1)

Here's my take:

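    # Pairwise differences within one date; used for both win counts and win_difs
    def get_diff(x):
        teams = x.index.get_level_values(1)
        # Work on the NumPy values: recent pandas no longer allows x[:, None] on a Series
        vals = x.to_numpy()
        # Broadcasting: entry (i, j) is xx of team i minus xx of team j
        tmp = pd.DataFrame(vals[:, None] - vals[None, :],
                           columns=teams.values,
                           index=teams.values).stack()
        # Drop the diagonal, i.e. each team paired with itself
        return tmp[tmp.index.get_level_values(0) != tmp.index.get_level_values(1)]

    # new_df is indexed by (date, team, opponent)
    new_df = df.groupby('date').xx.apply(get_diff).to_frame()

    # Match result: +1 for a win, 0 for a draw, -1 for a loss
    new_df['win'] = new_df.xx.ge(0).astype(int) - new_df.xx.le(0).astype(int)

    # Group by (team, opponent) pair
    groups = new_df.groupby(level=[1, 2])

    # Running total over earlier meetings only; shift() excludes the current one
    def cumsum_shift(x):
        return x.cumsum().shift()

    # Sum over the opponents present on each date and assign the new columns
    # (groupby(level=...).sum() replaces sum(level=...), which pandas 2.0 removed)
    df['num_comp_past'] = groups.xx.cumcount().groupby(level=[0, 1]).sum()
    df['win_comp_past'] = groups.win.transform(cumsum_shift).groupby(level=[0, 1]).sum()
    df['win_comp_past_difs'] = groups.xx.transform(cumsum_shift).groupby(level=[0, 1]).sum()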

Output:

+------------+------+-----+---------------+---------------+--------------------+
|            |      | xx  | num_comp_past | win_comp_past | win_comp_past_difs |
+------------+------+-----+---------------+---------------+--------------------+
| date       | type |     |               |               |                    |
+------------+------+-----+---------------+---------------+--------------------+
| 2018-01-01 | A    | 1.0 | 0.0           | 0.0           | 0.0                |
|            | B    | 5.0 | 0.0           | 0.0           | 0.0                |
| 2018-02-01 | B    | 3.0 | NaN           | NaN           | NaN                |
| 2018-03-01 | A    | 2.0 | 1.0           | -1.0          | -4.0               |
|            | B    | 7.0 | 1.0           | 1.0           | 4.0                |
|            | C    | 3.0 | 0.0           | 0.0           | 0.0                |
|            | D    | 1.0 | 0.0           | 0.0           | 0.0                |
|            | E    | 6.0 | 0.0           | 0.0           | 0.0                |
| 2018-05-01 | B    | 3.0 | NaN           | NaN           | NaN                |
| 2018-06-01 | A    | 5.0 | 3.0           | -3.0          | -10.0              |
|            | B    | 2.0 | 3.0           | 3.0           | 13.0               |
|            | C    | 3.0 | 2.0           | 0.0           | -3.0               |
| 2018-07-01 | A    | 1.0 | NaN           | NaN           | NaN                |
| 2018-08-01 | B    | 9.0 | 2.0           | 0.0           | 3.0                |
|            | C    | 3.0 | 2.0           | 0.0           | -3.0               |
| 2018-09-01 | A    | 2.0 | 3.0           | -1.0          | -6.0               |
|            | B    | 7.0 | 3.0           | 1.0           | 6.0                |
| 2018-10-01 | C    | 3.0 | 5.0           | -1.0          | -10.0              |
|            | A    | 6.0 | 6.0           | -2.0          | -10.0              |
|            | B    | 8.0 | 7.0           | 3.0           | 20.0               |
| 2018-11-01 | A    | 2.0 | NaN           | NaN           | NaN                |
| 2018-12-01 | B    | 7.0 | 4.0           | 2.0           | 14.0               |
|            | C    | 9.0 | 4.0           | -2.0          | -14.0              |
+------------+------+-----+---------------+---------------+--------------------+
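
The output above shows NaN on dates with a single competitor (for example 2018-02-01), whereas the expected output in the question has 0 there. The following is a minimal, independent cross-check rather than the answer's own method: it assumes a self-merge on `date` to build the pairs, subtracts the current meeting from a per-pair cumulative sum, and then applies `fillna(0)`. The names `pairs`, `type_opp`, `n`, `diff`, `totals` and `result` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

dates = ['Jan-18', 'Jan-18', 'Feb-18', 'Mar-18', 'Mar-18', 'Mar-18', 'Mar-18',
         'Mar-18', 'May-18', 'Jun-18', 'Jun-18', 'Jun-18', 'Jul-18', 'Aug-18',
         'Aug-18', 'Sep-18', 'Sep-18', 'Oct-18', 'Oct-18', 'Oct-18', 'Nov-18',
         'Dec-18', 'Dec-18']
types = list('ABBABCDEBABCABCABCABABC')
xx = [1, 5, 3, 2, 7, 3, 1, 6, 3, 5, 2, 3, 1, 9, 3, 2, 7, 3, 6, 8, 2, 7, 9]

df = pd.DataFrame({'date': pd.to_datetime(dates, format='%b-%y'),
                   'type': types,
                   'xx': np.array(xx, dtype=float)})

# Pair every competitor with every other competitor on the same date.
pairs = df.merge(df, on='date', suffixes=('', '_opp'))
pairs = pairs[pairs['type'] != pairs['type_opp']].copy()
pairs = pairs.sort_values('date', kind='stable')

pairs['n'] = 1                                  # one meeting per row
pairs['diff'] = pairs['xx'] - pairs['xx_opp']   # result difference
pairs['win'] = np.sign(pairs['diff'])           # +1 win, 0 draw, -1 loss

# Per (type, opponent) pair: cumulative sum minus the current meeting
# leaves the total over earlier meetings only.
stats = ['n', 'win', 'diff']
past = pairs.groupby(['type', 'type_opp'])[stats].cumsum() - pairs[stats]
past.columns = ['numb_comp_past', 'win_comp_past', 'win_comp_past_difs']
past['date'] = pairs['date']
past['type'] = pairs['type']

# Sum over the opponents present on each date.
totals = past.groupby(['date', 'type']).sum()

# Dates with a single competitor have no pairs, so the join leaves NaN there;
# fillna(0) turns those into the zeros the question asks for.
result = df.set_index(['date', 'type']).join(totals).fillna(0)
print(result)
```

Under the same assumption, calling `fillna(0)` on the three new columns produced by the answer's code should close the gap in the same way.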