Counting rows with multiple conditions in pandas

Time: 2020-10-12 16:31:02

Tags: python pandas dataframe datetime

I have a pandas DataFrame with the columns "user_id", "datetime" and "action_type", as shown below, and I want to compute the last column (last column = desired output):

import pandas as pd

data = {
    'user_id': list('ddabdacddaaa'),
    'datetime': pd.date_range("20201001", periods=12, freq='H'),
    'action_type': list('XXXWZWKOOXWX'),
    'as_if_X_calculated': list('121021022223'),  # desired output
}
df = pd.DataFrame(data)
df
   user_id            datetime action_type as_if_X_calculated
0        d 2020-10-01 00:00:00           X                  1
1        d 2020-10-01 01:00:00           X                  2
2        a 2020-10-01 02:00:00           X                  1
3        b 2020-10-01 03:00:00           W                  0
4        d 2020-10-01 04:00:00           Z                  2
5        a 2020-10-01 05:00:00           W                  1
6        c 2020-10-01 06:00:00           K                  0
7        d 2020-10-01 07:00:00           O                  2
8        d 2020-10-01 08:00:00           O                  2
9        a 2020-10-01 09:00:00           X                  2
10       a 2020-10-01 10:00:00           W                  2
11       a 2020-10-01 11:00:00           X                  3

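So the last column shows how many times the user has performed action X as of the current record. For user "a", for example, the results in chronological order would be 1-1-2-2-3. How can I count, for each row, the number of X actions by that row's user that occurred at or before the row's datetime?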

P.S. In Excel, this would look like =countifs(A:A; A2; B:B; "<="&B2; C:C; "X") (column A = "user_id")
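To make the COUNTIFS semantics concrete, here is a minimal brute-force sketch that applies the same three conditions row by row. The column name `x_count_bruteforce` is made up for illustration, and the frame is rebuilt here so the snippet runs on its own. It is quadratic in the number of rows, so it is only meant as a reference to check faster solutions against, not as a recommended approach.

```python
import pandas as pd

# Same sample data as in the question, without the expected-output column.
df = pd.DataFrame({
    'user_id': list('ddabdacddaaa'),
    'datetime': pd.date_range("2020-10-01", periods=12, freq=pd.Timedelta(hours=1)),
    'action_type': list('XXXWZWKOOXWX'),
})

# For each row, count rows with the same user, an earlier-or-equal datetime,
# and action_type == 'X' (the three COUNTIFS criteria).
df['x_count_bruteforce'] = [
    int((
        (df['user_id'] == row.user_id)
        & (df['datetime'] <= row.datetime)
        & (df['action_type'] == 'X')
    ).sum())
    for row in df.itertuples(index=False)
]

print(df)
```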

1 answer:

Answer 0: (score: 0)

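If the DataFrame is sorted by datetime, you can create a temporary boolean column for the condition on action_type and take an expanding sum of it within each user_id group (groupby(...).expanding()):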

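# Sort chronologically so the expanding sum only sees earlier-or-equal rows.
df.sort_values('datetime', inplace=True)
# Temporary flag: True where the action is X.
df['dummy'] = df.action_type == 'X'
# Running count of X per user; drop the group level so the index matches df.
df['X_calculated'] = (df.groupby('user_id')['dummy']
                      .expanding().sum()
                      .reset_index(level=0, drop=True)
                      .astype('int'))
# Restore the original row order.
df.sort_index(inplace=True)
print(df.drop(columns='dummy'))
assert df.as_if_X_calculated.astype('int').equals(df.X_calculated), 'X_calculated is not equal'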

Out:

   user_id            datetime action_type as_if_X_calculated  X_calculated
0        d 2020-10-01 00:00:00           X                  1             1
1        d 2020-10-01 01:00:00           X                  2             2
2        a 2020-10-01 02:00:00           X                  1             1
3        b 2020-10-01 03:00:00           W                  0             0
4        d 2020-10-01 04:00:00           Z                  2             2
5        a 2020-10-01 05:00:00           W                  1             1
6        c 2020-10-01 06:00:00           K                  0             0
7        d 2020-10-01 07:00:00           O                  2             2
8        d 2020-10-01 08:00:00           O                  2             2
9        a 2020-10-01 09:00:00           X                  2             2
10       a 2020-10-01 10:00:00           W                  2             2
11       a 2020-10-01 11:00:00           X                  3             3
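As a possible variation on the same idea (not part of the original answer), the temporary column and the expanding window can be replaced by a grouped cumulative sum, which should give the same numbers on this data. This is a sketch that rebuilds the sample frame so it runs on its own. One assumption to be aware of: if a user has several rows with an identical datetime, a running count follows row order, whereas the Excel "<=" criterion would count all tied rows.

```python
import pandas as pd

# Same sample data as in the question.
df = pd.DataFrame({
    'user_id': list('ddabdacddaaa'),
    'datetime': pd.date_range("2020-10-01", periods=12, freq=pd.Timedelta(hours=1)),
    'action_type': list('XXXWZWKOOXWX'),
})

# Sort chronologically, flag the X rows as 0/1, then accumulate per user.
df = df.sort_values('datetime')
is_x = df['action_type'].eq('X').astype(int)
df['X_calculated'] = is_x.groupby(df['user_id']).cumsum()

# Restore the original row order.
df = df.sort_index()
print(df)
```

Grouping by the `user_id` Series directly avoids adding and later dropping a helper column.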