I have a pandas DataFrame with the columns "user_id", "datetime" and "action_type", as shown below, and I want to compute the last column (last column = desired output):
import pandas as pd

data = {'user_id': list('ddabdacddaaa'),
        'datetime': pd.date_range("20201001", periods=12, freq='H'),
        'action_type': list('XXXWZWKOOXWX'),
        # desired output, stored as strings here
        'as_if_X_calculated': list('121021022223')
        }
df = pd.DataFrame(data)
df
user_id datetime action_type as_if_X_calculated
0 d 2020-10-01 00:00:00 X 1
1 d 2020-10-01 01:00:00 X 2
2 a 2020-10-01 02:00:00 X 1
3 b 2020-10-01 03:00:00 W 0
4 d 2020-10-01 04:00:00 Z 2
5 a 2020-10-01 05:00:00 W 1
6 c 2020-10-01 06:00:00 K 0
7 d 2020-10-01 07:00:00 O 2
8 d 2020-10-01 08:00:00 O 2
9 a 2020-10-01 09:00:00 X 2
10 a 2020-10-01 10:00:00 W 2
11 a 2020-10-01 11:00:00 X 3
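So the last column shows how many times the user has performed action X as of the current record. For user "a", for example, the result in chronological order is 1-1-2-2-3. How can I count, for each record, the number of X actions by the same user that occurred at or before the time of that record?
P.S. In Excel, it would look like =countifs(A:A; A2; B:B; "<="&B2; C:C; "X")
(column A = "user_id")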
Answer 0 (score: 0)
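If the DataFrame is sorted by datetime, you can create a temporary boolean column for the condition on action_type and apply an expanding sum within each user_id group (groupby followed by expanding()):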
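# sort chronologically so the expanding window runs in time order
df.sort_values('datetime', inplace=True)
# temporary flag: True where the action is X
df['dummy'] = df.action_type == 'X'
# running count of X per user; drop the group level so the index matches df
df['X_calculated'] = (df.groupby('user_id')['dummy']
                        .expanding().sum()
                        .reset_index(level=0, drop=True)
                        .astype('int'))
# restore the original row order
df.sort_index(inplace=True)
# use the columns keyword; the positional axis argument was removed in pandas 2.0
print(df.drop(columns='dummy'))
assert df.as_if_X_calculated.astype('int').equals(df.X_calculated), 'X_calculated is not equal'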
Out:
user_id datetime action_type as_if_X_calculated X_calculated
0 d 2020-10-01 00:00:00 X 1 1
1 d 2020-10-01 01:00:00 X 2 2
2 a 2020-10-01 02:00:00 X 1 1
3 b 2020-10-01 03:00:00 W 0 0
4 d 2020-10-01 04:00:00 Z 2 2
5 a 2020-10-01 05:00:00 W 1 1
6 c 2020-10-01 06:00:00 K 0 0
7 d 2020-10-01 07:00:00 O 2 2
8 d 2020-10-01 08:00:00 O 2 2
9 a 2020-10-01 09:00:00 X 2 2
10 a 2020-10-01 10:00:00 W 2 2
11 a 2020-10-01 11:00:00 X 3 3
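The same running count can probably be obtained without a temporary column by taking a grouped cumulative sum of the X flag. The following is only a sketch of that alternative, not the method used in the answer above: the column name X_cumsum is made up for illustration, and it assumes each user's timestamps are unique (with tied timestamps, the Excel "<=" condition would count differently).

```python
import pandas as pd

# Same sample data as in the question (without the expected-output column).
df = pd.DataFrame({
    'user_id': list('ddabdacddaaa'),
    'datetime': pd.date_range("2020-10-01", periods=12,
                              freq=pd.Timedelta(hours=1)),
    'action_type': list('XXXWZWKOOXWX'),
})

# Work on a chronologically sorted view; the original index is preserved.
ordered = df.sort_values('datetime')

# 1 where the action is X, 0 otherwise.
is_x = ordered['action_type'].eq('X').astype(int)

# Cumulative sum within each user; assignment aligns on the original index,
# so df keeps its original row order. X_cumsum is a hypothetical column name.
df['X_cumsum'] = is_x.groupby(ordered['user_id']).cumsum()

print(df)
```

Because the result is assigned back by index, there is no need to sort df in place and restore the order afterwards.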