Question

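I have a df that looks like the following:

Each cust_id has one value for every quarter of 2019. I plotted the DataFrame to show each customer's value over time. The plot looks like this:

The red line is the average across all customers for each time period. This is the code I used to create the plot: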

import pandas as pd
import matplotlib.pyplot as plt

# One row per customer, one column per quarter; missing quarters become 0
over_time = df.pivot_table(index='cust_id', columns='date_id', values='trx_unt', fill_value=0)
over_time = over_time.reset_index(level=0)
df_m = pd.melt(over_time, id_vars=['cust_id'])

# Create an average line to compare
df_m['date_id'] = df_m['date_id'].astype('str')
agg = df_m.groupby('date_id')['value'].mean().reset_index()

# Graph each cust as a line and compare to average
fig, ax = plt.subplots(figsize=(20, 10))
for name, group in df_m.groupby('cust_id'):
    group.plot('date_id', y='value', ax=ax, legend=False, color='c')
agg.plot('date_id', y='value', ax=ax, legend=False, color='red')
plt.xticks(rotation=90)
plt.show()

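Goal: I want to find the cust_ids that are above the red line, i.e. above the average, in every time period.

I'm not sure how to approach this. Thanks.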

Answer 1

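Attach the average for each time period to every row, group by customer, flag only the customers whose values are all above the average, and then query on that flag.

It might look something like this: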

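# time_data: one row per time period, holding that period's 'average'
df = df.merge(time_data, on='time', how='left', validate='m:1')
df['gt_mean'] = df['value'] > df['average']
# transform('all') broadcasts the per-customer result back to every row
df['select'] = df.groupby('customer')['gt_mean'].transform('all')

df.query('select == 1')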
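
The same idea can be sketched end to end with the column names from the question (cust_id, date_id, trx_unt). The sample data here is made up for illustration, and the sketch assumes every customer has a row for every period; if some quarters are missing, they would need to be filled first (the question's pivot uses fill_value=0). It computes the per-period average with a grouped transform instead of a merge, which is a substitution for the merge step and not necessarily what the answer intended.

```python
import pandas as pd

# Hypothetical sample data using the question's column names.
df = pd.DataFrame({
    "cust_id": [1, 1, 2, 2, 3, 3],
    "date_id": ["2019Q1", "2019Q2", "2019Q1", "2019Q2", "2019Q1", "2019Q2"],
    "trx_unt": [10, 13, 3, 20, 5, 4],
})

# Average across all customers for each period, broadcast back to every row.
period_mean = df.groupby("date_id")["trx_unt"].transform("mean")

# True where the customer's value is strictly above that period's average.
above = df["trx_unt"] > period_mean

# A customer qualifies only if every one of its periods is above average.
always_above = above.groupby(df["cust_id"]).all()

# List of cust_ids that beat the average in every period.
result = always_above[always_above].index.tolist()
print(result)
```

Using a strict > means a customer sitting exactly on the average in some period is excluded; switch to >= if ties should count.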
