I have a DataFrame in which I want to find, for each ID, the count of rows once a threshold has been reached. For example:
index DEVICE_ID DIFF
0 12 3
1 12 4
2 12 5
3 12 3
4 13 2
5 13 4
6 13 1
7 14 3
8 14 6
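If DIFF is greater than or equal to 4, give me the count of rows for each ID, starting from the index where that first happens. The DataFrame above would therefore give:
{12: 3, 13: 2, 14: 1}. For ID 12, the DIFF column first reaches 4 at index 1, so we count the rows with ID 12 from index 1 through index 3, inclusive.
Apologies for the poorly worded question.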
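To make the expected result concrete, here is a minimal sketch that rebuilds the sample frame and computes the counts with a plain loop. The helper name count_from_first_hit and its threshold parameter are made up for illustration, and the default integer index is assumed to stand in for the index column shown above.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "DEVICE_ID": [12, 12, 12, 12, 13, 13, 13, 14, 14],
    "DIFF":      [3, 4, 5, 3, 2, 4, 1, 3, 6],
})

def count_from_first_hit(frame, threshold=4):
    """For each DEVICE_ID, count rows from the first DIFF >= threshold onward."""
    counts = {}
    for device_id, diffs in frame.groupby("DEVICE_ID")["DIFF"]:
        values = diffs.tolist()
        # Position of the first value at or above the threshold, if any.
        first = next((i for i, v in enumerate(values) if v >= threshold), None)
        counts[int(device_id)] = 0 if first is None else len(values) - first
    return counts

expected = count_from_first_hit(df)
print(expected)  # {12: 3, 13: 2, 14: 1}
```

This is only a readable reference for what is being asked; a vectorized version would be preferable on a large frame.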
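Answer 0 (score: 3)
First compare the column with Series.ge (>=), then group by df['DEVICE_ID'] and take the cumsum, compare the result with Series.gt, and aggregate with sum to count the True values: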
s = df['DIFF'].ge(4).groupby(df['DEVICE_ID']).cumsum().gt(0).astype(int)
out = s.groupby(df['DEVICE_ID']).sum().to_dict()
print (out)
{12: 3, 13: 2, 14: 1}
Details:
print (df['DIFF'].ge(4).groupby(df['DEVICE_ID']).cumsum())
index
0 0.0
1 1.0
2 2.0
3 2.0
4 0.0
5 1.0
6 1.0
7 0.0
8 1.0
Name: DIFF, dtype: float64
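The intermediate steps can be laid out side by side, which may make it clearer why the running count is compared with 0. This is a sketch on the sample data from the question; the names hit, running, flagged and steps are introduced here for illustration, and the boolean column is cast to int before cumsum so the dtype does not depend on the pandas version.

```python
import pandas as pd

df = pd.DataFrame({
    "DEVICE_ID": [12, 12, 12, 12, 13, 13, 13, 14, 14],
    "DIFF":      [3, 4, 5, 3, 2, 4, 1, 3, 6],
})

# True where DIFF is at or above the threshold.
hit = df["DIFF"].ge(4)
# Number of hits seen so far within each device; the count restarts per group.
running = hit.astype(int).groupby(df["DEVICE_ID"]).cumsum()
# Once the running count is positive it never drops back to 0,
# so this flags every row from the first hit onward.
flagged = running.gt(0)

steps = df.assign(hit=hit, running=running, flagged=flagged)
print(steps)

# Summing the booleans per device counts the flagged rows.
out = flagged.groupby(df["DEVICE_ID"]).sum().to_dict()
print(out)  # {12: 3, 13: 2, 14: 1}
```

Row 3 (DIFF of 3 for device 12) is the interesting case: it is below the threshold but still flagged, because the running count for that device is already 2.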
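Another solution sets DEVICE_ID as the index, groups by the index with level=0, and finally sums per index value with groupby(level=0).sum() (the older sum(level=0) spelling was removed in pandas 2.0):
out = (df.set_index(['DEVICE_ID'])['DIFF']
    .ge(4)
    .groupby(level=0)
    .cumsum()
    .gt(0)
    .astype(int)
    .groupby(level=0)
    .sum()
    .to_dict())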
Answer 1 (score: 3)
Using cumprod:
s=df.DIFF.lt(4).astype(int).groupby(df['DEVICE_ID']).cumprod()
s=(1-s).groupby(df['DEVICE_ID']).sum()
s
DEVICE_ID
12 3
13 2
14 1
Name: DIFF, dtype: int32
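As a rough illustration of why cumprod works here: a row gets 1 while DIFF is still below the threshold, and the cumulative product within a device stays 1 only until the first 0 appears, after which it is 0 for the rest of the group. The sketch below uses the sample data from the question; the names below, before_hit and from_hit are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "DEVICE_ID": [12, 12, 12, 12, 13, 13, 13, 14, 14],
    "DIFF":      [3, 4, 5, 3, 2, 4, 1, 3, 6],
})

# 1 where DIFF is below the threshold, 0 where it is at or above it.
below = df["DIFF"].lt(4).astype(int)
# Stays 1 until the first 0 in each device, then is 0 for the rest of the group.
before_hit = below.groupby(df["DEVICE_ID"]).cumprod()
# Invert: 1 for every row from the first hit onward.
from_hit = 1 - before_hit

print(pd.DataFrame({"below": below, "before_hit": before_hit, "from_hit": from_hit}))

counts = from_hit.groupby(df["DEVICE_ID"]).sum()
print(counts.to_dict())  # {12: 3, 13: 2, 14: 1}
```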
Answer 2 (score: 2)
df['T_F']=(df.DIFF>=4)
df[df.T_F != df.T_F.shift(1)].groupby('DEVICE_ID')['DEVICE_ID'].count().to_dict()
{12: 3, 13: 2, 14: 1}
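One thing worth checking before relying on this: as far as I can tell, the expression keeps the rows where T_F differs from the previous row, so it counts change points rather than rows from the first hit onward. On the sample data the two happen to agree. The sketch below compares it with the cumsum approach on a second, made-up frame (other) where every value after the first hit stays above the threshold; the function names are introduced here for illustration only.

```python
import pandas as pd

def count_changes(frame):
    """The expression from this answer, applied to a copy of the frame."""
    frame = frame.copy()
    frame["T_F"] = frame["DIFF"] >= 4
    changed = frame[frame["T_F"] != frame["T_F"].shift(1)]
    return changed.groupby("DEVICE_ID")["DEVICE_ID"].count().to_dict()

def count_from_first_hit(frame):
    """The cumsum approach: flag rows from the first DIFF >= 4 per device."""
    running = frame["DIFF"].ge(4).astype(int).groupby(frame["DEVICE_ID"]).cumsum()
    return running.gt(0).groupby(frame["DEVICE_ID"]).sum().to_dict()

sample = pd.DataFrame({
    "DEVICE_ID": [12, 12, 12, 12, 13, 13, 13, 14, 14],
    "DIFF":      [3, 4, 5, 3, 2, 4, 1, 3, 6],
})
# Hypothetical data: DIFF stays at or above 4 after the first hit.
other = pd.DataFrame({"DEVICE_ID": [12, 12, 12, 12], "DIFF": [3, 4, 5, 6]})

print(count_changes(sample), count_from_first_hit(sample))
print(count_changes(other), count_from_first_hit(other))
```

On other, the change-point count for device 12 comes out as 2, while three rows lie at or after the first hit, so the match on the sample data looks coincidental.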