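Dataframe column count over a threshold

Time: 2019-01-21 15:42:43

Tags: python-3.x pandas dataframe

I have a dataframe in which I want to count, for each ID, the rows from the point where a column first reaches a threshold. For example: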

  index  DEVICE_ID DIFF
   0         12     3
   1         12     4
   2         12     5
   3         12     3
   4         13     2
   5         13     4
   6         13     1
   7         14     3
   8         14     6

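If DIFF is greater than or equal to 4, give me the count of rows for that ID starting from the index where this first happens, so the dataframe above would yield: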

  {12: 3, 13: 2, 14: 1}

For ID 12, DIFF first reaches 4 at index 1, so we count the rows with ID 12 from index 1 through index 3 inclusive, which gives 3.

Sorry for the badly worded question.

3 Answers:

Answer 0: (score: 3)

First compare the column with Series.ge (>=), then group by df['DEVICE_ID'] and use cumsum, compare the result with Series.gt, and aggregate with sum to count the True values:

s = df['DIFF'].ge(4).groupby(df['DEVICE_ID']).cumsum().gt(0).astype(int)

out = s.groupby(df['DEVICE_ID']).sum().to_dict()
print (out)
{12: 3, 13: 2, 14: 1}
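
For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the sample frame from the question and splits the chain into named steps. The variable names (hit, running) are mine, and the cast to int before cumsum is a defensive assumption rather than part of the answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "DEVICE_ID": [12, 12, 12, 12, 13, 13, 13, 14, 14],
    "DIFF": [3, 4, 5, 3, 2, 4, 1, 3, 6],
})

# True where DIFF meets the threshold.
hit = df["DIFF"].ge(4)

# Running count of hits within each device. It becomes positive at the
# first hit and stays positive for the rest of that device's rows.
# The int cast is an assumption to avoid relying on boolean cumsum.
running = hit.astype(int).groupby(df["DEVICE_ID"]).cumsum()

# Count the rows at or after the first hit, per device.
out = running.gt(0).groupby(df["DEVICE_ID"]).sum().to_dict()
print(out)
```

On the sample data this should reproduce the dictionary shown above.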

Details:

print (df['DIFF'].ge(4).groupby(df['DEVICE_ID']).cumsum())
index
0    0.0
1    1.0
2    2.0
3    2.0
4    0.0
5    1.0
6    1.0
7    0.0
8    1.0
Name: DIFF, dtype: float64

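Another solution sets DEVICE_ID as the index, groups by the index with level=0, and finally sums per index value. The original used sum(level=0), which no longer works as of pandas 2.0, so the last step is written as groupby(level=0).sum():

out = (df.set_index(['DEVICE_ID'])['DIFF']
         .ge(4)
         .groupby(level=0)
         .cumsum()
         .gt(0)
         .astype(int)
         .groupby(level=0)
         .sum()
         .to_dict())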

Answer 1: (score: 3)

Using cumprod:

s=df.DIFF.lt(4).astype(int).groupby(df['DEVICE_ID']).cumprod()
s=(1-s).groupby(df['DEVICE_ID']).sum()
s
DEVICE_ID
12    3
13    2
14    1
Name: DIFF, dtype: int32
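
To see why the cumprod trick works, it may help to look at the intermediate values. The sketch below rebuilds the question's sample frame and uses names of my own choosing (below, still_waiting, counts); it is an illustration of the idea, not a different method.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "DEVICE_ID": [12, 12, 12, 12, 13, 13, 13, 14, 14],
    "DIFF": [3, 4, 5, 3, 2, 4, 1, 3, 6],
})

# 1 while DIFF is below the threshold, 0 otherwise.
below = df["DIFF"].lt(4).astype(int)

# The running product within each device drops to 0 at the first row
# that reaches the threshold and can never return to 1 afterwards.
still_waiting = below.groupby(df["DEVICE_ID"]).cumprod()

# Flip the flag so rows at or after the first hit are 1, then add them up.
counts = (1 - still_waiting).groupby(df["DEVICE_ID"]).sum()

print(still_waiting.tolist())
print(counts.to_dict())
```

The integer dtype in the printed Series (int32 above) depends on the platform, so it may show as int64 elsewhere.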

Answer 2: (score: 2)

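Using df.shift() to keep the rows where the DIFF >= 4 flag differs from the previous row, then counting those rows per DEVICE_ID: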

df['T_F']=(df.DIFF>=4)
df[df.T_F != df.T_F.shift(1)].groupby('DEVICE_ID')['DEVICE_ID'].count().to_dict()

{12: 3, 13: 2, 14: 1}
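
One thing worth checking before relying on the shift version: it counts rows where the flag changes, which is not quite the same as counting rows from the first hit onward. The sketch below compares the two on the question's sample and on a second frame that I made up (device 20, where every row meets the threshold). Both helper functions are hypothetical names wrapping the code from the answers.

```python
import pandas as pd

def count_flag_changes(frame, threshold=4):
    """Hypothetical helper wrapping the shift-based count."""
    flag = frame["DIFF"] >= threshold
    # The first row is compared against NaN, so it is always kept.
    changed = flag != flag.shift(1)
    return frame[changed].groupby("DEVICE_ID")["DEVICE_ID"].count().to_dict()

def count_from_first_hit(frame, threshold=4):
    """Hypothetical helper wrapping the cumsum-based count."""
    hit = (frame["DIFF"] >= threshold).astype(int)
    running = hit.groupby(frame["DEVICE_ID"]).cumsum()
    return running.gt(0).groupby(frame["DEVICE_ID"]).sum().to_dict()

# Sample data copied from the question.
sample = pd.DataFrame({
    "DEVICE_ID": [12, 12, 12, 12, 13, 13, 13, 14, 14],
    "DIFF": [3, 4, 5, 3, 2, 4, 1, 3, 6],
})

# Made-up data: a single device whose every row meets the threshold.
all_hits = pd.DataFrame({"DEVICE_ID": [20, 20, 20], "DIFF": [4, 5, 6]})

print(count_flag_changes(sample), count_from_first_hit(sample))
print(count_flag_changes(all_hits), count_from_first_hit(all_hits))
```

The two agree on the sample data but differ on the made-up frame, so the cumsum or cumprod versions look like the safer choice when the data does not follow the sample's pattern.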