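I have a datetime-indexed pandas (pandas==0.23.4) DataFrame df with a column named value_id.
value_id contains groups of float values (5.0 or 6.0) and groups of NaN. I want to count the number of consecutive groups of 5.0 and of 6.0. A group only counts if it contains at least three consecutive values.
For example: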
In [1]: print df.value_id
timestamp
2019-01-06 17:42:08 NaN
2019-01-06 17:45:08 5.0
2019-01-06 17:48:08 5.0
2019-01-06 17:51:08 5.0
2019-01-06 17:54:08 NaN
2019-01-06 17:57:08 NaN
2019-01-06 18:00:08 NaN
2019-01-06 18:03:08 NaN
2019-01-06 18:06:08 NaN
2019-01-06 18:09:08 NaN
2019-01-06 18:12:08 6.0
2019-01-06 18:15:08 6.0
2019-01-06 19:54:09 NaN
2019-01-06 19:57:09 5.0
2019-01-06 20:00:08 5.0
2019-01-06 20:03:08 5.0
2019-01-06 20:06:09 NaN
2019-01-06 20:09:08 NaN
2019-01-06 20:12:08 NaN
2019-01-06 20:15:09 NaN
2019-01-06 20:18:08 NaN
2019-01-06 20:21:09 NaN
2019-01-06 20:24:09 NaN
2019-01-07 19:09:07 NaN
2019-01-07 19:12:06 NaN
2019-01-07 19:15:06 5.0
2019-01-07 19:18:06 5.0
2019-01-07 19:21:07 5.0
2019-01-07 19:24:07 5.0
2019-01-07 19:27:07 NaN
2019-01-07 19:30:07 NaN
2019-01-07 19:33:06 NaN
2019-01-07 19:36:07 NaN
2019-01-07 19:39:07 NaN
2019-01-07 19:42:06 NaN
2019-01-07 19:45:06 NaN
2019-01-07 19:48:06 NaN
2019-01-07 19:51:06 6.0
2019-01-07 19:54:07 6.0
2019-01-07 19:57:06 6.0
Name: value_id, dtype: float64
If I had two variables named count1 (for groups of 5.0 values) and count2 (for groups of 6.0 values), the resulting counts for the example above would be:
count1: 3
count2: 1
Answer 0: (score: 1)
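Maybe not the most elegant, but you can use shift to check that the next two items have the same value and that the previous value is not part of the same group: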
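# Flag the first row of each run: this row and the next two match, the previous row does not.
# The values live in the value_id column (timestamp is the index in the question).
df['fives'] = ((df['value_id'] == 5) & (df['value_id'].shift(-1) == 5)
               & (df['value_id'].shift(-2) == 5)
               & (df['value_id'].shift(1) != 5)).astype(int)
df['sixes'] = ((df['value_id'] == 6) & (df['value_id'].shift(-1) == 6)
               & (df['value_id'].shift(-2) == 6)
               & (df['value_id'].shift(1) != 6)).astype(int)
# Each flagged row marks one qualifying run, so the column sums are the counts.
df[['fives','sixes']].sum()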
fives 3
sixes 1
dtype: int64
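The same shift idea can be wrapped in a small helper so the minimum run length is not hard-coded. This is only a sketch: the function name count_runs_with_shift and the min_len parameter are made up for illustration, and the sample Series reproduces the values from the question on a plain integer index rather than the original timestamps (only the order of the rows matters here).

```python
import numpy as np
import pandas as pd


def count_runs_with_shift(s, value, min_len=3):
    """Count runs of `value` in `s` that are at least `min_len` rows long.

    Hypothetical helper: it flags the first row of each qualifying run,
    the same way the hand-written shift expression does.
    """
    # The previous row must not belong to the run (NaN != value is True,
    # so a run starting on the very first row is handled as well).
    mask = s.shift(1) != value
    # The current row and the next min_len - 1 rows must all equal `value`.
    for k in range(min_len):
        mask &= s.shift(-k) == value
    return int(mask.sum())


# Values copied from the question, in order (index replaced by 0..n-1).
nan = np.nan
values = ([nan] + [5.0] * 3 + [nan] * 6 + [6.0] * 2 + [nan] + [5.0] * 3
          + [nan] * 9 + [5.0] * 4 + [nan] * 8 + [6.0] * 3)
s = pd.Series(values, name="value_id")

count1 = count_runs_with_shift(s, 5.0)
count2 = count_runs_with_shift(s, 6.0)
print(count1, count2)
```

The run of two 6.0 values is skipped because its start row fails the third shift check, and the run of four 5.0 values is counted once because only its first row has a non-matching predecessor.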
Answer 1: (score: 1)
IIUC, use cumsum to create a group key, then we just do value_counts (here s is the Series holding the values):
s.groupby(s.isnull().cumsum()).value_counts().ge(3).sum(level=1)
Out[1026]:
timestamp
5.0 3.0
6.0 1.0
Name: timestamp, dtype: float64
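The one-liner above relies on NaN rows to separate the groups, and on the level argument of sum, which as far as I know is no longer accepted from pandas 2.0 onward. As a hedged alternative, here is a run-length sketch that starts a new group whenever the value changes, so it should also handle a 5.0 run directly followed by a 6.0 run. The function name count_long_runs and the sample Series (the question's values on an integer index) are assumptions for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd


def count_long_runs(s, min_len=3):
    """Return, per value, how many runs of at least `min_len` rows exist.

    Hypothetical helper illustrating a run-length approach.
    """
    # A new run starts whenever a row differs from the previous one.
    # NaN != NaN is True, so every NaN row gets its own run id.
    run_id = (s != s.shift()).cumsum()
    valid = s.notnull()
    grouped = s[valid].groupby(run_id[valid])
    lengths = grouped.size()      # number of rows in each run
    run_value = grouped.first()   # the value each run consists of
    # Keep only long-enough runs, then count them per value.
    return run_value[lengths >= min_len].value_counts().sort_index()


# Values copied from the question, in order (index replaced by 0..n-1).
nan = np.nan
values = ([nan] + [5.0] * 3 + [nan] * 6 + [6.0] * 2 + [nan] + [5.0] * 3
          + [nan] * 9 + [5.0] * 4 + [nan] * 8 + [6.0] * 3)
s = pd.Series(values, name="value_id")

counts = count_long_runs(s)
print(counts)
```

On the question's data this gives three runs for 5.0 and one for 6.0, matching the expected count1 and count2.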