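I am trying to compute how often a state is entered and how long it lasts each time. For example, I have three possible states 1, 2 and 3, and the active state is recorded in a pandas DataFrame: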
test = pd.DataFrame([2,2,2,1,1,1,2,2,2,3,2,2,1,1], index=pd.date_range('00:00', freq='1h', periods=14))
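For example, state 1 is entered twice (at positions 3 and 12); it lasts three hours the first time and two hours the second (so 2.5 hours on average). State 2 is entered 3 times and lasts 2.66 hours on average.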
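As a sanity check on these expected numbers, here is a minimal pure-Python sketch that run-length encodes the same list with `itertools.groupby`. It assumes one sample per hour, so a run of n equal values counts as n hours; the names `runs`, `durations` and `summary` are only illustrative.

```python
from itertools import groupby

states = [2, 2, 2, 1, 1, 1, 2, 2, 2, 3, 2, 2, 1, 1]  # assumed: one sample per hour

# Collapse consecutive equal values into (state, run_length) pairs.
runs = [(state, len(list(group))) for state, group in groupby(states)]

# Collect the run lengths observed for each state.
durations = {}
for state, length in runs:
    durations.setdefault(state, []).append(length)

# state -> (number of times entered, mean duration in hours)
summary = {s: (len(d), sum(d) / len(d)) for s, d in sorted(durations.items())}
print(summary)
```

This ignores the timestamps entirely, so it only works as a check when the samples are evenly spaced.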
I know I can mask out the data I am not interested in, e.g. to analyze state 1:
state1 = test.mask(test!=1)
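But I cannot find a way to proceed from there.
Answer 0: (score: 6)
I hope the comments give enough explanation. The key point is that you can use a custom rolling window function to flag where the state changes, and then use cumsum to group the rows into "clumps" of the same state.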
# set things up
import pandas as pd

freq = "1h"
df = pd.DataFrame(
    [2,2,2,1,1,1,2,2,2,3,2,2,1,1],
    index=pd.date_range('00:00', freq=freq, periods=14)
)
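# add a column flagging rows whose state differs from the row before them,
# i.e. rows that start a new clump (the first row always starts one)
# note: pd.rolling_apply was removed from pandas; use .rolling().apply()
df["is_first"] = (
    df[0].rolling(2).apply(lambda x: float(x[0] != x[1]), raw=True).fillna(1)
)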
# the cumulative sum - each "clump" gets its own integer id
df["value_group"] = df["is_first"].cumsum()
# get the rows corresponding to states beginning
start = df.groupby("value_group", as_index=False).nth(0)
# get the rows corresponding to states ending
end = df.groupby("value_group", as_index=False).nth(-1)
# put the timestamp indexes of the "first" and "last" state measurements into
# their own data frame
start_end = pd.DataFrame(
    {
        "start": start.index,
        # add freq to get when the state ended
        "end": end.index + pd.Timedelta(freq),
        "value": start[0]
    }
)
# convert timedeltas to seconds (float)
start_end["duration"] = (
    (start_end["end"] - start_end["start"]).dt.total_seconds()
)
# get average state length and counts
agg = start_end.groupby("value")["duration"].agg(["mean", "count"])
# convert the mean from seconds to hours
agg["mean"] = agg["mean"] / (60 * 60)
Output:
           mean  count
value
1      2.500000      2
2      2.666667      3
3      1.000000      1
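For comparison, the same clump idea can probably be written more compactly by comparing each row with its shifted neighbour instead of using a rolling window. This is a sketch of an alternative technique, not the code from the answer above: `state_stats` is a hypothetical helper name, and it assumes the series is sampled regularly at `freq` with no gaps, so that a run of n rows lasts n sampling intervals.

```python
import pandas as pd


def state_stats(states: pd.Series, freq: str = "1h") -> pd.DataFrame:
    """Return mean duration (hours) and entry count per state.

    Hypothetical helper; assumes `states` is regularly sampled at `freq`.
    """
    # A new run starts wherever the value differs from the previous row;
    # the cumulative sum of those flags gives each run its own integer id.
    run_id = states.ne(states.shift()).cumsum()
    # One row per run: the state it holds and how many samples it spans.
    runs = states.groupby(run_id).agg(value="first", samples="count")
    # Convert sample counts to hours using the assumed sampling interval.
    hours_per_sample = pd.Timedelta(freq) / pd.Timedelta("1h")
    runs["hours"] = runs["samples"] * hours_per_sample
    return runs.groupby("value")["hours"].agg(["mean", "count"])


states = pd.Series(
    [2, 2, 2, 1, 1, 1, 2, 2, 2, 3, 2, 2, 1, 1],
    index=pd.date_range("00:00", freq="1h", periods=14),
)
stats = state_stats(states)
print(stats)
```

Counting rows per run avoids building separate start and end frames, at the cost of relying on the regular-sampling assumption; with irregular timestamps the start/end approach above is the safer choice.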