I have observations spanning several days, and a customer can be observed on more than one day. Here is my data:
customer_id value timestamp
1 1000 2018-05-28 03:40:00.000
1 1450 2018-05-28 04:40:01.000
1 1040 2018-05-28 05:40:00.000
1 1500 2018-05-29 02:40:00.000
1 1090 2018-05-29 04:40:00.000
3 1060 2018-05-18 03:40:00.000
3 1040 2018-05-18 05:40:00.000
3 1520 2018-05-19 03:40:00.000
3 1490 2018-05-19 04:40:00.000
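Based on my previous question, How do I building dt.hour in 2 days, the first observation for customer 1 is at 2018-05-28 03:40:00.000 and is labelled Day1 - 3. For a different purpose it should be Day1 - 0, with hours counted from each customer's first observation, so the output should be: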
customer_id value timestamp hour
1 1000 2018-05-28 03:40:00.000 Day1 - 0
1 1450 2018-05-28 04:40:01.000 Day1 - 1
1 1040 2018-05-28 05:40:00.000 Day1 - 2
1 1500 2018-05-29 02:40:00.000 Day1 - 23
1 1090 2018-05-29 04:40:00.000 Day2 - 1
3 1060 2018-05-18 03:40:00.000 Day1 - 0
3 1040 2018-05-18 05:40:00.000 Day1 - 2
3 1520 2018-05-19 03:40:00.000 Day2 - 0
3 1490 2018-05-19 04:40:00.000 Day2 - 1
Answer 0 (score: 1)
I think you need to add all the missing hours per customer so that cumcount counts correctly:
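import pandas as pd

# make sure timestamps are datetimes, then floor them to whole hours
df['timestamp'] = pd.to_datetime(df['timestamp']).dt.floor('h')
# add the missing hours per customer; 'value' is selected explicitly so the
# result does not depend on whether apply() keeps the grouping column
df = (df.set_index('timestamp')
        .groupby('customer_id')[['value']]
        .apply(lambda x: x.asfreq('h')))
# cumulative count per customer = hours elapsed since the first observation
df['hour'] = df.groupby(level=0).cumcount()
# drop the filler rows again (this assumes 'value' has no NaN in the original data)
df = df.dropna(subset=['value']).reset_index()
df['hour'] = ('Day' + (df['hour'] // 24).add(1).astype(str) +
              ' - ' + (df['hour'] % 24).astype(str))
print(df)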
customer_id timestamp value hour
0 1 2018-05-28 03:00:00 1000.0 Day1 - 0
1 1 2018-05-28 04:00:00 1450.0 Day1 - 1
2 1 2018-05-28 05:00:00 1040.0 Day1 - 2
3 1 2018-05-29 02:00:00 1500.0 Day1 - 23
4 1 2018-05-29 04:00:00 1090.0 Day2 - 1
5 3 2018-05-18 03:00:00 1060.0 Day1 - 0
6 3 2018-05-18 05:00:00 1040.0 Day1 - 2
7 3 2018-05-19 03:00:00 1520.0 Day2 - 0
8 3 2018-05-19 04:00:00 1490.0 Day2 - 1
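
For comparison, the same labels can probably be obtained without adding the missing hours at all, by subtracting each customer's first floored timestamp and dividing by one hour. This is only a sketch of an alternative, not the method used in the answer above. The helper name label_hours is hypothetical, the sample frame is rebuilt from the question's data, and it assumes the first observation of a customer is the earliest timestamp in that group. Unlike the answer, it leaves the timestamp column unfloored.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "customer_id": [1, 1, 1, 1, 1, 3, 3, 3, 3],
    "value": [1000, 1450, 1040, 1500, 1090, 1060, 1040, 1520, 1490],
    "timestamp": pd.to_datetime([
        "2018-05-28 03:40:00", "2018-05-28 04:40:01", "2018-05-28 05:40:00",
        "2018-05-29 02:40:00", "2018-05-29 04:40:00",
        "2018-05-18 03:40:00", "2018-05-18 05:40:00",
        "2018-05-19 03:40:00", "2018-05-19 04:40:00",
    ]),
})


def label_hours(frame):
    """Hypothetical helper: build 'Day<d> - <h>' labels per customer."""
    # Floor to whole hours so 04:40:01 counts as hour 04:00.
    floored = frame["timestamp"].dt.floor("h")
    # Earliest floored hour of each customer, broadcast back to every row.
    first = floored.groupby(frame["customer_id"]).transform("min")
    # Whole hours elapsed since that customer's first observation.
    elapsed = (floored - first) // pd.Timedelta(hours=1)
    day = (elapsed // 24 + 1).astype(str)
    hour = (elapsed % 24).astype(str)
    return "Day" + day + " - " + hour


df["hour"] = label_hours(df)
print(df)
```

Because no rows are inserted and removed again, value keeps its integer dtype, and two observations falling in the same hour simply receive the same label.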