For each id and date, I am trying to count the rows whose datetime differs from the previous row by exactly 10 seconds.
Data
id timestamp datetime date
1 1496660340 2019-06-05 10:59:00 2019-06-05
1 1496660340 2019-06-05 10:59:10 2019-06-05
1 1496660355 2019-06-05 10:59:40 2019-06-05 <- 30 sec diff from above, so not counted
1 1496655555 2019-06-06 11:58:00 2019-06-06
1 1496666666 2019-06-06 11:58:10 2019-06-06
1 1496666677 2019-06-06 11:58:20 2019-06-06
2 1496655555 2019-06-05 11:58:00 2019-06-05
2 1496666666 2019-06-05 11:58:10 2019-06-05
2 1496666677 2019-06-05 11:58:20 2019-06-05
Data columns (total 4 columns):
id int64
timestamp int64
datetime datetime64[ns]
date object
Desired output
id date num_count
1 2019-06-05 1
1 2019-06-06 2
2 2019-06-05 2
What I have tried
import numpy as np
import pandas as pd
# Get the time difference to the previous row within each (id, date) group, in seconds
df['timediff'] = df.groupby(['id','date'])['datetime'].diff() / np.timedelta64(1, 's')
# Count the number of 10-second differences per group
x = pd.DataFrame(df[df['timediff']==10].groupby(['id','date'],as_index=False)['timediff'].count())
I am not sure whether this is the right approach. Can someone point me in the right direction?
Answer 0 (score: 1)
You can use a custom function with groupby:
def difference_condition(x):
    # Count the rows whose gap to the previous row is exactly 10 seconds
    return x.diff().dt.total_seconds().eq(10).sum()
res = df.groupby(['id', 'date'])['datetime'].apply(difference_condition)
print(res.reset_index(name='count'))
id date count
0 1 2019-06-05 1
1 1 2019-06-06 2
2 2 2019-06-05 2
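A possible alternative, not part of the original answer, avoids apply by computing the per-group diff once and summing a boolean mask. The sketch below is self-contained and rebuilds the question's datetimes inline. The names diffs and is_ten are illustrative, and the date column is derived from datetime here as an assumption.

```python
import pandas as pd

# Sample mirroring the question's data (the timestamp column is not needed here).
df = pd.DataFrame({
    "id": [1, 1, 1, 1, 1, 1, 2, 2, 2],
    "datetime": pd.to_datetime([
        "2019-06-05 10:59:00", "2019-06-05 10:59:10", "2019-06-05 10:59:40",
        "2019-06-06 11:58:00", "2019-06-06 11:58:10", "2019-06-06 11:58:20",
        "2019-06-05 11:58:00", "2019-06-05 11:58:10", "2019-06-05 11:58:20",
    ]),
})
# Assumption: derive the calendar date from the datetime column.
df["date"] = df["datetime"].dt.normalize()

# Difference to the previous row within each (id, date) group;
# the first row of every group has no predecessor and yields NaT.
diffs = df.groupby(["id", "date"])["datetime"].diff()

# True where the gap is exactly 10 seconds (NaT compares as False).
is_ten = diffs.eq(pd.Timedelta(seconds=10))

# Summing booleans per group counts the matches; groups with no match keep a 0.
res = (
    is_ten.groupby([df["id"], df["date"]])
    .sum()
    .reset_index(name="count")
)
print(res)
```

For the question's data this gives counts of 1, 2 and 2, matching the desired output. Unlike filtering on timediff == 10 before grouping, a group with no 10-second gaps would still appear, with a count of 0.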
Setup
import pandas as pd
from io import StringIO
x = """id|timestamp|datetime|date
1 |1496660340 |2019-06-05 10:59:00 |2019-06-05
1 |1496660340 |2019-06-05 10:59:10 |2019-06-05
1 |1496660355 |2019-06-05 10:59:40 |2019-06-05
1 |1496655555 |2019-06-06 11:58:00 |2019-06-06
1 |1496666666 |2019-06-06 11:58:10 |2019-06-06
1 |1496666677 |2019-06-06 11:58:20 |2019-06-06
2 |1496655555 |2019-06-05 11:58:00 |2019-06-05
2 |1496666666 |2019-06-05 11:58:10 |2019-06-05
2 |1496666677 |2019-06-05 11:58:20 |2019-06-05"""
df = pd.read_csv(StringIO(x), sep='|')
df[['datetime', 'date']] = df[['datetime', 'date']].apply(pd.to_datetime)
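The setup text pads each field with a space before the pipe. If that padding causes trouble when parsing, one option is a regular-expression separator that swallows the whitespace around each pipe. This is a minimal sketch under that assumption, using a shortened three-row sample; the variable name raw is illustrative.

```python
from io import StringIO

import pandas as pd

# Shortened sample in the same padded, pipe-delimited layout as the setup above.
raw = """id|timestamp|datetime|date
1 |1496660340 |2019-06-05 10:59:00 |2019-06-05
1 |1496660340 |2019-06-05 10:59:10 |2019-06-05
2 |1496655555 |2019-06-05 11:58:00 |2019-06-05"""

# A regex separator requires the Python parsing engine; \s* on both sides
# strips the padding around each pipe so fields arrive without stray spaces.
df = pd.read_csv(
    StringIO(raw),
    sep=r"\s*\|\s*",
    engine="python",
    parse_dates=["datetime", "date"],
)
print(df.dtypes)
```

The Python engine is slower than the default C engine, so this is mainly worth it for small, hand-pasted samples like this one.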