First of all, I apologize if the title is too vague.
I have a pd.DataFrame with datetime64 as the dtype of its index. However, the index values are not equally spaced: the interval between them is usually one minute, but occasionally it is something else, such as two minutes.
Suppose I have the pd.DataFrame:
import pandas as pd
import numpy as np

df = pd.DataFrame({'date': ['2018-11-28 13:59:00', '2018-11-28 14:00:00',
                            '2018-11-28 14:01:00', '2018-11-28 14:02:00',
                            '2018-11-28 14:03:00', '2018-11-28 14:05:00',
                            '2018-11-28 14:06:00', '2018-11-28 14:07:00',
                            '2018-11-28 14:08:00', '2018-11-28 14:09:00'],
                   'count': np.random.randint(1, 100, 10)})
datetime_index = pd.to_datetime(df['date'])
df = df.set_index(datetime_index).drop('date', axis=1)
df.sort_index(inplace=True)
so that df is:
count
date
2018-11-28 13:59:00 14
2018-11-28 14:00:00 30
2018-11-28 14:01:00 2
2018-11-28 14:02:00 42
2018-11-28 14:03:00 51  <<< two-minute gap
2018-11-28 14:05:00 41  <<< unlike the others
2018-11-28 14:06:00 48
2018-11-28 14:07:00 4
2018-11-28 14:08:00 50
2018-11-28 14:09:00 93
My goal is to split df into chunks, each of which keeps an uninterrupted one-minute frequency. So the expected result for the example above becomes:
#df0
count
date
2018-11-28 13:59:00 14
2018-11-28 14:00:00 30
2018-11-28 14:01:00 2
2018-11-28 14:02:00 42
2018-11-28 14:03:00 51
#df1
count
date
2018-11-28 14:05:00 41
2018-11-28 14:06:00 48
2018-11-28 14:07:00 4
2018-11-28 14:08:00 50
2018-11-28 14:09:00 93
I tried Split a series on time gaps in pandas?, but unfortunately it is outdated and does not serve my purpose.
I did manage to get the required result for the example above, but the real dataframe is much larger and has many more gaps, which makes the following solution extremely inefficient:
# time difference between consecutive index values
df['diff'] = pd.Series(df.index).diff().values
dif = pd.Series(df.index).diff()
# locate the first gap of exactly two minutes (120000000000 ns)
gap_index = dif[dif == pd.to_timedelta(120000000000)].index[0]
# slice the frame into the part before and the part after that gap
df[:gap_index], df[gap_index:]
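For what it's worth, a rough sketch of generalizing this diff idea to every gap at once (assuming the df and imports above; gap_positions, bounds and chunks are just illustrative names) would be something like the following, though I am not sure it is the right approach:
# positions where the gap to the previous timestamp exceeds one minute
gap_positions = np.flatnonzero(pd.Series(df.index).diff() > pd.Timedelta(minutes=1))
# split the frame at those positions; each piece is one-minute spaced internally
bounds = [0] + list(gap_positions) + [len(df)]
chunks = [df.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]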
Any insight into this problem would be greatly appreciated.
Answer 0 (score: 1)
Here is a quick and dirty solution:
import pandas as pd
import numpy as np

df = pd.DataFrame({'date': ['2018-11-28 13:59:00', '2018-11-28 14:00:00',
                            '2018-11-28 14:01:00', '2018-11-28 14:02:00',
                            '2018-11-28 14:03:00', '2018-11-28 14:05:00',
                            '2018-11-28 14:06:00', '2018-11-28 14:07:00',
                            '2018-11-28 14:08:00', '2018-11-28 14:09:00'],
                   'count': np.random.randint(1, 100, 10)})
df['date'] = pd.to_datetime(df['date'])
df.sort_values('date', inplace=True)  # sort chronologically

# calculate where to cut: True where the gap to the next row exceeds one minute,
# then shift down by one so the flag sits on the row that starts a new chunk
df['cut_point'] = pd.to_datetime(df.date.shift(-1))[0:len(df)-1] - df.date[0:len(df)-1] > '00:01:00'
df['cut_point'] = df['cut_point'].shift(1)

# generate chunks by walking the rows and flushing at every cut point
res = []
chunk = []
for i, row in df.iterrows():
    date = row['date']
    count = row['count']
    cut_point = row['cut_point']
    if cut_point == True:
        res.append(pd.DataFrame(chunk))
        del chunk[:]
        chunk.append({'date': date, 'count': count})
    else:
        chunk.append({'date': date, 'count': count})
res.append(pd.DataFrame(chunk))

print(res[0])
print(res[1])
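If the chunks should carry the datetime index like the original frame, each piece in res can be reindexed afterwards; a small sketch (chunks is just an illustrative name):
chunks = [piece.set_index('date') for piece in res]
print(chunks[0])
print(chunks[1])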
Answer 1 (score: 1)
If you are interested in creating a dictionary holding all the individual dataframes, this should do it:
df['identifier'] = (~df.index.to_series().diff().dt.seconds.div(60, fill_value=0).lt(2)).cumsum()
count identifier
date
2018-11-28 13:59:00 7 0
2018-11-28 14:00:00 49 0
2018-11-28 14:01:00 13 0
2018-11-28 14:02:00 47 0
2018-11-28 14:03:00 72 0
2018-11-28 14:05:00 33 1
2018-11-28 14:06:00 50 1
2018-11-28 14:07:00 10 1
2018-11-28 14:08:00 86 1
2018-11-28 14:09:00 40 1
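To unpack that one-liner step by step (a sketch over the same df; the intermediate names gap_minutes and within_two_minutes are only for illustration):
# minutes elapsed since the previous row; the missing value on the first row is filled with 0
gap_minutes = df.index.to_series().diff().dt.seconds.div(60, fill_value=0)
# True where a row follows its predecessor by less than two minutes
within_two_minutes = gap_minutes.lt(2)
# a new group starts wherever that is False, so negate and take a running sum
df['identifier'] = (~within_two_minutes).cumsum()
Note that .dt.seconds gives only the seconds component of each timedelta; if gaps could span whole days, .dt.total_seconds() would be the safer choice.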
After this, create the dictionary and append the groups to it:
d = {}
for i, grp in df.groupby('identifier'):
    d.update(dict([('df_' + str(i), grp)]))

print(d)
Output:
{'df_0': count identifier
date
2018-11-28 13:59:00 7 0
2018-11-28 14:00:00 49 0
2018-11-28 14:01:00 13 0
2018-11-28 14:02:00 47 0
2018-11-28 14:03:00 72 0,
'df_1': count identifier
date
2018-11-28 14:05:00 33 1
2018-11-28 14:06:00 50 1
2018-11-28 14:07:00 10 1
2018-11-28 14:08:00 86 1
2018-11-28 14:09:00 40 1}
You can then look at the data by calling the dict key:
print(d['df_1'])
count identifier
date
2018-11-28 14:05:00 95 1
2018-11-28 14:06:00 68 1
2018-11-28 14:07:00 19 1
2018-11-28 14:08:00 9 1
2018-11-28 14:09:00 61 1
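If a plain list of frames is preferred over the dict, the same groupby can feed a list comprehension; a sketch (frames is an illustrative name, with the helper identifier column dropped from each piece):
frames = [grp.drop(columns='identifier') for _, grp in df.groupby('identifier')]
print(frames[1])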