Splitting a DataFrame on gaps in its datetime index

Date: 2019-01-28 12:00:03

Tags: python pandas dataframe

First of all, I apologize if the title is too vague.

I have a pd.DataFrame whose index dtype is datetime64. However, the index values are not evenly spaced: they are usually one minute apart, but occasionally there are larger gaps, such as two minutes.

Suppose I have a pd.DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'date': ['2018-11-28 13:59:00', '2018-11-28 14:00:00',
               '2018-11-28 14:01:00', '2018-11-28 14:02:00',
               '2018-11-28 14:03:00', '2018-11-28 14:05:00',
               '2018-11-28 14:06:00', '2018-11-28 14:07:00',
               '2018-11-28 14:08:00', '2018-11-28 14:09:00'],
                   'count': np.random.randint(1, 100, 10)})
datetime_index = pd.to_datetime(df['date'])
df = df.set_index(datetime_index).drop(columns='date')
df.sort_index(inplace=True)

so that df becomes:

    count
date    
2018-11-28 13:59:00 14
2018-11-28 14:00:00 30
2018-11-28 14:01:00 2
2018-11-28 14:02:00 42
2018-11-28 14:03:00 51   <<< two-minute gap
2018-11-28 14:05:00 41   <<< unlike the others
2018-11-28 14:06:00 48
2018-11-28 14:07:00 4
2018-11-28 14:08:00 50
2018-11-28 14:09:00 93

My goal is to split df into chunks, each of which runs at a sustained one-minute frequency. The expected result for the example above would therefore be:

#df0
    count
date    
2018-11-28 13:59:00 14
2018-11-28 14:00:00 30
2018-11-28 14:01:00 2
2018-11-28 14:02:00 42
2018-11-28 14:03:00 51
#df1
    count
date   
2018-11-28 14:05:00 41
2018-11-28 14:06:00 48
2018-11-28 14:07:00 4
2018-11-28 14:08:00 50
2018-11-28 14:09:00 93

I tried Split a series on time gaps in pandas?, but sadly it is outdated and did not serve my purpose.

I did make it work for the example above, but the real DataFrame is much larger and contains many more gaps, which makes the following solution extremely inefficient:

dif = pd.Series(df.index).diff()
gap_index = dif[dif == pd.Timedelta(minutes=2)].index[0]
df[:gap_index], df[gap_index:]
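A vectorized generalization of the same idea would locate every oversized gap at once and slice between those positions, instead of handling one gap at a time. A sketch, assuming a sorted DatetimeIndex and with fixed counts so it is self-contained (`breaks` and `chunks` are placeholder names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'count': [14, 30, 2, 42, 51, 41, 48, 4, 50, 93]},
    index=pd.to_datetime(['2018-11-28 13:59:00', '2018-11-28 14:00:00',
                          '2018-11-28 14:01:00', '2018-11-28 14:02:00',
                          '2018-11-28 14:03:00', '2018-11-28 14:05:00',
                          '2018-11-28 14:06:00', '2018-11-28 14:07:00',
                          '2018-11-28 14:08:00', '2018-11-28 14:09:00']))

# positions whose gap to the previous timestamp exceeds one minute
breaks = np.flatnonzero(df.index.to_series().diff() > pd.Timedelta(minutes=1))

# slice between consecutive break positions (plus the two ends)
chunks = [df.iloc[start:stop]
          for start, stop in zip(np.r_[0, breaks], np.r_[breaks, len(df)])]
```

This touches the index only once, so the cost no longer grows with the number of gaps handled one by one.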

Any insight into this problem would be much appreciated.

2 answers:

Answer 0 (score: 1)

Here is a quick and dirty solution:

import pandas as pd
import numpy as np

df = pd.DataFrame({'date': ['2018-11-28 13:59:00', '2018-11-28 14:00:00',
           '2018-11-28 14:01:00', '2018-11-28 14:02:00',
           '2018-11-28 14:03:00', '2018-11-28 14:05:00',
           '2018-11-28 14:06:00', '2018-11-28 14:07:00',
           '2018-11-28 14:08:00', '2018-11-28 14:09:00'],
               'count': np.random.randint(1, 100, 10)})

df['date'] = pd.to_datetime(df['date'])
df.sort_values('date', inplace=True)

# calculate where to cut: a row starts a new chunk when the gap
# to the previous row exceeds one minute
df['cut_point'] = df['date'].diff() > pd.Timedelta(minutes=1)

# generate chunks
res = []
chunk = []

for i, row in df.iterrows():
    if row['cut_point']:
        # a gap starts at this row: close the current chunk, begin a new one
        res.append(pd.DataFrame(chunk))
        chunk = []
    chunk.append({'date': row['date'], 'count': row['count']})

res.append(pd.DataFrame(chunk))

print(res[0])

print(res[1])
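The explicit row loop above can also be collapsed into a vectorized groupby: cumulative-summing the cut points gives each gap-free run its own label, and grouping by that label yields the chunks directly. A sketch of that variant (the names `new_run` and `res` are my own; the DataFrame is rebuilt here so the snippet is self-contained):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(
        ['2018-11-28 13:59:00', '2018-11-28 14:00:00', '2018-11-28 14:01:00',
         '2018-11-28 14:02:00', '2018-11-28 14:03:00', '2018-11-28 14:05:00',
         '2018-11-28 14:06:00', '2018-11-28 14:07:00', '2018-11-28 14:08:00',
         '2018-11-28 14:09:00']),
    'count': np.random.randint(1, 100, 10)})

# True wherever a new run starts (gap to the previous row > 1 minute)
new_run = df['date'].diff() > pd.Timedelta(minutes=1)

# cumsum turns the run starts into group labels 0, 1, 2, ...
res = [grp.drop(columns='group')
       for _, grp in df.assign(group=new_run.cumsum()).groupby('group')]
```

This avoids iterrows entirely, which matters on the larger real DataFrame mentioned in the question.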

Answer 1 (score: 1)

If you are interested in creating a dictionary that holds all the individual DataFrames, this should do it:

# a new identifier starts wherever the gap to the previous timestamp exceeds one minute
df['identifier'] = (df.index.to_series().diff() > pd.Timedelta(minutes=1)).cumsum()

                     count  identifier
date                                  
2018-11-28 13:59:00      7           0
2018-11-28 14:00:00     49           0
2018-11-28 14:01:00     13           0
2018-11-28 14:02:00     47           0
2018-11-28 14:03:00     72           0
2018-11-28 14:05:00     33           1
2018-11-28 14:06:00     50           1
2018-11-28 14:07:00     10           1
2018-11-28 14:08:00     86           1
2018-11-28 14:09:00     40           1

After this, create the dictionary and append the groups:

d = {}
for i, grp in df.groupby('identifier'):
    d['df_' + str(i)] = grp
print(d)

Output:

{'df_0':                      count  identifier
 date
 2018-11-28 13:59:00      7           0
 2018-11-28 14:00:00     49           0
 2018-11-28 14:01:00     13           0
 2018-11-28 14:02:00     47           0
 2018-11-28 14:03:00     72           0,
 'df_1':                      count  identifier
 date                                  
 2018-11-28 14:05:00     33           1
 2018-11-28 14:06:00     50           1
 2018-11-28 14:07:00     10           1
 2018-11-28 14:08:00     86           1
 2018-11-28 14:09:00     40           1}

You can then view the data by calling the dict keys:

print(d['df_1'])
                     count  identifier
date                                  
2018-11-28 14:05:00     95           1
2018-11-28 14:06:00     68           1
2018-11-28 14:07:00     19           1
2018-11-28 14:08:00      9           1
2018-11-28 14:09:00     61           1
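The labelling and the dictionary construction can also be fused into a single comprehension, grouping directly by the gap-label Series without adding an extra column to df. A sketch under the same setup (the `df_` key naming follows the answer above; `ident` is a name I made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'count': np.random.randint(1, 100, 10)},
                  index=pd.to_datetime(
                      ['2018-11-28 13:59:00', '2018-11-28 14:00:00',
                       '2018-11-28 14:01:00', '2018-11-28 14:02:00',
                       '2018-11-28 14:03:00', '2018-11-28 14:05:00',
                       '2018-11-28 14:06:00', '2018-11-28 14:07:00',
                       '2018-11-28 14:08:00', '2018-11-28 14:09:00']))

# group label increments at every gap larger than one minute
ident = (df.index.to_series().diff() > pd.Timedelta(minutes=1)).cumsum()

# groupby accepts an index-aligned Series, so no helper column is needed
d = {'df_' + str(i): grp for i, grp in df.groupby(ident)}
```

Because the label Series is aligned on the index, df itself stays untouched, which avoids having to drop an identifier column from each chunk afterwards.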