Question

I have a time-series DataFrame df that looks like this (all the observations fall on the same day, but at different times):

                                id               val 
 time                    
2014-04-03 16:01:53             23              14389      
2014-04-03 16:01:54             28              14391             
2014-04-03 16:05:55             24              14393             
2014-04-03 16:06:25             23              14395             
2014-04-03 16:07:01             23              14395             
2014-04-03 16:10:09             23              14395             
2014-04-03 16:10:23             26              14397             
2014-04-03 16:10:57             26              14397             
2014-04-03 16:11:10             26              14397

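I need to create a group for every 5 minutes, starting from 16:00:00. That is, all rows in the range 16:00:00 to 16:05:00 get the value 1 in a new column period. (The number of rows in each group is irregular, so I can't simply slice the frame into fixed-size groups.)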

In the end, the data should look like this:

                                id               val           period 
time            
2014-04-03 16:01:53             23              14389             1
2014-04-03 16:01:54             28              14391             1
2014-04-03 16:05:55             24              14393             2
2014-04-03 16:06:25             23              14395             2
2014-04-03 16:07:01             23              14395             2
2014-04-03 16:10:09             23              14395             3
2014-04-03 16:10:23             26              14397             3
2014-04-03 16:10:57             26              14397             3
2014-04-03 16:11:10             26              14397             3

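The goal is to perform some groupby operations, but the operation I need is not covered by the pd.resample(how=' ') method. So I have to create a period column that identifies each group and then run df.groupby('period').apply(myfunc).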

Any help or comments would be much appreciated.

Thanks!

Answer 1

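You can use the groupby/apply functionality with TimeGrouper. With TimeGrouper, you don't need to create the period column at all. I know you are not trying to compute the mean, but I'll use it as an example: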

>>> df.groupby(pd.TimeGrouper('5Min'))['val'].mean()

time
2014-04-03 16:00:00    14390.000000
2014-04-03 16:05:00    14394.333333
2014-04-03 16:10:00    14396.500000

Or an example with an explicit apply:

>>> df.groupby(pd.TimeGrouper('5Min'))['val'].apply(lambda x: len(x) > 3)

time
2014-04-03 16:00:00    False
2014-04-03 16:05:00    False
2014-04-03 16:10:00     True
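
For readers on a newer pandas, here is a minimal sketch of the same two calls. It assumes that pd.Grouper(freq=...) is available in place of pd.TimeGrouper (which later pandas releases dropped), and it rebuilds the question's sample frame by hand so that the snippet runs on its own.

```python
import pandas as pd

# Rebuild the sample frame from the question (values copied from the post).
times = pd.to_datetime([
    "2014-04-03 16:01:53", "2014-04-03 16:01:54", "2014-04-03 16:05:55",
    "2014-04-03 16:06:25", "2014-04-03 16:07:01", "2014-04-03 16:10:09",
    "2014-04-03 16:10:23", "2014-04-03 16:10:57", "2014-04-03 16:11:10",
])
df = pd.DataFrame(
    {
        "id": [23, 28, 24, 23, 23, 23, 26, 26, 26],
        "val": [14389, 14391, 14393, 14395, 14395, 14395, 14397, 14397, 14397],
    },
    index=times,
)
df.index.name = "time"

# Assumption: pd.Grouper(freq=...) stands in for the older pd.TimeGrouper.
grouped = df.groupby(pd.Grouper(freq="5min"))["val"]

means = grouped.mean()                       # mean of val per 5-minute bin
flags = grouped.apply(lambda x: len(x) > 3)  # True where a bin has more than 3 rows

print(means)
print(flags)
```

The bins are labelled by their left edge (16:00, 16:05, 16:10), as in the outputs above.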

The docstring for TimeGrouper:

Docstring for resample:class TimeGrouper@21

TimeGrouper(self, freq = 'Min', closed = None, label = None,
how = 'mean', nperiods = None, axis = 0, fill_method = None,
limit = None, loffset = None, kind = None, convention = None, base = 0,
**kwargs)

Custom groupby class for time-interval grouping

Parameters
----------
freq : pandas date offset or offset alias for identifying bin edges
closed : closed end of interval; left or right
label : interval boundary to use for labeling; left or right
nperiods : optional, integer
convention : {'start', 'end', 'e', 's'}
    If axis is PeriodIndex

Notes
-----
Use begin, end, nperiods to generate intervals that cannot be derived
directly from the associated object

Edit

I don't know of an elegant way to create the period column, but the following works:

>>> new = df.groupby(pd.TimeGrouper('5Min'),as_index=False).apply(lambda x: x['val'])
>>> df['period'] = new.index.get_level_values(0)
>>> df
                     id    val  period
time
2014-04-03 16:01:53  23  14389       0
2014-04-03 16:01:54  28  14391       0
2014-04-03 16:05:55  24  14393       1
2014-04-03 16:06:25  23  14395       1
2014-04-03 16:07:01  23  14395       1
2014-04-03 16:10:09  23  14395       2
2014-04-03 16:10:23  26  14397       2
2014-04-03 16:10:57  26  14397       2
2014-04-03 16:11:10  26  14397       2

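This works because, with as_index=False, the groupby here actually returns the period column you want as one level of a MultiIndex. I just take that level of the MultiIndex and assign it to a new column in the original DataFrame. You could do anything inside the apply; I only wanted the index: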

>>> new
   time
0  2014-04-03 16:01:53    14389
   2014-04-03 16:01:54    14391
1  2014-04-03 16:05:55    14393
   2014-04-03 16:06:25    14395
   2014-04-03 16:07:01    14395
2  2014-04-03 16:10:09    14395
   2014-04-03 16:10:23    14397
   2014-04-03 16:10:57    14397
   2014-04-03 16:11:10    14397
>>> new.index.get_level_values(0)
Int64Index([0, 0, 1, 1, 1, 2, 2, 2, 2], dtype='int64')
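
As a possible alternative to the MultiIndex approach, the period number can also be computed arithmetically from the timestamps. The sketch below is not part of the original answer: it assumes the bins start at 16:00:00 on the day in question (the `start` value is hard-coded for illustration), numbers periods from 1 as the question asks, and uses a hypothetical max-minus-min lambda as a stand-in for myfunc.

```python
import numpy as np
import pandas as pd

# Rebuild the sample frame from the question (values copied from the post).
times = pd.to_datetime([
    "2014-04-03 16:01:53", "2014-04-03 16:01:54", "2014-04-03 16:05:55",
    "2014-04-03 16:06:25", "2014-04-03 16:07:01", "2014-04-03 16:10:09",
    "2014-04-03 16:10:23", "2014-04-03 16:10:57", "2014-04-03 16:11:10",
])
df = pd.DataFrame(
    {
        "id": [23, 28, 24, 23, 23, 23, 26, 26, 26],
        "val": [14389, 14391, 14393, 14395, 14395, 14395, 14397, 14397, 14397],
    },
    index=times,
)
df.index.name = "time"

# Assumption: the first bin opens at 16:00:00 and every bin is 5 minutes wide.
start = pd.Timestamp("2014-04-03 16:00:00")
step = pd.Timedelta(minutes=5)

# Whole number of 5-minute steps elapsed since `start`, shifted to start at 1.
df["period"] = np.asarray((df.index - start) // step) + 1

# Hypothetical stand-in for myfunc: the spread of val within each period.
spread = df.groupby("period")["val"].apply(lambda s: s.max() - s.min())

print(df)
print(spread)
```

Unlike the 0-based labels taken from the MultiIndex, this numbering depends only on the elapsed time, so a 5-minute window with no rows simply leaves a gap in the period numbers.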

Answer 2

If I understand the question correctly, this can be done easily with the resample method alone.

#Get some data
import numpy as np
import pandas as pd

index = pd.date_range(start='2013-01-01 00:00', end='2013-01-31 00:00', freq='min')
a = np.random.randint(20, high=30, size=(len(index),1))
b = np.random.randint(14440, high=14449, size=(len(index),1))
df = pd.DataFrame(np.concatenate((a,b), axis=1), index=index, columns=['id','val'])
df.head()


Out[34]:
                     id  val
2013-01-01 00:00:00  20  14446
2013-01-01 00:01:00  25  14443
2013-01-01 00:02:00  25  14448
2013-01-01 00:03:00  20  14445
2013-01-01 00:04:00  28  14442

#Define function for variance
import numpy as np
def pyfun(X):
    # The sample variance is undefined for fewer than two observations
    if X.shape[0] <= 1:
        result = np.nan

    else:    
        total = 0
        for x in X:
            total = total + x
        mean = float(total) / X.shape[0]

        total = 0
        for x in X:
            total = total + (mean-x)**2
        result = float(total) / (X.shape[0]-1)

    return result

#Try it out
df.resample('5min', how=pyfun)


Out[53]:
                     id val
2013-01-01 00:00:00  12.3    5.7
2013-01-01 00:05:00  9.3     7.3
2013-01-01 00:10:00  4.7     0.8
2013-01-01 00:15:00  10.8    10.3
2013-01-01 00:20:00  11.5    1.5

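That was easy. This was for your own function, but if you want to use a function from a library, all you need to do is pass that function in the how keyword: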

df.resample('5min', how=np.var).head()


Out[54]:
                     id val
2013-01-01 00:00:00  12.3    5.7
2013-01-01 00:05:00  9.3     7.3
2013-01-01 00:10:00  4.7     0.8
2013-01-01 00:15:00  10.8    10.3
2013-01-01 00:20:00  11.5    1.5
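
The calls above rely on the how keyword of resample, which more recent pandas releases no longer accept. A minimal sketch of the equivalent pattern on such versions, assuming the method-chaining form of resample, is shown below. The small deterministic series and the helper name sample_variance are made up for illustration and are not from the original answer.

```python
import numpy as np
import pandas as pd

# Illustrative data: ten one-minute observations with values 0..9.
index = pd.date_range(start="2013-01-01 00:00", periods=10, freq="min")
s = pd.Series(np.arange(10, dtype=float), index=index, name="val")


def sample_variance(x):
    """Sample variance (n - 1 in the denominator) of one resampling bin."""
    if len(x) <= 1:
        return np.nan
    return ((x - x.mean()) ** 2).sum() / (len(x) - 1)


# Custom function: call resample first, then apply the function to each bin.
custom = s.resample("5min").apply(sample_variance)

# Library function: the built-in var aggregation on the same bins.
builtin = s.resample("5min").var()

print(custom)
print(builtin)
```

Both results contain one value per 5-minute bin, labelled by the start of the bin.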

How do I create a group ID based on 5-minute intervals in a pandas time series?
