How to resample a df with a datetime index into exactly n equally sized periods?

Time: 2015-07-03 17:54:12

Tags: python pandas

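I have a large dataframe with a datetime index and need to resample the data into exactly 10 equally sized periods.

So far, I have tried finding the first and last dates to determine the total number of days in the data, dividing that by 10 to get the size of each period, and then resampling using that number of days. For example: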

first = df.reset_index().timesubmit.min()
last = df.reset_index().timesubmit.max()
periodsize = str((last-first).days // 10) + 'D'  # integer division: whole days only

df.resample(periodsize).sum()

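This does not guarantee exactly 10 periods in the df after resampling, because the period size is an int that has been rounded down. Using a float does not work in resample. It seems that either I am missing something simple here, or I am attacking the problem all wrong.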
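To make the shortfall concrete, here is a minimal sketch on made-up data. The frame `demo` and its 33 daily rows of ones are assumptions for illustration, not the asker's data. Flooring the period to whole days produces 11 groups instead of the 10 requested:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: 33 daily rows of ones.
index = pd.date_range('2000-01-01', periods=33, freq='D')
demo = pd.DataFrame({'value': np.ones(33)}, index=index)

span_days = (demo.index.max() - demo.index.min()).days   # 32 days between first and last
period_days = span_days // 10                            # floored to 3 whole days
result = demo.resample('{}D'.format(period_days)).sum()

print(len(result))   # 11 groups rather than 10
```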

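2 Answers:

Answer 0 (score: 1)

Here is one way to ensure equal-size sub-periods: use np.linspace() on time offsets measured with pd.Timedelta, then classify each observation into a bin using pd.cut.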

import pandas as pd
import numpy as np

# generate artificial data
np.random.seed(0)
df = pd.DataFrame(np.random.randn(100, 2), columns=['A', 'B'], index=pd.date_range('2015-01-01 00:00:00', periods=100, freq='8H'))

Out[87]: 
                          A       B
2015-01-01 00:00:00  1.7641  0.4002
2015-01-01 08:00:00  0.9787  2.2409
2015-01-01 16:00:00  1.8676 -0.9773
2015-01-02 00:00:00  0.9501 -0.1514
2015-01-02 08:00:00 -0.1032  0.4106
2015-01-02 16:00:00  0.1440  1.4543
2015-01-03 00:00:00  0.7610  0.1217
2015-01-03 08:00:00  0.4439  0.3337
2015-01-03 16:00:00  1.4941 -0.2052
2015-01-04 00:00:00  0.3131 -0.8541
2015-01-04 08:00:00 -2.5530  0.6536
2015-01-04 16:00:00  0.8644 -0.7422
2015-01-05 00:00:00  2.2698 -1.4544
2015-01-05 08:00:00  0.0458 -0.1872
2015-01-05 16:00:00  1.5328  1.4694
...                     ...     ...
2015-01-29 08:00:00  0.9209  0.3187
2015-01-29 16:00:00  0.8568 -0.6510
2015-01-30 00:00:00 -1.0342  0.6816
2015-01-30 08:00:00 -0.8034 -0.6895
2015-01-30 16:00:00 -0.4555  0.0175
2015-01-31 00:00:00 -0.3540 -1.3750
2015-01-31 08:00:00 -0.6436 -2.2234
2015-01-31 16:00:00  0.6252 -1.6021
2015-02-01 00:00:00 -1.1044  0.0522
2015-02-01 08:00:00 -0.7396  1.5430
2015-02-01 16:00:00 -1.2929  0.2671
2015-02-02 00:00:00 -0.0393 -1.1681
2015-02-02 08:00:00  0.5233 -0.1715
2015-02-02 16:00:00  0.7718  0.8235
2015-02-03 00:00:00  2.1632  1.3365

[100 rows x 2 columns]


# cutoff points, 10 equal-size group requires 11 points
# measured by timedelta 1 hour
time_delta_in_hours = (df.index - df.index[0]) / pd.Timedelta('1h')
n = 10
ts_cutoff = np.linspace(0, time_delta_in_hours[-1], n+1)
# labels, time index
time_index = df.index[0] + np.array([pd.Timedelta(str(time_delta)+'h') for time_delta in ts_cutoff])

# create a categorical reference variables
df['start_time_index'] = pd.cut(time_delta_in_hours, bins=10, labels=time_index[:-1])
# for clarity, reassign labels using end-period index
df['end_time_index'] = pd.cut(time_delta_in_hours, bins=10, labels=time_index[1:])

Out[89]: 
                          A       B    start_time_index      end_time_index
2015-01-01 00:00:00  1.7641  0.4002 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-01 08:00:00  0.9787  2.2409 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-01 16:00:00  1.8676 -0.9773 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-02 00:00:00  0.9501 -0.1514 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-02 08:00:00 -0.1032  0.4106 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-02 16:00:00  0.1440  1.4543 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-03 00:00:00  0.7610  0.1217 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-03 08:00:00  0.4439  0.3337 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-03 16:00:00  1.4941 -0.2052 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-04 00:00:00  0.3131 -0.8541 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-04 08:00:00 -2.5530  0.6536 2015-01-04 07:12:00 2015-01-07 14:24:00
2015-01-04 16:00:00  0.8644 -0.7422 2015-01-04 07:12:00 2015-01-07 14:24:00
2015-01-05 00:00:00  2.2698 -1.4544 2015-01-04 07:12:00 2015-01-07 14:24:00
2015-01-05 08:00:00  0.0458 -0.1872 2015-01-04 07:12:00 2015-01-07 14:24:00
2015-01-05 16:00:00  1.5328  1.4694 2015-01-04 07:12:00 2015-01-07 14:24:00
...                     ...     ...                 ...                 ...
2015-01-29 08:00:00  0.9209  0.3187 2015-01-27 09:36:00 2015-01-30 16:48:00
2015-01-29 16:00:00  0.8568 -0.6510 2015-01-27 09:36:00 2015-01-30 16:48:00
2015-01-30 00:00:00 -1.0342  0.6816 2015-01-27 09:36:00 2015-01-30 16:48:00
2015-01-30 08:00:00 -0.8034 -0.6895 2015-01-27 09:36:00 2015-01-30 16:48:00
2015-01-30 16:00:00 -0.4555  0.0175 2015-01-27 09:36:00 2015-01-30 16:48:00
2015-01-31 00:00:00 -0.3540 -1.3750 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-01-31 08:00:00 -0.6436 -2.2234 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-01-31 16:00:00  0.6252 -1.6021 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-01 00:00:00 -1.1044  0.0522 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-01 08:00:00 -0.7396  1.5430 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-01 16:00:00 -1.2929  0.2671 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-02 00:00:00 -0.0393 -1.1681 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-02 08:00:00  0.5233 -0.1715 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-02 16:00:00  0.7718  0.8235 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-03 00:00:00  2.1632  1.3365 2015-01-30 16:48:00 2015-02-03 00:00:00

[100 rows x 4 columns]

# sum only the numeric columns; the categorical label columns cannot be summed
df.groupby('start_time_index')[['A', 'B']].sum()

Out[90]: 
                          A       B
start_time_index                   
2015-01-01 00:00:00  8.6133  2.7734
2015-01-04 07:12:00  1.9220 -0.8069
2015-01-07 14:24:00 -8.1334  0.2318
2015-01-10 21:36:00 -2.7572 -4.2862
2015-01-14 04:48:00  1.1957  7.2285
2015-01-17 12:00:00  3.2485  6.6841
2015-01-20 19:12:00 -0.8903  2.2802
2015-01-24 02:24:00 -2.1025  1.3800
2015-01-27 09:36:00 -1.1017  1.3108
2015-01-30 16:48:00 -0.0902 -2.5178

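Another, potentially shorter way is to specify the sampling frequency as a time delta. The problem, as shown below, is that it yields 11 sub-samples instead of 10. I believe the reason is that resample implements a left-inclusive/right-exclusive (or left-exclusive/right-inclusive) sub-sampling scheme, so the last observation at '2015-02-03 00:00:00' is treated as a separate group. If we do the binning ourselves with pd.cut, we can specify include_lowest=True so that it gives exactly 10 sub-samples rather than 11.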

n = 10
time_delta_str = str((df.index[-1] - df.index[0]) / (pd.Timedelta('1s') * n)) + 's'
df.resample(pd.Timedelta(time_delta_str)).sum()

Out[114]: 
                          A       B
2015-01-01 00:00:00  8.6133  2.7734
2015-01-04 07:12:00  1.9220 -0.8069
2015-01-07 14:24:00 -8.1334  0.2318
2015-01-10 21:36:00 -2.7572 -4.2862
2015-01-14 04:48:00  1.1957  7.2285
2015-01-17 12:00:00  3.2485  6.6841
2015-01-20 19:12:00 -0.8903  2.2802
2015-01-24 02:24:00 -2.1025  1.3800
2015-01-27 09:36:00 -1.1017  1.3108
2015-01-30 16:48:00 -2.2534 -3.8543
2015-02-03 00:00:00  2.1632  1.3365
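The include_lowest=True idea mentioned above can be sketched as follows. This is one possible arrangement, not necessarily what the answer's author had in mind: the helper name `sum_in_n_bins` and the choice of seconds as the unit are assumptions made for illustration. Passing explicit edges from np.linspace together with include_lowest=True places both the first and the last timestamp inside the outer bins, so no extra group appears.

```python
import numpy as np
import pandas as pd

def sum_in_n_bins(frame, n):
    """Hypothetical helper: sum `frame` over n equal-width time bins."""
    # Seconds elapsed since the first timestamp, one value per row.
    elapsed = (frame.index - frame.index[0]).total_seconds()
    # n bins need n + 1 edges; the last edge is exactly the last timestamp.
    edges = np.linspace(0, elapsed[-1], n + 1)
    # Integer bin number per row; include_lowest=True keeps the first row in bin 0.
    bin_id = pd.cut(elapsed, bins=edges, labels=False, include_lowest=True)
    out = frame.groupby(bin_id).sum()
    # Label each group by the start time of its bin (empty bins are simply absent).
    starts = frame.index[0] + pd.to_timedelta(edges[:-1], unit='s')
    out.index = starts[out.index.to_numpy()]
    return out

# Same artificial data as in the answer: 100 rows, 8 hours apart.
np.random.seed(0)
demo = pd.DataFrame(np.random.randn(100, 2), columns=['A', 'B'],
                    index=pd.date_range('2015-01-01', periods=100, freq='8h'))
out = sum_in_n_bins(demo, 10)
print(len(out))   # 10
```

The bins here are closed on the right, so a row that falls exactly on an interior edge goes to the earlier bin, which is the opposite of the resample default.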

Answer 1 (score: 1)

import numpy as np
import pandas as pd

n = 10
nrows = 33
index = pd.date_range('2000-1-1', periods=nrows, freq='D')
df = pd.DataFrame(np.ones(nrows), index=index)
print(df)
#             0
# 2000-01-01  1
# 2000-01-02  1
# ...
# 2000-02-01  1
# 2000-02-02  1

first = df.index.min()
last = df.index.max() + pd.Timedelta('1D')
secs = int((last-first).total_seconds()//n)
periodsize = '{:d}S'.format(secs)

result = df.resample(periodsize).sum()
print('\n{}'.format(result))
assert len(result) == n

yields

                     0
2000-01-01 00:00:00  4
2000-01-04 07:12:00  3
2000-01-07 14:24:00  3
2000-01-10 21:36:00  4
2000-01-14 04:48:00  3
2000-01-17 12:00:00  3
2000-01-20 19:12:00  4
2000-01-24 02:24:00  3
2000-01-27 09:36:00  3
2000-01-30 16:48:00  3

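The values in the 0 column indicate the number of rows that were aggregated, since the original DataFrame was filled with the value 1. The pattern of 4s and 3s is about as even as we can get, since 33 rows cannot be evenly grouped into 10 groups.

Explanation: Consider this simpler DataFrame: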

n = 2
nrows = 5
index = pd.date_range('2000-1-1', periods=nrows, freq='D')
df = pd.DataFrame(np.ones(nrows), index=index)
#             0
# 2000-01-01  1
# 2000-01-02  1
# 2000-01-03  1
# 2000-01-04  1
# 2000-01-05  1

Using df.resample('2D').sum() gives the wrong number of groups:

In [366]: df.resample('2D').sum()
Out[366]: 
            0
2000-01-01  2
2000-01-03  2
2000-01-05  1

Using df.resample('3D').sum() gives the right number of groups, but the second group starts at 2000-01-04, which does not divide the DataFrame into two equally spaced groups:

In [367]: df.resample('3D').sum()
Out[367]: 
            0
2000-01-01  3
2000-01-04  2

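To do better, we need to work at a finer time resolution than days. Since Timedeltas have a total_seconds method, let's work in seconds. For the example above, the desired frequency string would be '216000S':

In [374]: df.resample('216000S').sum()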
Out[374]: 
                     0
2000-01-01 00:00:00  3
2000-01-03 12:00:00  2

since there are 216000 * 2 seconds in 5 days:

In [373]: (pd.Timedelta(days=5) / pd.Timedelta('1S'))/2
Out[373]: 216000.0

Okay, so now all we need is a way to generalize this. We will need the minimum and maximum dates in the index:

first = df.index.min()
last = df.index.max() + pd.Timedelta('1D')

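We add an extra day because it makes the difference in days come out right. In the example above, there are only 4 days between the Timestamps for 2000-01-05 and 2000-01-01,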

In [377]: (pd.Timestamp('2000-01-05')-pd.Timestamp('2000-01-01')).days
Out[378]: 4

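But as we can see in the worked example, the DataFrame has 5 rows, which represent 5 days. So it makes sense that we need to add an extra day.

Now we can compute the correct number of seconds in each equally spaced group with: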

secs = int((last-first).total_seconds()//n)
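Putting these steps together, a reusable helper might look like the sketch below. The function name `resample_into_n` and the `step` parameter are assumptions introduced here for illustration: `step` stands for the spacing between rows (one day in the examples above). The bin width is passed to resample as a pd.Timedelta rather than a frequency string, and the `.resample(...).sum()` spelling is used.

```python
import numpy as np
import pandas as pd

def resample_into_n(frame, n, step=pd.Timedelta('1D')):
    """Hypothetical helper: sum `frame` over n equally spaced periods.

    `step` is the assumed spacing between rows; it is added to the last
    timestamp so that the span covers every row, as explained above.
    """
    first = frame.index.min()
    last = frame.index.max() + step
    # Whole seconds per period, rounded down.
    secs = int((last - first).total_seconds() // n)
    return frame.resample(pd.Timedelta(seconds=secs)).sum()

# The 33-row daily example from this answer.
index = pd.date_range('2000-01-01', periods=33, freq='D')
df = pd.DataFrame(np.ones(33), index=index)
result = resample_into_n(df, 10)
print(len(result))   # 10
```

By default resample anchors its bins at midnight of the first day, so this sketch relies on the index starting at midnight, as it does in the examples. It also relies on `step` being large compared with the seconds lost to rounding down; with very fine-grained data an extra group could still appear.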