Question

I have a large DataFrame with a datetime index and need to resample the data into exactly 10 equally sized periods.

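So far, I have tried finding the first and last dates to determine the total number of days in the data, dividing that by 10 to get the size of each period, and then resampling with that number of days. For example: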

first = df.reset_index().timesubmit.min()
last = df.reset_index().timesubmit.max()
# whole days per period; integer division rounds down
periodsize = str((last - first).days // 10) + 'D'

df.resample(periodsize).sum()

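This does not guarantee exactly 10 periods in df after resampling, because the period size is an int that has been rounded down. Using a float does not work in resample. It seems that either I am missing something simple here, or I am attacking the problem the wrong way.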
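To make the problem concrete, here is a minimal sketch on made-up data. The 8-hour spacing and the `value` column are assumptions for illustration, not the real data. Rounding the period down to whole days leaves a remainder, so resample needs extra bins to cover the full range.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 100 rows, 8 hours apart (33 days from first to last row).
index = pd.date_range("2015-01-01", periods=100, freq="8h")
df = pd.DataFrame({"value": np.ones(100)}, index=index)

span_days = (df.index.max() - df.index.min()).days  # 33
period_days = span_days // 10                        # 3, rounded down from 3.3

result = df.resample(f"{period_days}D").sum()
print(len(result))  # 12 bins here, not the 10 that were wanted
```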

Answer 1

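Here is one way to ensure equal-size sub-periods: use np.linspace() on a pd.Timedelta to compute the cutoff points, then use pd.cut to classify each observation into its bin.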

import pandas as pd
import numpy as np

# generate artificial data
np.random.seed(0)
df = pd.DataFrame(
    np.random.randn(100, 2),
    columns=['A', 'B'],
    index=pd.date_range('2015-01-01 00:00:00', periods=100, freq='8h'),
)

Out[87]: 
                          A       B
2015-01-01 00:00:00  1.7641  0.4002
2015-01-01 08:00:00  0.9787  2.2409
2015-01-01 16:00:00  1.8676 -0.9773
2015-01-02 00:00:00  0.9501 -0.1514
2015-01-02 08:00:00 -0.1032  0.4106
2015-01-02 16:00:00  0.1440  1.4543
2015-01-03 00:00:00  0.7610  0.1217
2015-01-03 08:00:00  0.4439  0.3337
2015-01-03 16:00:00  1.4941 -0.2052
2015-01-04 00:00:00  0.3131 -0.8541
2015-01-04 08:00:00 -2.5530  0.6536
2015-01-04 16:00:00  0.8644 -0.7422
2015-01-05 00:00:00  2.2698 -1.4544
2015-01-05 08:00:00  0.0458 -0.1872
2015-01-05 16:00:00  1.5328  1.4694
...                     ...     ...
2015-01-29 08:00:00  0.9209  0.3187
2015-01-29 16:00:00  0.8568 -0.6510
2015-01-30 00:00:00 -1.0342  0.6816
2015-01-30 08:00:00 -0.8034 -0.6895
2015-01-30 16:00:00 -0.4555  0.0175
2015-01-31 00:00:00 -0.3540 -1.3750
2015-01-31 08:00:00 -0.6436 -2.2234
2015-01-31 16:00:00  0.6252 -1.6021
2015-02-01 00:00:00 -1.1044  0.0522
2015-02-01 08:00:00 -0.7396  1.5430
2015-02-01 16:00:00 -1.2929  0.2671
2015-02-02 00:00:00 -0.0393 -1.1681
2015-02-02 08:00:00  0.5233 -0.1715
2015-02-02 16:00:00  0.7718  0.8235
2015-02-03 00:00:00  2.1632  1.3365

[100 rows x 2 columns]


# cutoff points, 10 equal-size group requires 11 points
# measured by timedelta 1 hour
time_delta_in_hours = (df.index - df.index[0]) / pd.Timedelta('1h')
n = 10
ts_cutoff = np.linspace(0, time_delta_in_hours[-1], n+1)
# labels, time index
time_index = df.index[0] + np.array([pd.Timedelta(str(time_delta)+'h') for time_delta in ts_cutoff])

# create a categorical reference variables
df['start_time_index'] = pd.cut(time_delta_in_hours, bins=10, labels=time_index[:-1])
# for clarity, reassign labels using end-period index
df['end_time_index'] = pd.cut(time_delta_in_hours, bins=10, labels=time_index[1:])

Out[89]: 
                          A       B    start_time_index      end_time_index
2015-01-01 00:00:00  1.7641  0.4002 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-01 08:00:00  0.9787  2.2409 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-01 16:00:00  1.8676 -0.9773 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-02 00:00:00  0.9501 -0.1514 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-02 08:00:00 -0.1032  0.4106 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-02 16:00:00  0.1440  1.4543 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-03 00:00:00  0.7610  0.1217 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-03 08:00:00  0.4439  0.3337 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-03 16:00:00  1.4941 -0.2052 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-04 00:00:00  0.3131 -0.8541 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-04 08:00:00 -2.5530  0.6536 2015-01-04 07:12:00 2015-01-07 14:24:00
2015-01-04 16:00:00  0.8644 -0.7422 2015-01-04 07:12:00 2015-01-07 14:24:00
2015-01-05 00:00:00  2.2698 -1.4544 2015-01-04 07:12:00 2015-01-07 14:24:00
2015-01-05 08:00:00  0.0458 -0.1872 2015-01-04 07:12:00 2015-01-07 14:24:00
2015-01-05 16:00:00  1.5328  1.4694 2015-01-04 07:12:00 2015-01-07 14:24:00
...                     ...     ...                 ...                 ...
2015-01-29 08:00:00  0.9209  0.3187 2015-01-27 09:36:00 2015-01-30 16:48:00
2015-01-29 16:00:00  0.8568 -0.6510 2015-01-27 09:36:00 2015-01-30 16:48:00
2015-01-30 00:00:00 -1.0342  0.6816 2015-01-27 09:36:00 2015-01-30 16:48:00
2015-01-30 08:00:00 -0.8034 -0.6895 2015-01-27 09:36:00 2015-01-30 16:48:00
2015-01-30 16:00:00 -0.4555  0.0175 2015-01-27 09:36:00 2015-01-30 16:48:00
2015-01-31 00:00:00 -0.3540 -1.3750 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-01-31 08:00:00 -0.6436 -2.2234 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-01-31 16:00:00  0.6252 -1.6021 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-01 00:00:00 -1.1044  0.0522 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-01 08:00:00 -0.7396  1.5430 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-01 16:00:00 -1.2929  0.2671 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-02 00:00:00 -0.0393 -1.1681 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-02 08:00:00  0.5233 -0.1715 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-02 16:00:00  0.7718  0.8235 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-03 00:00:00  2.1632  1.3365 2015-01-30 16:48:00 2015-02-03 00:00:00

[100 rows x 4 columns]

# sum only the numeric columns; the categorical label columns cannot be summed
df.groupby('start_time_index')[['A', 'B']].sum()

Out[90]: 
                          A       B
start_time_index                   
2015-01-01 00:00:00  8.6133  2.7734
2015-01-04 07:12:00  1.9220 -0.8069
2015-01-07 14:24:00 -8.1334  0.2318
2015-01-10 21:36:00 -2.7572 -4.2862
2015-01-14 04:48:00  1.1957  7.2285
2015-01-17 12:00:00  3.2485  6.6841
2015-01-20 19:12:00 -0.8903  2.2802
2015-01-24 02:24:00 -2.1025  1.3800
2015-01-27 09:36:00 -1.1017  1.3108
2015-01-30 16:48:00 -0.0902 -2.5178

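Another, potentially shorter, way to do this is to specify the sampling frequency as a time delta. The problem, as shown below, is that it delivers 11 sub-samples instead of 10. I believe the reason is that resample implements a left-inclusive/right-exclusive (or left-exclusive/right-inclusive) sub-sampling scheme, so the very last observation at '2015-02-03 00:00:00' is treated as a separate group. If we use pd.cut to do it ourselves, we can specify include_lowest=True so that it gives exactly 10 sub-samples rather than 11.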

n = 10
time_delta_str = str((df.index[-1] - df.index[0]) / (pd.Timedelta('1s') * n)) + 's'
# aggregate only the numeric columns A and B
df[['A', 'B']].resample(pd.Timedelta(time_delta_str)).sum()

Out[114]: 
                          A       B
2015-01-01 00:00:00  8.6133  2.7734
2015-01-04 07:12:00  1.9220 -0.8069
2015-01-07 14:24:00 -8.1334  0.2318
2015-01-10 21:36:00 -2.7572 -4.2862
2015-01-14 04:48:00  1.1957  7.2285
2015-01-17 12:00:00  3.2485  6.6841
2015-01-20 19:12:00 -0.8903  2.2802
2015-01-24 02:24:00 -2.1025  1.3800
2015-01-27 09:36:00 -1.1017  1.3108
2015-01-30 16:48:00 -2.2534 -3.8543
2015-02-03 00:00:00  2.1632  1.3365
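The pd.cut idea with include_lowest=True can also be written with datetime bin edges directly, which avoids the round trip through hours. This is only a sketch on the same artificial data; `edges` and `bins` are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(
    np.random.randn(100, 2),
    columns=["A", "B"],
    index=pd.date_range("2015-01-01", periods=100, freq="8h"),
)

n = 10
# n equal-width periods need n + 1 edges between the first and last timestamp.
edges = pd.date_range(df.index[0], df.index[-1], periods=n + 1)

# include_lowest=True keeps the very first row in the first bin;
# the default right=True keeps the very last row in the last bin.
bins = pd.cut(df.index, bins=edges, labels=edges[:-1], include_lowest=True)

result = df.groupby(bins, observed=False)[["A", "B"]].sum()
print(result)
```

Because both end points are covered by a bin, every row is assigned to one of the n groups and no extra group appears for the last timestamp.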

Answer 2

import numpy as np
import pandas as pd

n = 10
nrows = 33
index = pd.date_range('2000-1-1', periods=nrows, freq='D')
df = pd.DataFrame(np.ones(nrows), index=index)
print(df)
#             0
# 2000-01-01  1
# 2000-01-02  1
# ...
# 2000-02-01  1
# 2000-02-02  1

first = df.index.min()
last = df.index.max() + pd.Timedelta('1D')
secs = int((last-first).total_seconds()//n)
periodsize = '{:d}s'.format(secs)

result = df.resample(periodsize).sum()
print('\n{}'.format(result))
assert len(result) == n

yields

                     0
2000-01-01 00:00:00  4
2000-01-04 07:12:00  3
2000-01-07 14:24:00  3
2000-01-10 21:36:00  4
2000-01-14 04:48:00  3
2000-01-17 12:00:00  3
2000-01-20 19:12:00  4
2000-01-24 02:24:00  3
2000-01-27 09:36:00  3
2000-01-30 16:48:00  3

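The values in the 0 column indicate the number of rows that were aggregated, since the original DataFrame was filled with the value 1. The pattern of 4s and 3s is about as even as it can get, since 33 rows cannot be split evenly into 10 groups.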

Explanation: consider this simpler DataFrame:

n = 2
nrows = 5
index = pd.date_range('2000-1-1', periods=nrows, freq='D')
df = pd.DataFrame(np.ones(nrows), index=index)
#             0
# 2000-01-01  1
# 2000-01-02  1
# 2000-01-03  1
# 2000-01-04  1
# 2000-01-05  1

Using df.resample('2D').sum() gives the wrong number of groups:

In [366]: df.resample('2D').sum()
Out[366]: 
            0
2000-01-01  2
2000-01-03  2
2000-01-05  1

Using df.resample('3D').sum() gives the right number of groups, but the second group starts at 2000-01-04, which does not divide the DataFrame into two equally spaced groups:

In [367]: df.resample('3D').sum()
Out[367]: 
            0
2000-01-01  3
2000-01-04  2

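To do better, we need to work at a finer time resolution than days. Since a Timedelta has a total_seconds method, let's work in seconds. For the example above, the desired frequency string is '216000s':

In [374]: df.resample('216000s').sum()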
Out[374]: 
                     0
2000-01-01 00:00:00  3
2000-01-03 12:00:00  2

since there are 216000 * 2 seconds in 5 days:

In [373]: (pd.Timedelta(days=5) / pd.Timedelta('1s'))/2
Out[373]: 216000.0

Okay, now all we need is a way to generalize this. We need the minimum and maximum dates in the index:

first = df.index.min()
last = df.index.max() + pd.Timedelta('1D')

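We add an extra day because it makes the difference in days come out right. In the example above, there are only 4 days between the Timestamps for 2000-01-05 and 2000-01-01: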

In [377]: (pd.Timestamp('2000-01-05')-pd.Timestamp('2000-01-01')).days
Out[378]: 4

But as we can see in the worked example, the DataFrame has 5 rows, which represent 5 days. So it makes sense that we need to add an extra day.
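The effect of the extra day can be checked on the 5-row example. The sketch below compares the period length with and without it; `secs_short` and `secs_full` are names made up for this comparison.

```python
import numpy as np
import pandas as pd

n = 2
index = pd.date_range("2000-01-01", periods=5, freq="D")
df = pd.DataFrame(np.ones(5), index=index)

first = df.index.min()
last_row = df.index.max()

# Without the extra day the span is 4 days, so each period is 2 days long.
secs_short = int((last_row - first).total_seconds() // n)
# With the extra day the span is 5 days, so each period is 2.5 days long.
secs_full = int((last_row + pd.Timedelta("1D") - first).total_seconds() // n)

print(secs_short, len(df.resample(f"{secs_short}s").sum()))  # 172800 3
print(secs_full, len(df.resample(f"{secs_full}s").sum()))    # 216000 2
```

Without the extra day the last row sits exactly on a bin edge and opens a third group, which is the same off-by-one seen with '2D'.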

Now we can compute the correct number of seconds in each equally spaced group with:

secs = int((last-first).total_seconds()//n)
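Putting the steps together, a small helper might look like the following. This is a sketch, not part of pandas: the function name `resample_into_n_periods` and its `step` parameter are hypothetical, and `step` is assumed to equal the spacing between rows (one day in the example). The `origin='start'` argument (available in pandas 1.1 and later) is an addition here, so that the bins are anchored at the first timestamp even when it is not at midnight.

```python
import numpy as np
import pandas as pd


def resample_into_n_periods(df, n, step=pd.Timedelta("1D")):
    """Sum df into n equal-length periods (hypothetical helper).

    `step` is the assumed spacing between rows; it is added to the last
    timestamp so that the final row falls inside the last period.
    """
    first = df.index.min()
    last = df.index.max() + step
    secs = int((last - first).total_seconds() // n)
    # origin="start" anchors the first bin at the first timestamp.
    return df.resample(f"{secs}s", origin="start").sum()


# Same data as the 33-row example above.
index = pd.date_range("2000-01-01", periods=33, freq="D")
df = pd.DataFrame(np.ones(33), index=index)

result = resample_into_n_periods(df, n=10)
print(result)
print(len(result))  # 10
```

If the total number of seconds is not divisible by n, the floor division shortens each period slightly, so it is worth checking len(result) == n on real data.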

How do I resample a df with a datetime index into n equally sized periods?
