Question

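I am working with high-frequency time series data and want to extract all the business days from it. My observations are at one-second resolution, so there are 86,400 per day, and the dataset spans 31 days (2,678,400 observations in total!).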

Here is (part of) my data:

In[1]: ts
Out[1]: 
2013-01-01 00:00:00    0.480928
2013-01-01 00:00:01    0.480928
2013-01-01 00:00:02    0.483977
2013-01-01 00:00:03    0.486725
2013-01-01 00:00:04    0.486725
...
2013-01-31 23:59:56    0.451630
2013-01-31 23:59:57    0.451630
2013-01-31 23:59:58    0.451630
2013-01-31 23:59:59    0.454683
Freq: S, Length: 2678400

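What I want is a new time series containing only the business days of the month, with every per-second timestamp and value on those days kept as-is. For example, if 2013-01-02 (Wed) through 2013-01-04 (Fri) are the first business days of the first week of January, then: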

2013-01-02 00:00:00    0.507477
2013-01-02 00:00:01    0.501373
...
2013-01-03 00:00:00    0.489778
2013-01-03 00:00:01    0.489778
...
2013-01-04 23:59:58    0.598115
2013-01-04 23:59:59    0.598115
Freq: S, Length: 259200

So it would of course exclude all data from Saturday 2013-01-05 and Sunday 2013-01-06, since those are weekend days, and so on.

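I tried several pandas built-ins but could not find the right one, because they aggregate by day without taking into account that each day contains many sub-entries. That is, every second has its own value, and those values should not be averaged, just combined into a new series.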

For example, I tried:

ts.asfreq(BDay()) -> finds the business days, but averages over each day
ts.resample() -> you have to define 'how' (mean, max, min, ...)
ts.groupby(lambda x : x.weekday) -> doesn't do it either!
ts = pd.Series(df, index = pd.bdate_range(start = '2013/01/01 00:00:00', end = '2013/01/31 23:59:59' , freq = 'S')) -> df because my original data is a DataFrame. Using pd.bdate_range doesn't help, because df and index must have the same dimension.

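I searched the pandas documentation and googled, but could not find a clue...
Does anyone have an idea?

I would really appreciate your help!

Thanks!

P.S. I would rather not use a loop, because my dataset is very large... (and I have other months to analyze as well)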

Answer 1

Unfortunately this is a bit slow, but it should at least give you the answer you are looking for.

#create an index of just the date portion of your index (this is the slow step)
ts_days = pd.to_datetime(ts.index.date)

#create a range of business days over that period
bdays = pd.bdate_range(start=ts.index[0].date(), end=ts.index[-1].date())

#Filter the series to just those days contained in the business day range.
ts = ts[ts_days.isin(bdays)]
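
A possible alternative, not part of the original answer: since bdate_range covers Monday through Friday by default, the same filter can be sketched as a weekday mask on the DatetimeIndex, which avoids building the per-day date objects. The synthetic series below is an assumption made only for illustration, and holidays are not handled.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the question's data (an assumption):
# one value per second for January 2013.
index = pd.date_range("2013-01-01", "2013-01-31 23:59:59", freq="s")
ts = pd.Series(np.random.rand(len(index)), index=index)

# dayofweek is 0 for Monday ... 6 for Sunday, so < 5 keeps Monday-Friday.
weekday_mask = ts.index.dayofweek < 5
ts_bdays = ts[weekday_mask]

print(len(ts_bdays))  # 23 weekdays * 86400 seconds
```

Every per-second row on a weekday is kept unchanged; nothing is aggregated.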

Answer 2

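Modern pandas stores timestamps as numpy.datetime64 with a nanosecond time unit (you can verify this by inspecting ts.index.values). Converting both the original index and the index generated by bdate_range to a daily time unit ([D]) and checking membership on those two arrays is much faster: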

import numpy as np
import pandas

def _get_days_array(index):
    "Convert the index to a datetime64[D] array"
    return index.values.astype('<M8[D]')

def retain_business_days(ts):
    "Retain only the business days"
    tsdays = _get_days_array(ts.index) 
    bdays = _get_days_array(pandas.bdate_range(tsdays[0], tsdays[-1]))
    mask = np.isin(tsdays, bdays)
    return ts[mask]
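
As a rough usage sketch, the day-truncation idea can be checked on a small example. The hourly synthetic series and the name keep_business_days are assumptions for illustration, not from the answer. This version uses np.isin for the membership test and takes the range endpoints from the index minimum and maximum, so it does not rely on the index being sorted.

```python
import numpy as np
import pandas as pd

def keep_business_days(ts):
    """Keep only rows whose calendar day is a Monday-Friday business day."""
    # Truncate each timestamp to its calendar day.
    ts_days = ts.index.values.astype("datetime64[D]")
    # Business days spanning the series, truncated the same way.
    # bdate_range normalizes its endpoints to midnight by default.
    bdays = pd.bdate_range(ts.index.min(), ts.index.max())
    bdays = bdays.values.astype("datetime64[D]")
    return ts[np.isin(ts_days, bdays)]

# Hypothetical data: hourly values for 2013-01-01 (Tue) to 2013-01-07 (Mon).
index = pd.date_range("2013-01-01", "2013-01-07 23:00", freq="h")
ts = pd.Series(np.arange(len(index), dtype=float), index=index)

result = keep_business_days(ts)
print(len(ts), len(result))  # 168 rows in, 120 out (Sat 5th and Sun 6th dropped)
```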

Extracting business days from a time series using Python / Pandas
