Question

Suppose I have a pandas.DataFrame containing three days of hourly data:

import pandas as pd
import numpy as np
import datetime as dt
dates = pd.date_range('20130101', periods=3*24, freq='H')
df = pd.DataFrame(np.random.randn(3*24,2),index=dates,columns=list('AB'))

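I want to take the data in chunks of, say, 6 hours and fit a curve to each chunk independently. Since the how keyword of pandas' resample function is supposed to accept any numpy array function, I thought I could use resample to do this with polyfit, but apparently there is no way to do that (right?).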

So the only alternative I can think of is to split df into a sequence of DataFrames, and I am trying to write a function that works like this:

l=splitDF(df, '6H')

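It would return a list of DataFrames, each holding 6 hours of data (except possibly the first and the last). So far I have nothing that works other than the following manual approach: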

def splitDF(data, rule):
    res_index=data.resample(rule).index
    out=[]
    cont=0
    for date in data.index:
        ... check for date in res_index ...
        ... and start cutting at those points ...

But this approach would be very slow, and there is probably a faster way. Is there a fast (or even Pythonic) way to do this?

Thanks!

EDIT

A better approach (it still needs some improvement, but it is faster) is the following:

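def splitDF(data, rule):
    # resample() returns a Resampler in current pandas, so aggregate to get the bin labels
    res_index=data.resample(rule).mean().index
    out=[]
    pdate=res_index[0]
    for date in res_index:
        # slice up to the next bin label, dropping the label row itself
        out.append(data[pdate:date][:-1])
        pdate=date
    out.append(data[pdate:])
    return out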

But it seems to me there should be a better way.

Answer 1

OK, so this sounds like a textbook case for groupby. Here is what I came up with:

import numpy as np
import pandas as pd

# label each row of a datetime-indexed dataframe with its date and its hour block
def create_date_hour_groups(df, hr):
    new_df = df.copy()
    hr_int = int(hr)
    # integer division: with hr=6, hours 0-5 -> 0, 6-11 -> 1, and so on
    new_df['hr_group'] = new_df.index.hour // hr_int
    new_df['dt_group'] = new_df.index.date
    return new_df

# now we define a wrapper for polyfit to pass to groupby.apply
def polyfit_x_y(df, x_col='A', y_col='B', poly_deg=3):
    df_new = df.copy()
    # pd.np is gone from current pandas, so call numpy directly
    coef_array = np.polyfit(df_new[x_col], df_new[y_col], poly_deg)
    poly_func = np.poly1d(coef_array)
    df_new['poly_fit'] = poly_func(df[x_col])
    return df_new

# to the actual stuff
dates = pd.date_range('20130101', periods=3*24, freq='H')
df = pd.DataFrame(np.random.randn(3*24,2),index=dates,columns=list('AB'))
df = create_date_hour_groups(df, 6)
df_fit = df.groupby(['dt_group', 'hr_group'],
                    as_index=False).apply(polyfit_x_y)
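
Note that the code above fits column B against column A inside each block. If what you are after is a fit of one column against time, here is a rough sketch that groups directly on the offset string with pd.Grouper instead of building helper columns. The names fit_block, fits and coeffs are made up for illustration, the degree 2 is arbitrary, and the lowercase 'h' alias assumes a reasonably recent pandas (older releases spell it 'H').

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=3 * 24, freq='h')
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((3 * 24, 2)), index=dates, columns=list('AB'))

def fit_block(chunk, y_col='B', poly_deg=2):
    """Hypothetical helper: fit y_col against hours elapsed since the chunk start."""
    hours = np.asarray((chunk.index - chunk.index[0]).total_seconds()) / 3600.0
    return np.polyfit(hours, chunk[y_col].to_numpy(), poly_deg)

fits = {}
for start, chunk in df.groupby(pd.Grouper(freq='6h')):
    if len(chunk) > 2:  # skip bins with too few points for a degree-2 fit
        fits[start] = fit_block(chunk)

# one row per 6-hour block, coefficients ordered from highest degree to constant
coeffs = pd.DataFrame.from_dict(fits, orient='index', columns=['c2', 'c1', 'c0'])
```

Each row of coeffs is labelled with the start of its 6-hour block, so coeffs.loc['2013-01-01 06:00'] gives the fit for the second block.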

Answer 2

How about this?

np.array_split(df, len(df) // 6)
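
Keep in mind that array_split cuts by row count, so it only lines up with 6-hour blocks when the index is regular and complete. If you want the splitDF(df, '6H') behaviour from the question, i.e. a list of DataFrames cut on an offset string, something along these lines might do. It is only a sketch: split_df is a hypothetical name, dropping empty bins is my own choice, and the lowercase '6h' assumes a recent pandas.

```python
import numpy as np
import pandas as pd

def split_df(data, rule):
    """Hypothetical helper: split a datetime-indexed frame into one frame per time bin."""
    groups = data.groupby(pd.Grouper(freq=rule))
    # empty bins (gaps in the index) are dropped rather than returned as empty frames
    return [chunk for _, chunk in groups if not chunk.empty]

dates = pd.date_range('20130101', periods=3 * 24, freq='h')
df = pd.DataFrame(np.random.randn(3 * 24, 2), index=dates, columns=list('AB'))

chunks = split_df(df, '6h')
```

Every element of chunks is an ordinary DataFrame, so you can call np.polyfit on each one in a plain loop.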

