Daily data, resample every 3 days, efficiently compute over the trailing 5 days

Time: 2016-10-24 01:21:07

Tags: python pandas numpy

Consider the DataFrame df

tidx = pd.date_range('2012-12-31', periods=11, freq='D')
df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx)
df

I want to calculate a sum over the trailing 5 days, every 3 days.

I'd like the result to look something like this

[image: expected output]

Edited
What I had was incorrect. @ivan_pozdeev and @boud noticed that it described a centered window, which was not my intent. Apologies for the confusion.
Everyone's solutions capture much of what I was after.

Criteria

  • I'm looking for smart, efficient solutions that can scale to large data sets.

  • I'll be timing solutions, but I'll also factor in elegance.

  • Solutions should also generalize to a variety of sample and look-back frequencies.

From the comments

  • I want a solution that generalizes to handle a look-back at a specified frequency and grabs whatever falls within that look-back.
    • For the sample above, the look-back is 5D, and there may be 4 or 50 observations that fall within it.
  • I want the timestamp to be that of the last observation within the look-back period.
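For reference, pandas' time-based rolling windows come close to this spec: the window is a time span rather than a row count, so any number of observations can fall inside it. A sketch (my own, not from the question; the searchsorted step is one way to realize "timestamp of the last observation in the look-back", under the stated 5D/3D example):

```python
import numpy as np
import pandas as pd

tidx = pd.date_range('2012-12-31', periods=11, freq='D')
df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx)

# Time-based trailing window: each row's sum covers the previous 5 calendar
# days, however many observations fall inside that span
trailing = df.rolling('5D').sum()

# Sample every 3 days, starting at the end of the first full window, and
# keep the last observed timestamp at or before each sample date
sample_dates = pd.date_range(df.index[0] + pd.Timedelta('4D'),
                             df.index[-1], freq='3D')
pos = trailing.index.searchsorted(sample_dates, side='right') - 1
result = trailing.iloc[pos]
print(result)
```

On the example data this yields sums 10, 25, 40 stamped at 2013-01-04, 2013-01-07 and 2013-01-10.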

6 answers:

Answer 0 (score: 9)

The df you gave us is:

             A
2012-12-31   0
2013-01-01   1
2013-01-02   2
2013-01-03   3
2013-01-04   4
2013-01-05   5
2013-01-06   6
2013-01-07   7
2013-01-08   8
2013-01-09   9
2013-01-10  10

You could create the rolling 5-day sum series and then resample it. I can't think of a more efficient way to do it. Overall this should be relatively time-efficient.

df.rolling(5,min_periods=5).sum().dropna().resample('3D').first()
Out[36]: 
                 A
2013-01-04 10.0000
2013-01-07 25.0000
2013-01-10 40.0000
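Since the question asks for solutions that generalize across sample and look-back frequencies, the one-liner can be parameterized; a sketch (the names W and S are mine):

```python
import numpy as np
import pandas as pd

def rolling_then_resample(df, W=5, S=3):
    # Rolling W-period sum, drop incomplete windows, then keep every S-th day
    return df.rolling(W, min_periods=W).sum().dropna().resample(f'{S}D').first()

tidx = pd.date_range('2012-12-31', periods=11, freq='D')
df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx)
res = rolling_then_resample(df)
print(res)
```

For the question's df this reproduces the 10.0 / 25.0 / 40.0 output above; other W/S values drop in without changes.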

Answer 1 (score: 5)

Listed in this post are several NumPy-based solutions using bin-based summing, covering essentially three scenarios.

Scenario #1: Multiple entries per date, but no missing dates

Approach #1:

# For now hard-coded to use Window size of 5 and stride length of 3
def vectorized_app1(df):
    # Extract the index names and values
    vals = df.A.values
    indx = df.index.values

    # Extract IDs for bin based summing
    mask = np.append(False,indx[1:] > indx[:-1])
    date_id = mask.cumsum()
    search_id = np.hstack((0,np.arange(2,date_id[-1],3),date_id[-1]+1))
    shifts = np.searchsorted(date_id,search_id)
    reps = shifts[1:] - shifts[:-1]
    id_arr = np.repeat(np.arange(len(reps)),reps)

    # Perform bin based summing and subtract the repeated ones
    IDsums = np.bincount(id_arr,vals)
    allsums = IDsums[:-1] + IDsums[1:]
    allsums[1:] -= np.bincount(date_id,vals)[search_id[1:-2]]

    # Convert to pandas dataframe if needed
    out_index = indx[np.nonzero(mask)[0][3::3]] # Use last date of group
    return pd.DataFrame(allsums,index=out_index,columns=['A'])

Approach #2:

# For now hard-coded to use Window size of 5 and stride length of 3
def vectorized_app2(df):
    # Extract the index names and values
    indx = df.index.values

    # Extract IDs for bin based summing
    mask = np.append(False,indx[1:] > indx[:-1])
    date_id = mask.cumsum()

    # Generate IDs at which shifts are to happen for a (2,3,5,8..) pattern
    # Pad with 0 and length of array at either ends as we use diff later on
    shiftIDs = (np.arange(2,date_id[-1],3)[:,None] + np.arange(2)).ravel()
    search_id = np.hstack((0,shiftIDs,date_id[-1]+1))

    # Find the start of those shifting indices    
    # Generate ID based on shifts and do bin based summing of dataframe
    shifts = np.searchsorted(date_id,search_id)
    reps = shifts[1:] - shifts[:-1]
    id_arr = np.repeat(np.arange(len(reps)),reps)    
    IDsums = np.bincount(id_arr,df.A.values)

    # Sum each group of 3 elems with a stride of 2, make dataframe if needed
    allsums = IDsums[:-1:2] + IDsums[1::2] + IDsums[2::2]    

    # Convert to pandas dataframe if needed
    out_index = indx[np.nonzero(mask)[0][3::3]] # Use last date of group
    return pd.DataFrame(allsums,index=out_index,columns=['A'])

Approach #3:

def vectorized_app3(df, S=3, W=5):
    dt = df.index.values
    shifts = np.append(False,dt[1:] > dt[:-1])
    c = np.bincount(shifts.cumsum(),df.A.values)
    out = np.convolve(c,np.ones(W,dtype=int),'valid')[::S]
    out_index = dt[np.nonzero(shifts)[0][W-2::S]]
    return pd.DataFrame(out,index=out_index,columns=['A'])
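A quick sanity check of this approach on the question's df (the function is restated so the snippet runs standalone); it reproduces the expected 5-day sums at 3-day intervals:

```python
import numpy as np
import pandas as pd

def vectorized_app3(df, S=3, W=5):
    dt = df.index.values
    shifts = np.append(False, dt[1:] > dt[:-1])
    # Per-date totals via bincount, then a W-wide convolution strided by S
    c = np.bincount(shifts.cumsum(), df.A.values)
    out = np.convolve(c, np.ones(W, dtype=int), 'valid')[::S]
    out_index = dt[np.nonzero(shifts)[0][W - 2::S]]
    return pd.DataFrame(out, index=out_index, columns=['A'])

tidx = pd.date_range('2012-12-31', periods=11, freq='D')
df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx)
print(vectorized_app3(df))
```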

A modified version of it that replaces the convolution part with direct sliced summation -

def vectorized_app3_v2(df, S=3, W=5):  
    dt = df.index.values
    shifts = np.append(False,dt[1:] > dt[:-1])
    c = np.bincount(shifts.cumsum(),df.A.values)
    f = c.size+S-W
    out = c[:f:S].copy()
    for i in range(1,W):
        out += c[i:f+i:S]
    out_index = dt[np.nonzero(shifts)[0][W-2::S]]
    return pd.DataFrame(out,index=out_index,columns=['A'])

Scenario #2: Multiple entries per date and missing dates

Approach #4:

def vectorized_app4(df, S=3, W=5):
    dt = df.index.values
    indx = np.append(0,((dt[1:] - dt[:-1])//86400000000000).astype(int)).cumsum()
    WL = ((indx[-1]+1)//S)
    c = np.bincount(indx,df.A.values,minlength=S*WL+(W-S))
    out = np.convolve(c,np.ones(W,dtype=int),'valid')[::S]
    grp0_lastdate = dt[0] + np.timedelta64(W-1,'D')
    freq_str = str(S)+'D'
    grp_last_dt = pd.date_range(grp0_lastdate, periods=WL, freq=freq_str).values
    out_index = dt[dt.searchsorted(grp_last_dt,'right')-1]
    return pd.DataFrame(out,index=out_index,columns=['A'])
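The constant 86400000000000 in the index line is just the number of nanoseconds in a day; datetime64[ns] differences can be cast to whole days more readably. A small standalone sketch of the equivalent (the example dates are mine):

```python
import numpy as np
import pandas as pd

# datetime64[ns] index with a duplicate date and a gap
dt = pd.to_datetime(['2013-01-01', '2013-01-02',
                     '2013-01-02', '2013-01-05']).values

# Original: np.append(0, ((dt[1:]-dt[:-1])//86400000000000).astype(int)).cumsum()
# Equivalent: cast the deltas to whole days explicitly
indx = np.append(0, (dt[1:] - dt[:-1]).astype('timedelta64[D]')
                 .astype(int)).cumsum()
print(indx)  # → [0 1 1 4]  (day offsets from the first date)
```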

Scenario #3: Consecutive dates and exactly one entry per date

Approach #5:

def vectorized_app5(df, S=3, W=5):
    vals = df.A.values
    N = (df.shape[0]-W+2*S-1)//S
    n = vals.strides[0]
    out = np.lib.stride_tricks.as_strided(vals,shape=(N,W),\
                                        strides=(S*n,n)).sum(1)
    index_idx = (W-1)+S*np.arange(N)
    out_index = df.index[index_idx]
    return pd.DataFrame(out,index=out_index,columns=['A'])
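On NumPy 1.20+ the same strided view can be obtained without computing strides by hand, via numpy.lib.stride_tricks.sliding_window_view; a sketch under the same Scenario #3 assumptions (one entry per consecutive date; the function name is mine):

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

def vectorized_app5_swv(df, S=3, W=5):
    vals = df.A.values
    # All full trailing windows of length W, then keep every S-th one
    windows = sliding_window_view(vals, W)[::S]
    out = windows.sum(axis=1)
    out_index = df.index[W - 1::S][:len(out)]
    return pd.DataFrame(out, index=out_index, columns=['A'])

tidx = pd.date_range('2012-12-31', periods=11, freq='D')
df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx)
print(vectorized_app5_swv(df))
```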

Suggestions for creating test data

Scenario #1:

# Setup input for multiple dates, but no missing dates
S = 4 # Stride length (Could be edited)
W = 7 # Window length (Could be edited)
datasize = 3  # Decides datasize
tidx = pd.date_range('2012-12-31', periods=datasize*S + W-S, freq='D')
start_df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx)
reps = np.random.randint(1,4,(len(start_df)))
idx0 = np.repeat(start_df.index,reps)
df_data = np.random.randint(0,9,(len(idx0)))
df = pd.DataFrame(df_data,index=idx0,columns=['A'])

Scenario #2:

To create the setup for multiple entries per date with missing dates, we just need to edit the df_data creation step, like so -

df_data = np.random.randint(0,9,(len(idx0)))

Scenario #3:

# Setup input for exactly one entry per date
S = 4 # Could be edited
W = 7
datasize = 3  # Decides datasize
tidx = pd.date_range('2012-12-31', periods=datasize*S + W-S, freq='D')
df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx)

Answer 2 (score: 4)

For regularly-spaced dates only

Here are two approaches: first, the pandas way; second, a numpy function.

>>> n=5   # trailing periods for rolling sum
>>> k=3   # frequency of rolling sum calc

>>> df.rolling(n).sum()[-1::-k][::-1]

               A
2013-01-01   NaN
2013-01-04  10.0
2013-01-07  25.0
2013-01-10  40.0

Here is a numpy function (adapted from Jaime's numpy moving_average):

def rolling_sum(a, n=5, k=3):
    ret = np.cumsum(a.values)
    ret[n:] = ret[n:] - ret[:-n]
    return pd.DataFrame( ret[n-1:][-1::-k][::-1], 
                         index=a[n-1:][-1::-k][::-1].index )

rolling_sum(df,n=6,k=4)   # default n=5, k=3
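On the example df, the numpy function and the pandas one-liner agree; a quick standalone check (rolling_sum is restated verbatim so the snippet runs on its own):

```python
import numpy as np
import pandas as pd

def rolling_sum(a, n=5, k=3):
    # Cumulative-sum trick: trailing n-sum at i is cumsum[i] - cumsum[i-n]
    ret = np.cumsum(a.values)
    ret[n:] = ret[n:] - ret[:-n]
    return pd.DataFrame(ret[n - 1:][-1::-k][::-1],
                        index=a[n - 1:][-1::-k][::-1].index)

tidx = pd.date_range('2012-12-31', periods=11, freq='D')
df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx)

np_res = rolling_sum(df)
pd_res = df.rolling(5).sum().dropna().resample('3D').first()
assert (np_res[0].values == pd_res['A'].values).all()
print(np_res)
```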

For irregularly-spaced dates (or regularly-spaced)

Simply precede with:

df.resample('D').sum().fillna(0)

For example, the above methods become:

df.resample('D').sum().fillna(0).rolling(n).sum()[-1::-k][::-1]

rolling_sum( df.resample('D').sum().fillna(0) )

Note that dealing with irregularly-spaced dates can be done simply and elegantly in pandas, as that is a strength of pandas over almost anything else. But you can likely find a numpy (or numba or cython) approach that trades away some simplicity for a boost in speed. Whether that's a good trade-off depends on your data size and performance requirements, of course.

For the irregularly-spaced dates, I tested on the following sample data and it seemed to work correctly. This produces a mix of missing, single, and multiple entries per date:

np.random.seed(12345)
per = 11
tidx = np.random.choice( pd.date_range('2012-12-31', periods=per, freq='D'), per )
df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx).sort_index()
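On that irregular sample, the daily-resample recipe and the numpy function should produce identical sums; a standalone check (rolling_sum restated from above; the comparison logic is mine):

```python
import numpy as np
import pandas as pd

def rolling_sum(a, n=5, k=3):
    ret = np.cumsum(a.values)
    ret[n:] = ret[n:] - ret[:-n]
    return pd.DataFrame(ret[n - 1:][-1::-k][::-1],
                        index=a[n - 1:][-1::-k][::-1].index)

np.random.seed(12345)
per = 11
tidx = np.random.choice(pd.date_range('2012-12-31', periods=per, freq='D'), per)
df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx).sort_index()

daily = df.resample('D').sum().fillna(0)
pd_way = daily.rolling(5).sum()[-1::-3][::-1].dropna()
np_way = rolling_sum(daily)
# The numpy variant simply skips the incomplete leading windows that
# pandas marks NaN, so the remaining values should line up exactly
assert (np_way[0].values == pd_way['A'].values).all()
```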

Answer 3 (score: 3)

This isn't quite perfect yet, but I have to go make fake blood for a Halloween party tonight... You should be able to see what I was going for from the comments. One of the biggest speedups is finding the window edges with np.searchsorted. It doesn't quite work yet, but I'd bet it's just some index offsets that need tweaking

import pandas as pd
import numpy as np

tidx = pd.date_range('2012-12-31', periods=11, freq='D')
df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx)

sample_freq = 3 #days
sample_width = 5 #days

sample_freq *= 86400 #seconds per day
sample_width *= 86400 #seconds per day

times = df.index.astype(np.int64)//10**9  #array of timestamps (unix time)
cumsum = np.cumsum(df.A).to_numpy()  #array of cumulative sums (could eliminate extra summation with large overlap)
mat = np.array([times, cumsum]) #could eliminate temporary times and cumsum vars

def yieldstep(mat, freq):
    normtime = ((mat[0] - mat[0,0]) / freq).astype(int) #integer numbers indicating sample number
    for i in range(max(normtime)+1):
        yield np.searchsorted(normtime, i) #yield beginning of window index

def sumwindow(mat,i , width): #i is the start of the window returned by yieldstep
    normtime  = ((mat[0,i:] - mat[0,i])/ width).astype(int) #same as before, but we norm to window width
    j = i + np.searchsorted(normtime, 1, side='left') - 1 #find the right side of the window (last index still inside it)
    start = mat[1,i-1] if i > 0 else 0 #cumulative sum just before the window start
    #return rightmost timestamp of window in seconds from unix epoch and sum of window
    return mat[0,j], mat[1,j] - start #sum of window is just end - (pre-start) because we did a cumsum earlier

windowed_sums = np.array([sumwindow(mat, i, sample_width) for i in yieldstep(mat, sample_freq)])

Answer 4 (score: 3)

It looks like you want a rolling centered window where you pick up data every n days:

[code sample missing]

Answer 5 (score: 3)

If the dataframe is sorted by date, what we actually have is an iteration over an array while computing something along the way.

Here is an algorithm that computes the sums in a single iteration over the array. To understand it, see the scan of my notes below. This is the base, unoptimized version, intended to showcase the algorithm (optimized versions for Python and Cython follow). list(<call>) takes ~500 ms for a 100k array on my system (P4). Since Python integers and ranges are relatively slow, this should benefit tremendously from being moved to the C level.

from __future__ import division
import numpy as np

#The date column is unimportant for calculations.
# I leave extracting the numbers' column from the dataframe
# and adding a corresponding element from data column to each result
# as an exercise for the reader
data = np.random.randint(100,size=100000)

def calc_trailing_data_with_interval(data,n,k):
    """Iterate over `data', computing sums of `n' trailing elements
    for each `k'th element.
    @type data: ndarray
    @param n: number of trailing elements to sum up
    @param k: interval with which to calculate sums
    """
    lim_index=len(data)-k+1

    nsums = int(np.ceil(n/k))
    sums = np.zeros(nsums,dtype=data.dtype)
    M=n%k
    Mp=k-M

    index=0
    currentsum=0

    while index<lim_index:
        for _ in range(Mp):
            #np.take is awkward, requiring a full list of indices to take
            for i in range(currentsum,currentsum+nsums-1):
                sums[i%nsums]+=data[index]
            index+=1
        for _ in range(M):
            sums+=data[index]
            index+=1
        yield sums[currentsum]
        sums[currentsum]=0    #reset the yielded slot so it can accumulate its next window
        currentsum=(currentsum+1)%nsums
  • Note that it yields the first sum at the k-th element, not the n-th (this can be changed by sacrificing elegance - a few dummy iterations before the main loop - or more elegantly by prepending extra zeros to data and discarding a few of the first sums)
  • It is easily generalized to any operation by replacing sums[slice]+=data[index] with operation(sums[slice],data[index]), where operation is a parameter and should be a mutating operation (like ndarray.__iadd__)
  • Parallelizing between any number of workers by splitting the data is trivial (if n>k, chunks after the first one should be fed extra elements at the start)
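As a reference for validating the optimized versions below, what the generator is meant to compute can be stated as a brute-force list comprehension (my own sketch, not part of the answer):

```python
import numpy as np

def brute_trailing_sums(data, n, k):
    # Sum of the (up to) n elements ending at positions k, 2k, 3k, ...;
    # the first sum covers fewer than n elements when the window would
    # extend past the start, matching the generator yielding at the k-th element
    return [int(data[max(0, end - n):end].sum())
            for end in range(k, len(data) + 1, k)]

print(brute_trailing_sums(np.arange(12), n=5, k=3))  # → [3, 15, 30, 45]
```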

To work out the algorithm, I wrote up a sample case where a decent number of sums are computed simultaneously, in order to see the patterns (click the image for full size)

[image: notes outlining the case n=11, k=3]

Optimization: pure Python

Caching range objects brings the time down to ~300 ms. Surprisingly, numpy functionality is of no help: np.take is unusable, and replacing the currentsum logic with static slices and np.roll is a regression. Even more surprisingly, the benefit of saving the output into an np.empty instead of yield is nonexistent.

def calc_trailing_data_with_interval(data,n,k):
    """Iterate over `data', computing sums of `n' trailing elements
    for each `k'th element.
    @type data: ndarray
    @param n: number of trailing elements to sum up
    @param k: interval with which to calculate sums
    """
    lim_index=len(data)-k+1

    nsums = int(np.ceil(n/k))
    sums = np.zeros(nsums,dtype=data.dtype)
    M=n%k
    Mp=k-M
    RM=range(M)     #cache for efficiency
    RMp=range(Mp)   #cache for efficiency

    index=0
    currentsum=0
    currentsum_ranges=[range(currentsum,currentsum+nsums-1)
            for currentsum in range(nsums)]     #cache for efficiency

    while index<lim_index:
        for _ in RMp:
            #np.take is unusable as it allocates another array rather than view
            for i in currentsum_ranges[currentsum]:
                sums[i%nsums]+=data[index]
            index+=1
        for _ in RM:
            sums+=data[index]
            index+=1
        yield sums[currentsum]
        sums[currentsum]=0    #reset the yielded slot so it can accumulate its next window
        currentsum=(currentsum+1)%nsums

Optimization: Cython

Statically typing everything in Cython gives an instant speedup to 150 ms. And (optionally) assuming np.int dtype, so the data can be handled at the C level, brings the time down to ~11 ms. At this point, saving into an np.empty does make a difference, shaving off an unbelievable ~6.5 ms, for a total of ~5.5 ms.

def calc_trailing_data_with_interval(np.ndarray data,int n,int k):
    """Iterate over `data', computing sums of `n' trailing elements
    for each `k'th element.
    @type data: 1-d ndarray
    @param n: number of trailing elements to sum up
    @param k: interval with which to calculate sums
    """
    if not data.ndim==1: raise TypeError("One-dimensional array required")
    cdef int lim_index=data.size-k+1

    cdef np.ndarray result = np.empty(data.size//k,dtype=data.dtype)
    cdef int rindex = 0

    cdef int nsums = int(np.ceil(float(n)/k))
    cdef np.ndarray sums = np.zeros(nsums,dtype=data.dtype)

    #optional speedup for dtype=np.int
    cdef bint use_int_buffer = data.dtype==np.int and data.flags.c_contiguous
    cdef int[:] cdata = data
    cdef int[:] csums = sums
    cdef int[:] cresult = result

    cdef int M=n%k
    cdef int Mp=k-M

    cdef int index=0
    cdef int currentsum=0

    cdef int _,i
    while index<lim_index:
        for _ in range(Mp):
            #np.take is unusable as it allocates another array rather than view
            for i in range(currentsum,currentsum+nsums-1):
                if use_int_buffer:  csums[i%nsums]+=cdata[index]    #optional speedup
                else:               sums[i%nsums]+=data[index]
            index+=1
        for _ in range(M):
            if use_int_buffer:
                for i in range(nsums): csums[i]+=cdata[index]   #optional speedup
            else:               sums+=data[index]
            index+=1

        if use_int_buffer:  cresult[rindex]=csums[currentsum]     #optional speedup
        else:               result[rindex]=sums[currentsum]
        sums[currentsum]=0    #reset the yielded slot (csums views the same buffer)
        currentsum=(currentsum+1)%nsums
        rindex+=1
    return result