Question

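The data is a DataFrame with 50 million rows and a time-series index. I am on pandas 0.18.0, which does not implement rolling with a time-delta window. Is there a way to rewrite this to make it faster?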

data.index.map(lambda x: data.loc[x-pd.Timedelta(hours=1):x,'people'].count())

The data looks like this:

data.loc[:5,'people']

09/15/2017 10:00:01.123456    3
09/15/2017 10:00:01.512345    5
09/15/2017 10:00:03.010101    10
09/15/2017 12:00:10.989898    2
09/15/2017 14:00:00.000000    4

Answer 1

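I haven't tested this, but I've had great success with numba in the past. There are some precompilation options you can look up in the official docs that would eliminate the compilation delay on the first call. You can also pass cache=True as a keyword to the jit() decorator to save the compiled version between runs.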

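Rolling window functions are relatively simple and lend themselves to fast loop iteration in compiled code. This function should give the rolling sum of "people" over a window less than or equal to 1 hour wide (3600.0 seconds). The inputs are given as numpy arrays: "time" has dtype=np.float64 and contains unix timestamps in seconds since the epoch, and "people" is an array of dtype=np.int32.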

import numpy as np
from numba import jit

@jit("i4[:](f8[:], i4[:], f8)") #returns a 32 bit int array with inputs: (64 bit float array, 32 bit int array, 64 bit float)
def rolling_sum(time, people, width=3600.0):
    #assuming time is sorted..
    left = 0 #left side of the window
    out = np.empty_like(people) #int32 output, matching the declared i4[:] return type
    running_sum = people[0]
    out[0] = running_sum #first entry

    for right in range(1,len(time)): #right side of the window
        #add next value from "people" to running sum
        running_sum += people[right]

        #move left edge to the right until window is less or equal to "width" seconds wide
        while time[right] - time[left] > width:
            #subtract from running sum what's no longer in the window
            running_sum -= people[left]
            #shrink the window
            left += 1
        #record running sum at window position
        out[right] = running_sum
    return out
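
The two input arrays described above could be built from a DataFrame like the one in the question roughly as follows. This is only a sketch: it assumes the index is a DatetimeIndex that is already sorted, and the sample rows are copied from the question.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question.
index = pd.to_datetime(
    [
        "09/15/2017 10:00:01.123456",
        "09/15/2017 10:00:01.512345",
        "09/15/2017 10:00:03.010101",
        "09/15/2017 12:00:10.989898",
        "09/15/2017 14:00:00.000000",
    ],
    format="%m/%d/%Y %H:%M:%S.%f",
)
data = pd.DataFrame({"people": [3, 5, 10, 2, 4]}, index=index)

# datetime64[ns] -> integer nanoseconds since the epoch -> float seconds.
time = data.index.values.astype("datetime64[ns]").astype(np.int64) / 1e9

# The explicit numba signature expects a 32 bit int array.
people = data["people"].values.astype(np.int32)
```

Both arrays can then be passed to the compiled function, with the window width given explicitly in seconds.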

Edit:

A rolling count is even easier than a rolling sum:

@jit("i4[:](f8[:], f8)")
def rolling_count(time, width=3600.0):
    left = 0
    out = np.empty(len(time), dtype=np.int32) #int32 output, matching the declared i4[:] return type
    out[0] = 1
    for right in range(1,len(time)):
        while time[right] - time[left] > width: #rows exactly "width" seconds old stay in the window
            left += 1
        out[right] = right - left + 1 #addition of 1 accounts for inclusive range
    return out
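
As a cross-check, or a fallback where numba is not installed, a similar count can be sketched with plain NumPy using np.searchsorted. The function name rolling_count_searchsorted and its side parameter are made up for this illustration, and the time array is assumed to be sorted. With side="left" a row exactly width seconds old stays in the window, as it does with the inclusive label slice in the question; side="right" drops it.

```python
import numpy as np

def rolling_count_searchsorted(time, width=3600.0, side="left"):
    """Count rows in the trailing window ending at each timestamp.

    Hypothetical helper for illustration; `time` must be sorted ascending.
    """
    time = np.asarray(time, dtype=np.float64)
    # Index of the first row that is still inside each window.
    left = np.searchsorted(time, time - width, side=side)
    # +1 because the row at the right edge is itself counted.
    return (np.arange(len(time)) - left + 1).astype(np.int32)

# The five sample timestamps from the question, as seconds since the epoch.
time = np.array([
    1505469601.123456,
    1505469601.512345,
    1505469603.010101,
    1505476810.989898,
    1505484000.0,
])
counts = rolling_count_searchsorted(time)
print(counts)  # [1 2 3 1 1]
```

This trades the single compiled pass for one binary search per row, so it is a convenient reference for validating results on a small slice rather than a guaranteed speed-up.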

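The precision of the timestamps in the original dataset will determine the accuracy of the count. I wrote the functions to use 64-bit floats for the timestamps, which guarantees (until additional calculations are done) 15 significant digits in base 10. The current time on my machine (time.time()) reads 1505505879.4849467 (EDT: UTC-4). However, based on floating-point precision, only 1505505879.48494 can be considered accurate, and possibly less given how often the operating system updates the system clock.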
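
To make the precision point concrete, the gap between neighbouring float64 values at this magnitude can be inspected with np.spacing. This is a small illustration using the timestamp quoted above, not a statement about any particular dataset.

```python
import numpy as np

now = 1505505879.4849467  # the time.time() value quoted above

# Distance from `now` to the next representable float64.
step = np.spacing(now)
print(step)  # about 2.4e-07 seconds

# An offset well below that gap is lost; a microsecond offset survives.
print(now + 1e-7 == now)  # True
print(now + 1e-6 == now)  # False
```

So float seconds can still separate microsecond timestamps like the ones in the question, but not nanosecond ones.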
