Computing TimeDelta within a specific frequency in a pandas DataFrame

Time: 2018-03-02 11:51:43

Tags: python pandas

I have a pandas DataFrame like this (the actual DataFrame has several hundred thousand rows):

                            td
2011-08-14 09:09:14   00:00:13
2011-08-14 09:09:27   00:02:25
2011-08-14 09:11:52   00:00:05
2011-08-14 09:11:57   00:20:41
2011-08-14 09:32:38   00:03:05
2011-08-14 09:35:43   00:05:44
2011-08-14 09:41:27   00:07:07
2011-08-14 09:48:34   00:01:51
2011-08-14 09:50:25   00:06:08
2011-08-14 09:56:33   01:08:39
2011-08-14 11:05:12   00:04:51
2011-08-14 11:10:03   00:06:36
2011-08-14 11:16:39   00:00:13
2011-08-14 11:16:52   00:18:25
2011-08-14 11:35:17   00:00:05
2011-08-14 11:35:22   00:24:24
2011-08-14 11:59:46   00:27:44

Now I want to resample the index to an hourly frequency, which gives:

2011-08-14 09:00:00   01:55:58
2011-08-14 10:00:00   00:00:00
2011-08-14 11:00:00   01:22:18
Freq: H, Name: td, dtype: timedelta64[ns]

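However, I need the resulting timedeltas to fit within the frequency, which in this example means at most one hour per bin, with any excess carried over into the next hour. The desired result should look like this: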

2011-08-14 09:00:00   01:00:00
2011-08-14 10:00:00   00:55:58    # <- carryover from previous row
2011-08-14 11:00:00   01:00:00
2011-08-14 12:00:00   00:22:18    # <- carryover from previous row
Freq: H, Name: td, dtype: timedelta64[ns]
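
To spell out the arithmetic behind the desired result, here is a minimal sketch using the three hourly sums shown above. The variable names (`h09`, `carry_09`, and so on) are only for illustration and are not part of my actual code.

```python
import pandas as pd

limit = pd.Timedelta(hours=1)
zero = pd.Timedelta(0)

# Hourly sums as returned by the plain resample
h09 = pd.Timedelta("01:55:58")
h10 = pd.Timedelta("00:00:00")
h11 = pd.Timedelta("01:22:18")

# 09:00 holds at most one hour; the rest spills into 10:00
carry_09 = max(h09 - limit, zero)        # 00:55:58
h10_total = h10 + carry_09               # 00:55:58, under the limit, nothing carried

# 11:00 is capped as well; its excess needs a new 12:00 row
carry_11 = max(h11 - limit, zero)        # 00:22:18

print(min(h09, limit), h10_total, min(h11, limit), carry_11)
```

The total time is the same before and after; it is only spread over more rows.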

Here is a simple code snippet:

import pandas as pd

index = [
    '2011-08-14 09:09:14',
    '2011-08-14 09:09:27',
    '2011-08-14 09:11:52',
    '2011-08-14 09:11:57',
    '2011-08-14 09:32:38',
    '2011-08-14 09:35:43',
    '2011-08-14 09:41:27',
    '2011-08-14 09:48:34',
    '2011-08-14 09:50:25',
    '2011-08-14 09:56:33',
    '2011-08-14 11:05:12',
    '2011-08-14 11:10:03',
    '2011-08-14 11:16:39',
    '2011-08-14 11:16:52',
    '2011-08-14 11:35:17',
    '2011-08-14 11:35:22',
    '2011-08-14 11:59:46',
    '2011-08-14 11:59:46'
    ]

data = [
       13000000000,
      145000000000,
        5000000000,
     1241000000000,
      185000000000,
      344000000000,
      427000000000,
      111000000000,
      368000000000,
     4119000000000,
      291000000000,
      396000000000,
       13000000000,
     1105000000000,
        5000000000,
     1464000000000,
     1664000000000,
                 0
    ]

df = pd.DataFrame(data, columns=['td'], index=pd.DatetimeIndex(index), dtype='timedelta64[ns]')

print(df)
print(df.resample('H').td.sum())

2 answers:

Answer 0 (score: 0)

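Here is my solution. Basically, at each step add the carry-over from the previous row (its timedelta minus 1 hour) to the current row, and cap the previous row at 1 hour.

Finally, if the last timedelta exceeds 1 hour, you may also need to extend the list.

The code could be DRYer, but this should put you on the right track: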

resampled = df.resample('H').td.sum()

# Initialise output. tolist() returns a new list of pd.Timedelta objects,
# so it can be modified in place without touching `resampled`
out = resampled.tolist()
extended_idx = resampled.index.tolist()

def carry_over(td):
    # Carry-over is whatever exceeds 1 hour (zero if td is at most 1 hour).
    # Plain subtraction also stays correct for values of a day or more,
    # where checking only the hours component would lose the days.
    return max(td - pd.Timedelta('1 hours'), pd.Timedelta(0))

# Carry over
for idx in range(1, len(out)):

    prev = out[idx-1]

    out[idx] += carry_over(prev)
    out[idx-1] = min(prev, pd.Timedelta('1 hours'))

# Extend the list if last time delta is more than 1 hour
done = out[-1] <= pd.Timedelta('1 hours')

while not done:

    extended_idx.append(extended_idx[-1] + pd.Timedelta('1 hours'))

    out.append(carry_over(out[-1]))
    out[-2] = min(out[-2], pd.Timedelta('1 hours'))

    if out[-1] <= pd.Timedelta('1 hours'):
        done = True

out = pd.Series(out, index=extended_idx)
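
As a possible way to make this DRYer, the two loops can be folded into one pass that keeps a running carry. This is only a sketch under some assumptions: `cap_with_carry` is a hypothetical helper name, the input is an hourly Series of timedeltas without NaT values, and the small `hourly` Series below just reproduces the three resampled sums from the question.

```python
import pandas as pd

def cap_with_carry(hourly, limit=pd.Timedelta(hours=1)):
    """Cap each bin at `limit` and push the excess into the following bins."""
    zero = pd.Timedelta(0)
    index = list(hourly.index)
    values = []
    carry = zero
    for total in hourly.tolist():
        total = total + carry        # add what the previous bins could not hold
        kept = min(total, limit)     # this bin holds at most `limit`
        carry = total - kept         # the rest moves on to the next bin
        values.append(kept)
    # Append extra bins until the remaining carry is used up
    while carry > zero:
        index.append(index[-1] + limit)
        kept = min(carry, limit)
        carry -= kept
        values.append(kept)
    return pd.Series(values, index=pd.DatetimeIndex(index), name=hourly.name)

# The three hourly sums from the question (09:00, 10:00, 11:00)
hourly = pd.Series(
    pd.to_timedelta(["01:55:58", "00:00:00", "01:22:18"]),
    index=pd.date_range("2011-08-14 09:00", periods=3, freq=pd.Timedelta(hours=1)),
    name="td",
)
result = cap_with_carry(hourly)
print(result)
```

With the real data you would pass `df.resample('H').td.sum()` instead of `hourly`. It is still a Python-level loop, but it runs over the hourly bins rather than the original rows, so it should stay small even for a large DataFrame.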

Answer 1 (score: 0)

A semi-vectorized approach

df2 = df.resample('H').td.sum().fillna(pd.Timedelta(0))
limit = pd.Timedelta('1H')

while((df2 > limit).any()):
    df3 = df2.shift()
    last = df2.index[-1]
    if  df2[last] > limit:
        df2[last + limit] = df2[last] - limit
        df3[last + limit] = df2[last] - limit
    carry_over = df3 > limit
    df2.loc[df2 > limit] = limit
    df2[carry_over] = df2[carry_over] + df3.loc[carry_over] - limit

Output (df2):

2011-08-14 09:00:00   01:00:00
2011-08-14 10:00:00   00:55:58
2011-08-14 11:00:00   01:00:00
2011-08-14 12:00:00   00:22:18
Freq: H, Name: td, dtype: timedelta64[ns]
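
Whichever approach you use, a quick sanity check on the result may be worthwhile. The sketch below assumes three properties should hold: the total time is unchanged, no bin exceeds the limit, and time only ever moves to later bins. `check_redistribution` is a hypothetical helper, and the `before` and `after` Series are typed in by hand from the outputs shown on this page.

```python
import pandas as pd

limit = pd.Timedelta(hours=1)

# Hourly sums before redistribution (plain resample output)
before = pd.Series(
    pd.to_timedelta(["01:55:58", "00:00:00", "01:22:18"]),
    index=pd.date_range("2011-08-14 09:00", periods=3, freq=limit),
)
# Hourly values after redistribution (the desired result)
after = pd.Series(
    pd.to_timedelta(["01:00:00", "00:55:58", "01:00:00", "00:22:18"]),
    index=pd.date_range("2011-08-14 09:00", periods=4, freq=limit),
)

def check_redistribution(before, after, limit):
    # 1. No time is lost or invented
    total_preserved = after.sum() == before.sum()
    # 2. No bin holds more than the limit
    capped = bool((after <= limit).all())
    # 3. Time is only pushed to later bins: the running total of the
    #    result never gets ahead of the running total of the input
    padded = before.reindex(after.index, fill_value=pd.Timedelta(0))
    forward_only = bool((after.cumsum() <= padded.cumsum()).all())
    return total_preserved and capped and forward_only

print(check_redistribution(before, after, limit))
```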