I have a pandas DataFrame like this (the real DataFrame has several hundred thousand rows):
td
2011-08-14 09:09:14 00:00:13
2011-08-14 09:09:27 00:02:25
2011-08-14 09:11:52 00:00:05
2011-08-14 09:11:57 00:20:41
2011-08-14 09:32:38 00:03:05
2011-08-14 09:35:43 00:05:44
2011-08-14 09:41:27 00:07:07
2011-08-14 09:48:34 00:01:51
2011-08-14 09:50:25 00:06:08
2011-08-14 09:56:33 01:08:39
2011-08-14 11:05:12 00:04:51
2011-08-14 11:10:03 00:06:36
2011-08-14 11:16:39 00:00:13
2011-08-14 11:16:52 00:18:25
2011-08-14 11:35:17 00:00:05
2011-08-14 11:35:22 00:24:24
2011-08-14 11:59:46 00:27:44
Now I want to resample the index to hourly bins, which gives:
2011-08-14 09:00:00 01:55:58
2011-08-14 10:00:00 00:00:00
2011-08-14 11:00:00 01:22:18
Freq: H, Name: td, dtype: timedelta64[ns]
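But I need the resulting timedeltas to be aligned to the frequency, so in this example to one hour: no bin may hold more than one hour, and any excess carries over into the next bin. The expected result should look like this: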
2011-08-14 09:00:00 01:00:00
2011-08-14 10:00:00 00:55:58 # <- carryover from previous row
2011-08-14 11:00:00 01:00:00
2011-08-14 12:00:00 00:22:18 # <- carryover from previous row
Freq: H, Name: td, dtype: timedelta64[ns]
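To make the carry-over arithmetic explicit, here is a minimal sketch for the first two bins. It uses the standard library's `datetime.timedelta` rather than pandas, and the variable names are illustrative only.

```python
from datetime import timedelta

ONE_HOUR = timedelta(hours=1)

# Hourly sums taken from the resampled output above
total_09 = timedelta(hours=1, minutes=55, seconds=58)
total_10 = timedelta(0)

# The 09:00 bin keeps at most one hour; the excess moves on to 10:00
kept_09 = min(total_09, ONE_HOUR)
carry_09 = total_09 - kept_09
kept_10 = min(total_10 + carry_09, ONE_HOUR)

print(kept_09, carry_09, kept_10)  # prints: 1:00:00 0:55:58 0:55:58
```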
Here is a simple code snippet:
import pandas as pd
index = [
    '2011-08-14 09:09:14',
    '2011-08-14 09:09:27',
    '2011-08-14 09:11:52',
    '2011-08-14 09:11:57',
    '2011-08-14 09:32:38',
    '2011-08-14 09:35:43',
    '2011-08-14 09:41:27',
    '2011-08-14 09:48:34',
    '2011-08-14 09:50:25',
    '2011-08-14 09:56:33',
    '2011-08-14 11:05:12',
    '2011-08-14 11:10:03',
    '2011-08-14 11:16:39',
    '2011-08-14 11:16:52',
    '2011-08-14 11:35:17',
    '2011-08-14 11:35:22',
    '2011-08-14 11:59:46',
    '2011-08-14 11:59:46'
]
# Durations in nanoseconds
data = [
    13000000000,
    145000000000,
    5000000000,
    1241000000000,
    185000000000,
    344000000000,
    427000000000,
    111000000000,
    368000000000,
    4119000000000,
    291000000000,
    396000000000,
    13000000000,
    1105000000000,
    5000000000,
    1464000000000,
    1664000000000,
    0
]
df = pd.DataFrame(data, columns=['td'], index=pd.DatetimeIndex(index), dtype='timedelta64[ns]')
print(df)
print(df.resample('H').td.sum())
Answer 0 (score: 0)
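Here is my solution. Basically, at each step, add the carry-over from the previous bin (its timedelta minus 1 hour) to the current bin, and cap the previous bin at 1 hour.
Finally, if the last timedelta is more than 1 hour, you may also need to extend the list.
The code could be DRYer, but this should put you on the right track: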
resampled = df.resample('H').td.sum()
one_hour = pd.Timedelta('1 hours')

# Initialise output as a list of pd.Timedelta; the values are modified in place
out = resampled.tolist()
extended_idx = resampled.index.tolist()

def carry_over(td):
    # Carry-over is whatever exceeds 1 hour (zero if the bin is not full).
    # Comparing whole timedeltas also handles sums of one day or more.
    return max(td - one_hour, pd.Timedelta(0))

# Carry over
for idx in range(1, len(out)):
    prev = out[idx-1]
    out[idx] += carry_over(prev)
    out[idx-1] = min(prev, one_hour)

# Extend the list while the last time delta is more than 1 hour
while out[-1] > one_hour:
    extended_idx.append(extended_idx[-1] + one_hour)
    out.append(carry_over(out[-1]))
    out[-2] = min(out[-2], one_hour)

out = pd.Series(out, index=extended_idx)
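The same cap-and-carry idea can also be written as a single pass that tracks one running carry. The following is only a sketch under stated assumptions: the function name `cap_and_carry` and its parameters are hypothetical, the bins are assumed to be evenly spaced with a width equal to `limit`, and the input is assumed to contain no `NaT` values. The hourly sums are typed in by hand from the question's output instead of being recomputed.

```python
import pandas as pd

def cap_and_carry(totals, limit=pd.Timedelta(hours=1)):
    """Cap each bin at `limit` and push the excess into the next bin."""
    values = totals.tolist()        # list of pd.Timedelta
    index = totals.index.tolist()   # list of pd.Timestamp
    zero = pd.Timedelta(0)
    carry = zero
    i = 0
    # Keep going past the last bin for as long as something is left to carry
    while i < len(values) or carry > zero:
        if i == len(values):
            index.append(index[-1] + limit)
            values.append(zero)
        total = values[i] + carry
        values[i] = min(total, limit)
        carry = total - values[i]
        i += 1
    return pd.Series(values, index=index, name=totals.name)

# Hourly sums from the question (09:00, 10:00, 11:00)
totals = pd.Series(
    pd.to_timedelta(["01:55:58", "00:00:00", "01:22:18"]),
    index=pd.DatetimeIndex(
        ["2011-08-14 09:00", "2011-08-14 10:00", "2011-08-14 11:00"]
    ),
    name="td",
)
result = cap_and_carry(totals)
print(result)
```

Because the carry is propagated in the same pass, no separate extension step is needed: the loop simply appends bins until the carry reaches zero.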
Answer 1 (score: 0)
A semi-vectorized approach:
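# Hourly sums; empty bins become zero so that carry-over can be added to them
df2 = df.resample('H').td.sum().fillna(pd.Timedelta(0))
limit = pd.Timedelta(hours=1)
# Repeat until no bin exceeds the limit (carry-over can cascade across bins)
while (df2 > limit).any():
    df3 = df2.shift()  # df3 holds, for each bin, the previous bin's value
    last = df2.index[-1]
    # If the last bin overflows, append a new bin holding its excess
    if df2[last] > limit:
        df2[last + limit] = df2[last] - limit
        df3[last + limit] = df2[last] - limit
    carry_over = df3 > limit  # bins whose predecessor overflowed
    df2.loc[df2 > limit] = limit  # cap the overflowing bins
    df2[carry_over] = df2[carry_over] + df3.loc[carry_over] - limit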
Result (df2):
2011-08-14 09:00:00 01:00:00
2011-08-14 10:00:00 00:55:58
2011-08-14 11:00:00 01:00:00
2011-08-14 12:00:00 00:22:18
Freq: H, Name: td, dtype: timedelta64[ns]
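For several hundred thousand rows, the `while` loop can be avoided altogether. The sketch below is not taken from either answer: it swaps in a cumulative-sum formulation, where the carry leaving each bin equals the running surplus `cumsum(x - limit)` minus its running minimum (floored at zero). The function name `cap_and_carry_cumsum` is hypothetical, and the code assumes a non-empty Series of evenly spaced bins of width `limit` with no `NaT` values.

```python
import numpy as np
import pandas as pd

def cap_and_carry_cumsum(totals, limit=pd.Timedelta(hours=1)):
    """Cap every bin at `limit` using cumulative sums instead of a per-bin loop."""
    lim = limit.value  # limit in integer nanoseconds
    x = totals.to_numpy().astype("timedelta64[ns]").astype("int64")
    d = np.cumsum(x - lim)                           # running surplus over capacity
    floor = np.minimum.accumulate(np.minimum(d, 0))  # lowest surplus so far, at most 0
    carry = d - floor                                # excess leaving each bin
    prev_carry = np.concatenate(([0], carry[:-1]))   # excess entering each bin
    capped = x + prev_carry - carry                  # what each bin keeps
    # Spread whatever leaves the last bin over additional bins
    n_full, rest = divmod(int(carry[-1]), lim)
    tail = [lim] * n_full + ([rest] if rest else [])
    values = np.concatenate((capped, np.array(tail, dtype="int64")))
    extra = pd.DatetimeIndex(
        [totals.index[-1] + limit * (k + 1) for k in range(len(tail))]
    )
    index = totals.index.append(extra)
    return pd.Series(pd.to_timedelta(values, unit="ns"), index=index, name=totals.name)

# Hourly sums from the question (09:00, 10:00, 11:00)
totals = pd.Series(
    pd.to_timedelta(["01:55:58", "00:00:00", "01:22:18"]),
    index=pd.DatetimeIndex(
        ["2011-08-14 09:00", "2011-08-14 10:00", "2011-08-14 11:00"]
    ),
    name="td",
)
vectorized = cap_and_carry_cumsum(totals)
print(vectorized)
```

The only remaining Python-level iteration is over the bins appended after the last one, which is usually a handful.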