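I have time series data in the format shown at the bottom of this post.
I want to resample the data to 30-minute intervals, but I need the time-in-state values split across the correct intervals accordingly (the values are in whole seconds).
Now suppose a row has a time in state of 2342 seconds (more than 30 minutes) and a start time of 08:22:00.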
User Start Date Start Time State Time in State (secs)
J.Doe 03-02-2014 08:22:00 A 2342
Once the resampling is done, I need the time in state split across the periods it spills into, like this:
User Start Date Time Period State Time in State (secs)
J.Doe 03-02-2014 08:00:00 A 480
J.Doe 03-02-2014 08:30:00 A 1800
J.Doe 03-02-2014 09:00:00 A 62
480 + 1800 + 62 = 2342
I'm completely lost on how to achieve this in pandas... I'd appreciate any help :-)
Source data format:
User Start Date Start Time State Time in State (secs)
J.Doe 03-02-2014 07:58:00 A 36
J.Doe 03-02-2014 07:59:00 A 43
J.Doe 03-02-2014 08:00:00 A 59
J.Doe 03-02-2014 08:01:00 A 32
J.Doe 03-02-2014 08:21:00 A 15
J.Doe 03-02-2014 08:22:00 B 3
J.Doe 03-02-2014 08:22:00 A 2342
J.Doe 03-02-2014 09:01:00 B 1
J.Doe 03-02-2014 09:01:00 A 375
J.Doe 03-02-2014 09:07:00 B 3
J.Doe 03-02-2014 09:07:00 A 6408
J.Doe 03-02-2014 10:54:00 B 2
J.Doe 03-02-2014 10:54:00 A 116
J.Doe 03-02-2014 10:58:00 B 2
J.Doe 03-02-2014 10:58:00 A 122
J.Doe 03-02-2014 10:58:00 A 12
J.Doe 03-02-2014 11:00:00 B 2
J.Doe 03-02-2014 11:00:00 A 3417
J.Doe 03-02-2014 11:57:00 B 3
J.Doe 03-02-2014 11:57:00 A 120
J.Doe 03-02-2014 11:59:00 C 165
J.Doe 03-02-2014 12:02:00 B 3
J.Doe 03-02-2014 12:02:00 A 7254
Answer 0 (score: 1)
I would first create Start and End columns (as datetime64 objects):
In [11]: df['Start'] = pd.to_datetime(df['Start Date'] + ' ' + df['Start Time'])
In [12]: df['End'] = df['Start'] + df['Time in State (secs)'].apply(pd.offsets.Second)
In [13]: row = df.iloc[6, :]
In [14]: row
Out[14]:
User J.Doe
Start Date 03-02-2014
Start Time 08:22:00
State A
Time in State (secs) 2342
Start 2014-03-02 08:22:00
End 2014-03-02 09:01:02
Name: 6, dtype: object
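On a recent pandas the same two columns can probably be built without one offset object per row. A minimal sketch, assuming the dates are month-first (the output above shows 03-02-2014 parsed as 2014-03-02) and using two rows taken from the source data:

```python
import pandas as pd

# Two rows from the source data, in the question's layout.
df = pd.DataFrame({
    "User": ["J.Doe", "J.Doe"],
    "Start Date": ["03-02-2014", "03-02-2014"],
    "Start Time": ["08:22:00", "09:01:00"],
    "State": ["A", "B"],
    "Time in State (secs)": [2342, 1],
})

# An explicit format avoids day-first/month-first guessing (month-first is assumed here).
df["Start"] = pd.to_datetime(
    df["Start Date"] + " " + df["Start Time"], format="%m-%d-%Y %H:%M:%S"
)
# Vectorised alternative to applying pd.offsets.Second row by row.
df["End"] = df["Start"] + pd.to_timedelta(df["Time in State (secs)"], unit="s")
```

If your dates are actually day-first, swap `%m` and `%d` in the format string.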
One way to get the split times is to resample between Start and End, merge the result with the two endpoints, and use diff:
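def split_times(row):
    # Written against pandas 0.13.1; resample() and Index `+` behave differently in later versions.
    y = pd.Series(0, [row['Start'], row['End']])
    splits = y.resample('30min').index + y.index  # this fills in middle and sorts too
    res = -splits.to_series().diff(-1)  # gap from each split point to the next one
    # Drop the leading bin-edge-to-Start piece and the trailing NaT. This assumes Start is not
    # exactly on a 30-minute boundary; rows 2, 16 and 17 in the df.apply output are, and lose their first piece.
    if len(res) > 2: res = res[1:-1]
    elif len(res) == 2: res = res[1:]
    return res.astype(int).resample('30min').astype(np.timedelta64)  # hack to resample again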
In [16]: split_times(row)
Out[16]:
2014-03-02 08:22:00 00:08:00
2014-03-02 08:30:00 00:30:00
2014-03-02 09:00:00 00:01:02
dtype: timedelta64[ns]
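The resample-and-diff trick above relies on 0.13-era behaviour and may not run as written on a recent pandas, so here is a minimal sketch of the same split done with a plain loop over the 30-minute bins. `split_interval` is a name made up for this illustration, not a pandas function, and it returns whole seconds keyed by the start of each bin rather than timedeltas:

```python
import pandas as pd

def split_interval(start, end, freq="30min"):
    """Seconds of [start, end) that fall into each `freq` bin (hypothetical helper)."""
    step = pd.Timedelta(freq)
    pieces = {}
    bin_start = start.floor(freq)  # bin that contains `start`
    while bin_start < end:
        bin_end = bin_start + step
        # Overlap between the interval and the current bin.
        overlap = min(end, bin_end) - max(start, bin_start)
        pieces[bin_start] = int(overlap.total_seconds())
        bin_start = bin_end
    return pd.Series(pieces, dtype="int64")

# The row from the question: 2342 seconds starting at 08:22:00.
parts = split_interval(pd.Timestamp("2014-03-02 08:22:00"),
                       pd.Timestamp("2014-03-02 09:01:02"))
```

For this row `parts` holds 480, 1800 and 62 seconds for the 08:00, 08:30 and 09:00 bins, the same breakdown the question asks for. A loop is slower than a vectorised approach, but it treats a start time that sits exactly on a bin boundary like any other.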
In [17]: df.apply(split_times, 1)
Out[17]:
2014-03-02 07:30:00 2014-03-02 08:00:00 2014-03-02 08:30:00 2014-03-02 09:00:00 2014-03-02 09:30:00 2014-03-02 10:00:00 2014-03-02 10:30:00 2014-03-02 11:00:00 2014-03-02 11:30:00 2014-03-02 12:00:00 2014-03-02 12:30:00 2014-03-02 13:00:00 2014-03-02 13:30:00 2014-03-02 14:00:00
0 00:00:36 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
1 00:00:43 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
2 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
3 NaT 00:00:32 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
4 NaT 00:00:15 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
5 NaT 00:00:03 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
6 NaT 00:08:00 00:30:00 00:01:02 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
7 NaT NaT NaT 00:00:01 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
8 NaT NaT NaT 00:06:15 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
9 NaT NaT NaT 00:00:03 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
10 NaT NaT NaT 00:23:00 00:30:00 00:30:00 00:23:48 NaT NaT NaT NaT NaT NaT NaT
11 NaT NaT NaT NaT NaT NaT 00:00:02 NaT NaT NaT NaT NaT NaT NaT
12 NaT NaT NaT NaT NaT NaT 00:01:56 NaT NaT NaT NaT NaT NaT NaT
13 NaT NaT NaT NaT NaT NaT 00:00:02 NaT NaT NaT NaT NaT NaT NaT
14 NaT NaT NaT NaT NaT NaT 00:02:00 00:00:02 NaT NaT NaT NaT NaT NaT
15 NaT NaT NaT NaT NaT NaT 00:00:12 NaT NaT NaT NaT NaT NaT NaT
16 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
17 NaT NaT NaT NaT NaT NaT NaT NaT 00:26:57 NaT NaT NaT NaT NaT
18 NaT NaT NaT NaT NaT NaT NaT NaT 00:00:03 NaT NaT NaT NaT NaT
19 NaT NaT NaT NaT NaT NaT NaT NaT 00:02:00 NaT NaT NaT NaT NaT
20 NaT NaT NaT NaT NaT NaT NaT NaT 00:01:00 00:01:45 NaT NaT NaT NaT
21 NaT NaT NaT NaT NaT NaT NaT NaT NaT 00:00:03 NaT NaT NaT NaT
22 NaT NaT NaT NaT NaT NaT NaT NaT NaT 00:28:00 00:30:00 00:30:00 00:30:00 00:02:54
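The wide frame above has one column per period, while the question asks for one row per user, period and state. A rough sketch of getting that long layout directly, assuming the frame already has `Start` and `End` columns; `to_long`, the `Secs` input column and the `Seconds` output column are names invented for this example:

```python
import pandas as pd

def to_long(df, freq="30min"):
    """Sum seconds per (User, Time Period, State); expects Start and End columns."""
    step = pd.Timedelta(freq)
    records = []
    for row in df.itertuples(index=False):
        bin_start = row.Start.floor(freq)
        while bin_start < row.End:
            bin_end = bin_start + step
            overlap = min(row.End, bin_end) - max(row.Start, bin_start)
            records.append((row.User, bin_start, row.State, int(overlap.total_seconds())))
            bin_start = bin_end
    long = pd.DataFrame(records, columns=["User", "Time Period", "State", "Seconds"])
    return long.groupby(["User", "Time Period", "State"], as_index=False)["Seconds"].sum()

# Three rows from the source data (08:21 A 15, 08:22 B 3, 08:22 A 2342).
df = pd.DataFrame({
    "User": ["J.Doe"] * 3,
    "State": ["A", "B", "A"],
    "Start": pd.to_datetime(["2014-03-02 08:21:00", "2014-03-02 08:22:00",
                             "2014-03-02 08:22:00"]),
    "Secs": [15, 3, 2342],
})
df["End"] = df["Start"] + pd.to_timedelta(df["Secs"], unit="s")
result = to_long(df)
```

Here state A gets 15 + 480 = 495 seconds in the 08:00 period, 1800 in 08:30 and 62 in 09:00, and state B gets 3 seconds in 08:00. Whether you want the pieces summed per period or kept as separate rows is a design choice; drop the `groupby` to keep one row per piece.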
To replace the NaT values with 0, it looks like you have to do some fiddling in 0.13.1 (this may already be fixed in master, otherwise it's a bug):
res2 = df.apply(split_times, 1).astype(int)
# hack to replace NaTs with 0
res2.where(res2 != -9223372036854775808, 0).astype(np.timedelta64)
# to just get the seconds
seconds = res2.where(res2 != -9223372036854775808, 0) / 10 ** 9
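The magic number above is the int64 minimum, which is how NaT is stored internally. On newer pandas versions the integer round trip should not be needed. A small sketch, assuming a wide frame of timedelta columns like the one shown earlier (the two-row `wide` frame here is a cut-down stand-in built by hand):

```python
import pandas as pd

# Stand-in for part of rows 6 and 7 of the wide result: timedeltas with NaT gaps.
wide = pd.DataFrame({
    pd.Timestamp("2014-03-02 08:00"): pd.to_timedelta(["00:08:00", pd.NaT]),
    pd.Timestamp("2014-03-02 08:30"): pd.to_timedelta(["00:30:00", pd.NaT]),
    pd.Timestamp("2014-03-02 09:00"): pd.to_timedelta(["00:01:02", "00:00:01"]),
})

# NaT becomes NaN once converted to seconds, and NaN can be filled with 0 directly.
seconds = wide.apply(lambda col: col.dt.total_seconds()).fillna(0)
```

The result is a float frame; add `.astype(int)` at the end if whole seconds are wanted.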