I have a time series with irregularly spaced samples. To get useful data out of it, I need to find 10-minute periods of roughly evenly spaced samples (which I've defined as the average time delta between 2 samples being less than 20 seconds).
Example data (for this example I'll use 10-second periods with an average 2s delta):
| timestamp           | speed |
|---------------------|-------|
| 2010-01-01 09:20:12 | 10    |
| 2010-01-01 09:20:14 | 14    |
| 2010-01-01 09:20:16 | 12    |
| 2010-01-01 09:20:27 | 18    |
| 2010-01-01 09:20:28 | 19    |
| 2010-01-01 09:20:29 | 19    |
The result I'm hoping for is the grouping below. Note that the second group is not included, because its samples are clustered at the end of the 10s period (27, 28, 29), which means the implied extra 7s gap pushes the average delta to 3s.
| timestamp           | avg | std  | std_over_avg |
|---------------------|-----|------|--------------|
| 2010-01-01 09:20:10 | 12  | 1.63 | 0.136        |
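To make the exclusion rule concrete, here is a stdlib-only sketch of the delta calculation for the clustered samples; the 09:20:20 window start and the 7s/1s/1s deltas follow from the example above:

```python
from datetime import datetime

# Samples in the 10s window starting at 09:20:20, clustered at its end.
period_start = datetime(2010, 1, 1, 9, 20, 20)
samples = [datetime(2010, 1, 1, 9, 20, s) for s in (27, 28, 29)]

# Count the implied gap from the window start to the first sample,
# then the deltas between consecutive samples: 7s, 1s, 1s.
deltas = [(samples[0] - period_start).total_seconds()] + [
    (b - a).total_seconds() for a, b in zip(samples, samples[1:])
]
avg_delta = sum(deltas) / len(deltas)  # (7 + 1 + 1) / 3 = 3.0
```

Since 3s exceeds the 2s target delta for the example, the window is rejected.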
**Edit**
I think I combined several things in my question (some of them incorrectly), so I'd like to correct/clarify what I'm looking for.
Going back to the example data, I want to split it into irregular 10s periods; that is, if there is a gap in the data, the next 10s period should start at the timestamp of the next available record. (Please ignore the earlier requirement about evenly spaced samples; it turns out I misunderstood that requirement, and I can always filter on it at a later stage if needed.) So I want something like this:
| period                                    | count | avg  | std   | std_over_avg |
|-------------------------------------------|-------|------|-------|--------------|
| 2010-01-01 09:20:12 - 2010-01-01 09:20:22 | 3     | 12   | 1.63  | 0.136        |
| 2010-01-01 09:20:27 - 2010-01-01 09:20:37 | 3     | 18.6 | 0.577 | 0.031        |
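For the small example, the irregular split can be sketched with a plain loop over the sorted index plus a `groupby` (assuming pandas; `assign_irregular_periods` is a hypothetical helper, not from the question). Note the two tables above use different std conventions (the first matches a population std, ddof=0; the second matches the pandas default sample std, ddof=1), so the sketch's std values follow the pandas default:

```python
from datetime import timedelta

import pandas as pd

# Example data from the question.
df = pd.DataFrame(
    {"speed": [10, 14, 12, 18, 19, 19]},
    index=pd.to_datetime([
        "2010-01-01 09:20:12", "2010-01-01 09:20:14", "2010-01-01 09:20:16",
        "2010-01-01 09:20:27", "2010-01-01 09:20:28", "2010-01-01 09:20:29",
    ]),
)

def assign_irregular_periods(df, width=timedelta(seconds=10)):
    """Label each row with a period number; a new period starts at the
    first record falling outside the previous period's window."""
    labels, period_start, period_num = [], None, -1
    for ts in df.index:  # assumes the index is sorted ascending
        if period_start is None or ts >= period_start + width:
            period_start = ts
            period_num += 1
        labels.append(period_num)
    return pd.Series(labels, index=df.index)

stats = df.groupby(assign_irregular_periods(df))["speed"].agg(["count", "mean", "std"])
stats["std_over_avg"] = stats["std"] / stats["mean"]
```

This yields two periods starting at 09:20:12 and 09:20:27, each with 3 records, matching the table above.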
Answer 0 (score: 0)
I found a way to achieve most of what I wanted, but it's ugly and slow. Hopefully someone can use this as a starting point to develop something more useful:
```python
import datetime

group_num = 0
cached_future_time = None

def group_by_time(df, ind):
    global group_num
    global cached_future_time
    curr_time = ind
    future_time = ind + datetime.timedelta(minutes=10)
    # Assume records are sorted chronologically ascending for this to work.
    end = df.index.get_loc(future_time, method='pad')
    start = df.index.get_loc(curr_time)
    num_records = end - start
    if cached_future_time is not None and curr_time < cached_future_time:
        pass
    elif cached_future_time is not None and curr_time >= cached_future_time:
        group_num += 1
        # Only advance the cached_future_time mark if we have sufficient
        # data points to make this group useful.
        if num_records >= 30:
            cached_future_time = future_time
    elif cached_future_time is None:
        cached_future_time = future_time
    return group_num

grp = df.groupby(lambda x: group_by_time(df, x))
```
**Edit**
OK, I found a more pandas-ic way that is also much faster than the ugly loop above. My mistake in the answer above was thinking I had to do most of the work of computing the groups inside the groupby function (and thinking there was no way to intelligently apply such labelling across all the rows).
```python
import datetime

import numpy as np

# Assumes `records` has a DatetimeIndex plus "timestamp" and "speed" columns.
records["group_num"] = np.nan

# Add 10min to each timestamp and shift the values in that column 30 records
# into the future. We can then find all the timestamps that are 30 records
# newer but still within 10min of the original timestamp (ensuring that we
# have a 10min group with at least 30 records).
records["future"] = records["timestamp"] + datetime.timedelta(minutes=10)
starts = list(records[(records["timestamp"] <= records["future"].shift(30))
                      & records["group_num"].isnull()].index)

group_num = 1
# For each of those starting timestamps, grab a slice up to 10min in the future
# and apply a group number.
for start in starts:
    nulls = records.loc[start:start + datetime.timedelta(minutes=10), "group_num"].isnull()
    if nulls.sum() >= 30:
        # Only apply group_num to null values so that we get disjoint groups
        # (no overlaps). Assign through records.loc rather than a sliced
        # copy so that the values are actually written back to the frame.
        records.loc[nulls[nulls].index, "group_num"] = group_num
        group_num += 1
```
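Once `group_num` is filled in, the summary table from the edit falls out of a plain `groupby`. A sketch on a hand-labelled stand-in frame (the labels mimic what the loop above produces; ungrouped rows keep NaN and are dropped by `groupby` automatically):

```python
import numpy as np
import pandas as pd

# Stand-in for `records` after grouping; speeds match the question's example,
# the trailing row is an illustrative ungrouped record.
records = pd.DataFrame({
    "speed": [10, 14, 12, 18, 19, 19, 7],
    "group_num": [1, 1, 1, 2, 2, 2, np.nan],
})

stats = records.groupby("group_num")["speed"].agg(["count", "mean", "std"])
stats["std_over_avg"] = stats["std"] / stats["mean"]
```

With the real data the same two lines produce one row per 10-minute group.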