Pandas: group by time periods that start at different times

Time: 2016-02-10 01:07:48

Tags: python pandas

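I have a time series with an irregular sampling frequency. To get useful data out of it, I need to find 10-minute periods of roughly evenly spaced samples (which I have defined as an average time delta between two samples of less than 20 seconds).

Sample data (for the sake of this example, I will use 10-second intervals with an average delta of 2s):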

| timestamp             | speed |
| 2010-01-01 09:20:12   | 10    |
| 2010-01-01 09:20:14   | 14    |
| 2010-01-01 09:20:16   | 12    |
| 2010-01-01 09:20:27   | 18    |
| 2010-01-01 09:20:28   | 19    |
| 2010-01-01 09:20:29   | 19    |

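The result I am hoping for is a grouping like the one below. Note that the second group is not included, because its samples are clustered at the end of the 10s period (27, 28, 29): the implied extra 7s gap at the start of the period brings the average delta to 3s.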

| timestamp             | avg   | std  | std_over_avg |
| 2010-01-01 09:20:10   | 12    | 1.63 | 0.136        |
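
One possible reading of the requirement above, sketched with fixed clock-aligned bins. This is only an illustration under stated assumptions: the "average delta" of a bin is taken as (last sample time minus bin start) divided by the sample count, which reproduces the 2s and 3s figures in the example, and the 2s threshold is the one implied by the example rather than a stated rule. The names `window`, `max_mean_delta`, and `evenly_spaced` are made up for this sketch. The 1.63 in the table corresponds to the population standard deviation, so `ddof=0` is used here.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    {"speed": [10, 14, 12, 18, 19, 19]},
    index=pd.to_datetime([
        "2010-01-01 09:20:12", "2010-01-01 09:20:14", "2010-01-01 09:20:16",
        "2010-01-01 09:20:27", "2010-01-01 09:20:28", "2010-01-01 09:20:29",
    ]),
)

window = pd.Timedelta(seconds=10)         # bin length used in the example
max_mean_delta = pd.Timedelta(seconds=2)  # assumed threshold for "evenly spaced"

# Fixed bins aligned to the clock (09:20:10, 09:20:20, ...).
bins = df["speed"].resample(window)
stats = pd.DataFrame({
    "count": bins.count(),
    "avg": bins.mean(),
    "std": bins.std(ddof=0),  # population std, matching the 1.63 in the table
    "last_sample": df.index.to_series().resample(window).max(),
})
stats = stats[stats["count"] > 0].copy()  # drop bins that contain no samples
stats["std_over_avg"] = stats["std"] / stats["avg"]

# Assumed definition: time from the bin start to the last sample, per sample.
bin_start = stats.index.to_series()
stats["mean_delta"] = (stats["last_sample"] - bin_start) / stats["count"]

evenly_spaced = stats[stats["mean_delta"] <= max_mean_delta]
print(evenly_spaced[["avg", "std", "std_over_avg"]])
```

With the sample data, only the 09:20:10 bin passes the filter, because the 09:20:20 bin has a mean delta of 3s under this definition.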



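Edit: I think I combined several things in my question (some of them incorrectly), so I would like to correct and clarify what I am looking for.

Going back to the sample data, I want to split it into irregular 10s periods; that is, if there is a gap in the data, the next 10s period should start at the timestamp of the next available record. (Please ignore the earlier mention of evenly spaced samples. It turns out I misunderstood that requirement, and I can always filter on it at a later stage if needed.) So I want something like this: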

| period                                     | count | avg   |  std | std_over_avg |
| 2010-01-01 09:20:12 - 2010-01-01 09:20:22  | 3     | 12    | 1.63 | 0.136        |
| 2010-01-01 09:20:27 - 2010-01-01 09:20:37  | 3     | 18.6  | 0.577| 0.031        | 
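
A minimal sketch of the grouping described above, assuming the records are sorted and that each period is the half-open interval [start, start + 10s). It is not the accepted or official approach, just one straightforward way to express the rule; `assign_periods` and `period_start` are hypothetical names introduced for this example.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    {"speed": [10, 14, 12, 18, 19, 19]},
    index=pd.to_datetime([
        "2010-01-01 09:20:12", "2010-01-01 09:20:14", "2010-01-01 09:20:16",
        "2010-01-01 09:20:27", "2010-01-01 09:20:28", "2010-01-01 09:20:29",
    ]),
)


def assign_periods(index, length):
    """Return, for each timestamp, the start of the period it falls into.

    A new period opens at the first record that lies at or beyond the end
    of the current period, so periods start at real record timestamps.
    Assumes `index` is sorted ascending.
    """
    starts = []
    current_start = None
    for ts in index:
        if current_start is None or ts >= current_start + length:
            current_start = ts
        starts.append(current_start)
    return starts


length = pd.Timedelta(seconds=10)
labelled = df.assign(period_start=assign_periods(df.index, length))

summary = labelled.groupby("period_start")["speed"].agg(["count", "mean", "std"])
summary["std_over_avg"] = summary["std"] / summary["mean"]
summary["period_end"] = summary.index + length
print(summary)
```

Note that `std` here is pandas' default sample standard deviation (`ddof=1`), which reproduces the 0.577 in the second row of the table; the 1.63 in the first row corresponds to `ddof=0`. The loop is sequential because each period start depends on the previous one, so it is simple rather than fast.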

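1 Answer:

Answer 0 (score: 0)

I found a way to do most of what I want, but it is ugly and slow. Hopefully someone can use this as a starting point for developing something more useful: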

import datetime

group_num = 0
cached_future_time = None
def group_by_time(df, ind):
    global group_num 
    global cached_future_time  
    curr_time = ind
    future_time = ind + datetime.timedelta(minutes=10)
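    # Assume the index is unique and sorted chronologically ascending for this to work.
    # Position of the last record at or before future_time. This replaces
    # get_loc(future_time, method='pad'), which pandas 2.0 no longer supports.
    end = df.index.searchsorted(future_time, side='right') - 1
    start = df.index.get_loc(curr_time)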
    num_records = end - start
    if cached_future_time is not None and curr_time < cached_future_time:
        pass
    elif cached_future_time is not None and curr_time >= cached_future_time:
        group_num += 1
        # Only increase the cached_future_time mark if we have sufficient data points to make this group useful.
        if num_records >= 30:
            cached_future_time = future_time
    elif cached_future_time is None:
        cached_future_time = future_time
    return group_num

grp = df.groupby(lambda x: group_by_time(df, x))

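Edit

OK, I found a more pandas-idiomatic way that is also much faster than the ugly loop above. The flaw in my answer above was thinking that I needed to do most of the work of computing the groups inside the groupby function (and thinking there was no way to apply such a method intelligently across all rows).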

import datetime

# Assumption: `records` has a sorted DatetimeIndex and a "timestamp" column holding the same values.
# The group_num column must exist before it is tested with isnull() below.
records["group_num"] = float("nan")

# Add 10min to each timestamp and shift that column forward by 30 records.
# A row is kept as a candidate when its timestamp is still within 10min of the
# record 30 rows earlier (a sign that this stretch holds at least 30 records per 10min).
records["future"] = records["timestamp"] + datetime.timedelta(minutes=10)
starts = list(records[(records["timestamp"] <= records["future"].shift(30)) & records["group_num"].isnull()].index)

group_num = 1
# For each of those starting timestamps, grab a slice up to 10min in the future
# and apply a group number. 
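for start in starts:
    window = records.loc[start:start + datetime.timedelta(minutes=10)]
    unassigned = window.index[window["group_num"].isnull()]
    if len(unassigned) >= 30:
        # Only label rows that are still null so that we get disjoint groups (no overlaps).
        # Assigning through records.loc updates the frame itself; writing into the
        # sliced-out Series may only change a copy.
        records.loc[unassigned, "group_num"] = group_num
        group_num += 1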
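
Once rows carry a `group_num`, the per-group statistics from the question can be computed with a plain groupby. The sketch below uses a small made-up `records` frame (the timestamps, speeds, and the unlabelled 09:20:20 row are invented for illustration, and the groups are far smaller than the 30-record minimum above) just to show the shape of the result.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `records` after labelling: two groups plus one
# row that was never assigned to a group (group_num is NaN).
records = pd.DataFrame(
    {
        "speed": [10, 14, 12, 50, 18, 19, 19],
        "group_num": [1, 1, 1, np.nan, 2, 2, 2],
    },
    index=pd.to_datetime([
        "2010-01-01 09:20:12", "2010-01-01 09:20:14", "2010-01-01 09:20:16",
        "2010-01-01 09:20:20",
        "2010-01-01 09:20:27", "2010-01-01 09:20:28", "2010-01-01 09:20:29",
    ]),
)

# Keep only rows that belong to a group, then aggregate per group.
labelled = records.dropna(subset=["group_num"])
summary = labelled.groupby("group_num")["speed"].agg(["count", "mean", "std"])
summary["std_over_avg"] = summary["std"] / summary["mean"]

# First timestamp of each group, taken from the index.
summary["period_start"] = (
    labelled.index.to_series().groupby(labelled["group_num"]).min()
)
print(summary)
```

The unlabelled row is excluded from every statistic, which matches the intent of only reporting groups that met the record-count threshold.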