I have a time series with irregularly spaced samples. To get useful data out of it, I need to find 10-minute periods of roughly evenly spaced samples (which I've defined as the average time delta between 2 samples being less than 20 seconds).
Example data (for this example I'll use 10-second periods with an average 2s delta):
| timestamp           | speed |
|---------------------|-------|
| 2010-01-01 09:20:12 | 10    |
| 2010-01-01 09:20:14 | 14    |
| 2010-01-01 09:20:16 | 12    |
| 2010-01-01 09:20:27 | 18    |
| 2010-01-01 09:20:28 | 19    |
| 2010-01-01 09:20:29 | 19    |
The result I'm hoping for is the grouping below. Note that the second group is not included, because its samples are clustered at the end of the 10s period (27, 28, 29), which means the implied extra 7s gap pushes the average delta to 3s.
| timestamp           | avg | std  | std_over_avg |
|---------------------|-----|------|--------------|
| 2010-01-01 09:20:10 | 12  | 1.63 | 0.136        |
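To make the exclusion rule concrete, here is a stdlib-only sketch of the delta calculation for the clustered samples; the 09:20:20 window start and the 7s/1s/1s deltas follow from the example above:

```python
from datetime import datetime

# Samples in the 10s window starting at 09:20:20, clustered at its end.
period_start = datetime(2010, 1, 1, 9, 20, 20)
samples = [datetime(2010, 1, 1, 9, 20, s) for s in (27, 28, 29)]

# Count the implied gap from the window start to the first sample,
# then the deltas between consecutive samples: 7s, 1s, 1s.
deltas = [(samples[0] - period_start).total_seconds()] + [
    (b - a).total_seconds() for a, b in zip(samples, samples[1:])
]
avg_delta = sum(deltas) / len(deltas)  # (7 + 1 + 1) / 3 = 3.0
```

Since 3s exceeds the 2s target delta for the example, the window is rejected.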
**Edit**
I think I combined several things in my question (some of them incorrectly), so I'd like to correct/clarify what I'm looking for.
Going back to the example data, I want to split it into irregular 10s periods; that is, if there is a gap in the data, the next 10s period should start at the timestamp of the next available record. (Please ignore the earlier requirement about evenly spaced samples; it turns out I misunderstood that requirement, and I can always filter on it at a later stage if needed.) So I want something like this:
| period                                    | count | avg  | std   | std_over_avg |
|-------------------------------------------|-------|------|-------|--------------|
| 2010-01-01 09:20:12 - 2010-01-01 09:20:22 | 3     | 12   | 1.63  | 0.136        |
| 2010-01-01 09:20:27 - 2010-01-01 09:20:37 | 3     | 18.6 | 0.577 | 0.031        |
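For the small example, the irregular split can be sketched with a plain loop over the sorted index plus a `groupby` (assuming pandas; `assign_irregular_periods` is a hypothetical helper, not from the question). Note the two tables above use different std conventions (the first matches a population std, ddof=0; the second matches the pandas default sample std, ddof=1), so the sketch's std values follow the pandas default:

```python
from datetime import timedelta

import pandas as pd

# Example data from the question.
df = pd.DataFrame(
    {"speed": [10, 14, 12, 18, 19, 19]},
    index=pd.to_datetime([
        "2010-01-01 09:20:12", "2010-01-01 09:20:14", "2010-01-01 09:20:16",
        "2010-01-01 09:20:27", "2010-01-01 09:20:28", "2010-01-01 09:20:29",
    ]),
)

def assign_irregular_periods(df, width=timedelta(seconds=10)):
    """Label each row with a period number; a new period starts at the
    first record falling outside the previous period's window."""
    labels, period_start, period_num = [], None, -1
    for ts in df.index:  # assumes the index is sorted ascending
        if period_start is None or ts >= period_start + width:
            period_start = ts
            period_num += 1
        labels.append(period_num)
    return pd.Series(labels, index=df.index)

stats = df.groupby(assign_irregular_periods(df))["speed"].agg(["count", "mean", "std"])
stats["std_over_avg"] = stats["std"] / stats["mean"]
```

This yields two periods starting at 09:20:12 and 09:20:27, each with 3 records, matching the table above.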
Answer 0 (score: 0)
I found a way to achieve most of what I wanted, but it's ugly and slow. Hopefully someone can use this as a starting point to develop something more useful:
```python
import datetime

group_num = 0
cached_future_time = None

def group_by_time(df, ind):
    global group_num
    global cached_future_time
    curr_time = ind
    future_time = ind + datetime.timedelta(minutes=10)
    # Assume records are sorted chronologically ascending for this to work.
    end = df.index.get_loc(future_time, method='pad')
    start = df.index.get_loc(curr_time)
    num_records = end - start
    if cached_future_time is not None and curr_time < cached_future_time:
        pass
    elif cached_future_time is not None and curr_time >= cached_future_time:
        group_num += 1
        # Only advance the cached_future_time mark if we have sufficient
        # data points to make this group useful.
        if num_records >= 30:
            cached_future_time = future_time
    elif cached_future_time is None:
        cached_future_time = future_time
    return group_num

grp = df.groupby(lambda x: group_by_time(df, x))
```
**Edit**
OK, I found a more pandas-ic way that is also much faster than the ugly loop above. My mistake in the answer above was thinking I had to do most of the work of computing the groups inside the groupby function (and thinking there was no way to intelligently apply such labelling across all the rows).
```python
import datetime

import numpy as np

# Assumes `records` has a DatetimeIndex plus "timestamp" and "speed" columns.
records["group_num"] = np.nan

# Add 10min to each timestamp and shift the values in that column 30 records
# into the future. We can then find all the timestamps that are 30 records
# newer but still within 10min of the original timestamp (ensuring that we
# have a 10min group with at least 30 records).
records["future"] = records["timestamp"] + datetime.timedelta(minutes=10)
starts = list(records[(records["timestamp"] <= records["future"].shift(30))
                      & records["group_num"].isnull()].index)

group_num = 1
# For each of those starting timestamps, grab a slice up to 10min in the future
# and apply a group number.
for start in starts:
    nulls = records.loc[start:start + datetime.timedelta(minutes=10), "group_num"].isnull()
    if nulls.sum() >= 30:
        # Only apply group_num to null values so that we get disjoint groups
        # (no overlaps). Assign through records.loc rather than a sliced
        # copy so that the values are actually written back to the frame.
        records.loc[nulls[nulls].index, "group_num"] = group_num
        group_num += 1
```
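Once `group_num` is filled in, the summary table from the edit falls out of a plain `groupby`. A sketch on a hand-labelled stand-in frame (the labels mimic what the loop above produces; ungrouped rows keep NaN and are dropped by `groupby` automatically):

```python
import numpy as np
import pandas as pd

# Stand-in for `records` after grouping; speeds match the question's example,
# the trailing row is an illustrative ungrouped record.
records = pd.DataFrame({
    "speed": [10, 14, 12, 18, 19, 19, 7],
    "group_num": [1, 1, 1, 2, 2, 2, np.nan],
})

stats = records.groupby("group_num")["speed"].agg(["count", "mean", "std"])
stats["std_over_avg"] = stats["std"] / stats["mean"]
```

With the real data the same two lines produce one row per 10-minute group.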