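Calculating a rolling average in pandas with time-weighted values

Time: 2014-11-15 16:48:39

Tags: python numpy pandas

I'm new to pandas and am trying to compute a rolling average with a fixed window size. However, I have two lists: one of timestamp tuples (start, end) and one of values. I'd like the former to be used as weights for the latter. I also want to make sure that gaps in the data are identifiable (the timestamps are not necessarily contiguous).

Example lists: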

ts = [(1415969999, 1415970014), (1415970014, 1415970030), (1415970030, 1415970045), (1415970045, 1415970060), (1415970060, 1415970075), (1415970075, 1415970090), (1415970090, 1415970105), (1415970105, 1415970120), (1415970120, 1415970135), (1415970135, 1415970150), (1415970150, 1415970165), (1415970165, 1415970181), (1415970181, 1415970286), (1415970286, 1415970301), (1415970301, 1415970316)...]

values = [8.0, 13.0, 11.75, 7.0, 8.5, 16.0, 16.0, 6.5, 4.0, 8.25, 5.5, 1.0, 0.0, 0.5, 0.5, 0.0, 0.25, 0.0, 0.25, 0.0, 0.5, 0.0, 2.25, 0.0, 0.25, 0.0, 0.25, 0.0, 1.0, 0.25, 0.25, 0.0, 0.25, 0.0, 0.5, 0.25, 0.0, 1.0, 0.0, 0.5...]

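What I'm currently using:

pandas_series = pd.Series(values)
# pd.rolling_mean was removed from later pandas releases; .rolling().mean() replaces it.
# Note that window=90 counts 90 samples; the intent was 90 seconds.
window_averages = pandas_series.rolling(window=90).mean()

But this doesn't take the weights into account. I looked here and here, but couldn't piece it together.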
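To make "using the timestamps as weights" concrete, here is a minimal sketch that assumes the weight of each value is simply the duration of its interval (end minus start). That reading is an assumption; the names `durations` and `weighted_avg` are illustrative, and only the first three sample intervals are used.

```python
import numpy as np

# First three (start, end) intervals and values from the example lists.
ts = [(1415969999, 1415970014), (1415970014, 1415970030), (1415970030, 1415970045)]
values = [8.0, 13.0, 11.75]

# Assumed weighting: each value counts in proportion to its interval length in seconds.
durations = np.array([end - start for start, end in ts])

# np.average computes sum(value * weight) / sum(weight).
weighted_avg = np.average(values, weights=durations)
print(weighted_avg)
```

This gives one weighted average over all the intervals supplied; a rolling version would repeat the same calculation for each window.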

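Edit

I managed to get what I want, but I don't think the solution is optimal. It shows the input I need at the bottom, and it includes the gaps in the data (which I currently mark with -1).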

import pandas as pd

data = [(1415970014, 1415970030, 13.0), (1415970033, 1415970048, 11.75), (1415970048, 1415970053, 3.2)]
start_range = data[0][0]
end_range = data[len(data)-1][1]-1
previous_end_time = start_range
values = []

for t in data:
    start_ts, end_ts, value = t

    empties = []
    while start_ts > previous_end_time:
        empties.append(previous_end_time)
        values.append(-1)
        previous_end_time += 1

    window_length = end_ts-start_ts
    values += [value]*window_length
    previous_end_time = end_ts

s_range_datetime_start = pd.to_datetime(start_range, unit='s')
s_range_datetime_end = pd.to_datetime(end_range, unit='s')
period_range = pd.period_range(s_range_datetime_start, s_range_datetime_end, freq='s')

series = pd.Series(values, period_range)
print(series)

This produces the following result, which essentially expands the data to one-second resolution.

2014-11-14 13:00:14    13.00
2014-11-14 13:00:15    13.00
2014-11-14 13:00:16    13.00
2014-11-14 13:00:17    13.00
2014-11-14 13:00:18    13.00
2014-11-14 13:00:19    13.00
2014-11-14 13:00:20    13.00
2014-11-14 13:00:21    13.00
2014-11-14 13:00:22    13.00
2014-11-14 13:00:23    13.00
2014-11-14 13:00:24    13.00
2014-11-14 13:00:25    13.00
2014-11-14 13:00:26    13.00
2014-11-14 13:00:27    13.00
2014-11-14 13:00:28    13.00
2014-11-14 13:00:29    13.00
2014-11-14 13:00:30    -1.00
2014-11-14 13:00:31    -1.00
2014-11-14 13:00:32    -1.00
2014-11-14 13:00:33    11.75
2014-11-14 13:00:34    11.75
2014-11-14 13:00:35    11.75
2014-11-14 13:00:36    11.75
2014-11-14 13:00:37    11.75
2014-11-14 13:00:38    11.75
2014-11-14 13:00:39    11.75
2014-11-14 13:00:40    11.75
2014-11-14 13:00:41    11.75
2014-11-14 13:00:42    11.75
2014-11-14 13:00:43    11.75
2014-11-14 13:00:44    11.75
2014-11-14 13:00:45    11.75
2014-11-14 13:00:46    11.75
2014-11-14 13:00:47    11.75
2014-11-14 13:00:48     3.20
2014-11-14 13:00:49     3.20
2014-11-14 13:00:50     3.20
2014-11-14 13:00:51     3.20
2014-11-14 13:00:52     3.20

My idea is to apply the rolling mean to this per-second series.
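The idea above could be sketched as follows. This is only one possible approach, with two assumptions that are not in the code above: gaps are stored as NaN instead of -1 so that they do not drag the mean down, and the window is 10 samples purely because the sample data is short (90 would match the 90-second intent). Since every sample covers one second, each interval contributes in proportion to its duration inside the window.

```python
import numpy as np
import pandas as pd

data = [(1415970014, 1415970030, 13.0),
        (1415970033, 1415970048, 11.75),
        (1415970048, 1415970053, 3.2)]

start, end = data[0][0], data[-1][1]

# One sample per second over the whole range, initially all NaN (= gap).
index = pd.date_range(pd.to_datetime(start, unit='s'), periods=end - start, freq='s')
series = pd.Series(np.nan, index=index)

# Fill each [start_ts, end_ts) interval with its value; uncovered seconds stay NaN.
for start_ts, end_ts, value in data:
    series.iloc[start_ts - start:end_ts - start] = value

# Rolling mean over the last 10 one-second samples; NaN samples are skipped,
# and min_periods=1 yields a result as long as at least one sample is present.
rolling = series.rolling(window=10, min_periods=1).mean()
print(rolling.tail())
```

The gaps remain identifiable afterwards through `series.isna()`.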

1 Answer:

Answer 0: (score: 2)

First, package up the data

In [26]: df = pd.DataFrame(ts)

In [27]: df.columns=['start','end']

Your values list is too long (as displayed), so truncate it to match

In [28]: df['value'] = values[:len(df)]

In [29]: df
Out[29]: 
         start         end  value
0   1415969999  1415970014   8.00
1   1415970014  1415970030  13.00
2   1415970030  1415970045  11.75
3   1415970045  1415970060   7.00
4   1415970060  1415970075   8.50
5   1415970075  1415970090  16.00
6   1415970090  1415970105  16.00
7   1415970105  1415970120   6.50
8   1415970120  1415970135   4.00
9   1415970135  1415970150   8.25
10  1415970150  1415970165   5.50
11  1415970165  1415970181   1.00
12  1415970181  1415970286   0.00
13  1415970286  1415970301   0.50
14  1415970301  1415970316   0.50

Convert the timestamps to actual datetimes

In [30]: df['start'] = pd.to_datetime(df['start'],unit='s')

In [31]: df['end'] = pd.to_datetime(df['end'],unit='s')

It sounds like you want to resample everything into 90-second windows.

In [32]: df.groupby(pd.Grouper(key='start',freq='90s'))['value'].mean()
Out[32]: 
start
2014-11-14 12:58:30     8.000
2014-11-14 13:00:00    11.250
2014-11-14 13:01:30     6.875
2014-11-14 13:03:00     0.000
2014-11-14 13:04:30     0.500
Freq: 90S, Name: value, dtype: float64

I'm not sure what you mean by weighting the data. Please provide another example of the output.
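If the weighting is meant to be by interval duration, one possible reading is a duration-weighted mean per 90-second bin. The sketch below is an assumption about the intent, not something confirmed by the question; the `duration` and `weighted` columns are illustrative names, and each interval is assigned wholly to the bin that contains its start, so intervals straddling a bin boundary are not split.

```python
import pandas as pd

# First three intervals and values from the question.
ts = [(1415969999, 1415970014), (1415970014, 1415970030), (1415970030, 1415970045)]
values = [8.0, 13.0, 11.75]

df = pd.DataFrame(ts, columns=['start', 'end'])
df['value'] = values

# Assumed weight: interval length in seconds, computed before converting to datetimes.
df['duration'] = df['end'] - df['start']
df['weighted'] = df['value'] * df['duration']
df['start'] = pd.to_datetime(df['start'], unit='s')

# Per 90-second bin: sum(value * duration) / sum(duration).
grouped = df.groupby(pd.Grouper(key='start', freq='90s'))
weighted_mean = grouped['weighted'].sum() / grouped['duration'].sum()
print(weighted_mean)
```

Bins that contain no intervals have a total duration of zero and come out as NaN, which keeps gaps visible.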