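I'm new to pandas and I'm trying to compute a rolling mean over a fixed window size. I have two lists: one holds (start, end) timestamp tuples and the other holds the values. I want the former (the duration of each interval) to be used as the weight for the latter. I also want gaps in the data to remain identifiable, since the timestamps are not necessarily contiguous.
Example lists: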
ts = [(1415969999, 1415970014), (1415970014, 1415970030), (1415970030, 1415970045), (1415970045, 1415970060), (1415970060, 1415970075), (1415970075, 1415970090), (1415970090, 1415970105), (1415970105, 1415970120), (1415970120, 1415970135), (1415970135, 1415970150), (1415970150, 1415970165), (1415970165, 1415970181), (1415970181, 1415970286), (1415970286, 1415970301), (1415970301, 1415970316)...]
values = [8.0, 13.0, 11.75, 7.0, 8.5, 16.0, 16.0, 6.5, 4.0, 8.25, 5.5, 1.0, 0.0, 0.5, 0.5, 0.0, 0.25, 0.0, 0.25, 0.0, 0.5, 0.0, 2.25, 0.0, 0.25, 0.0, 0.25, 0.0, 1.0, 0.25, 0.25, 0.0, 0.25, 0.0, 0.5, 0.25, 0.0, 1.0, 0.0, 0.5...]
What I'm currently using is:
import pandas as pd
pandas_series = pd.Series(values)
window_averages = pandas_series.rolling(window=90).mean() # 90 would be seconds here
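But this does not take the weights into account. I looked here and here, but could not piece it together.
Edit
I managed to get what I want, but I don't think the solution is optimal. It shows the input I need at the bottom and includes the gaps in the data (which I mark with -1 for now).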
import pandas as pd
# (start_ts, end_ts, value) tuples; there is a gap between 1415970030 and 1415970033
data = [(1415970014, 1415970030, 13.0), (1415970033, 1415970048, 11.75), (1415970048, 1415970053, 3.2)]
start_range = data[0][0]
end_range = data[-1][1] - 1
previous_end_time = start_range
values = []
for start_ts, end_ts, value in data:
    # Mark every uncovered second before this interval with -1
    while start_ts > previous_end_time:
        values.append(-1)
        previous_end_time += 1
    # Repeat the value once for each second of the interval
    values += [value] * (end_ts - start_ts)
    previous_end_time = end_ts
s_range_datetime_start = pd.to_datetime(start_range, unit='s')
s_range_datetime_end = pd.to_datetime(end_range, unit='s')
period_range = pd.period_range(s_range_datetime_start, s_range_datetime_end, freq='s')
series = pd.Series(values, period_range)
print(series)
This produces the following result, which basically expands the data to one row per second.
2014-11-14 13:00:14 13.00
2014-11-14 13:00:15 13.00
2014-11-14 13:00:16 13.00
2014-11-14 13:00:17 13.00
2014-11-14 13:00:18 13.00
2014-11-14 13:00:19 13.00
2014-11-14 13:00:20 13.00
2014-11-14 13:00:21 13.00
2014-11-14 13:00:22 13.00
2014-11-14 13:00:23 13.00
2014-11-14 13:00:24 13.00
2014-11-14 13:00:25 13.00
2014-11-14 13:00:26 13.00
2014-11-14 13:00:27 13.00
2014-11-14 13:00:28 13.00
2014-11-14 13:00:29 13.00
2014-11-14 13:00:30 -1.00
2014-11-14 13:00:31 -1.00
2014-11-14 13:00:32 -1.00
2014-11-14 13:00:33 11.75
2014-11-14 13:00:34 11.75
2014-11-14 13:00:35 11.75
2014-11-14 13:00:36 11.75
2014-11-14 13:00:37 11.75
2014-11-14 13:00:38 11.75
2014-11-14 13:00:39 11.75
2014-11-14 13:00:40 11.75
2014-11-14 13:00:41 11.75
2014-11-14 13:00:42 11.75
2014-11-14 13:00:43 11.75
2014-11-14 13:00:44 11.75
2014-11-14 13:00:45 11.75
2014-11-14 13:00:46 11.75
2014-11-14 13:00:47 11.75
2014-11-14 13:00:48 3.20
2014-11-14 13:00:49 3.20
2014-11-14 13:00:50 3.20
2014-11-14 13:00:51 3.20
2014-11-14 13:00:52 3.20
My idea is to apply the rolling mean to this per-second series.
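A minimal sketch of that idea, under a few assumptions that are not in the code above: gaps are stored as NaN instead of -1 so they do not pull the mean down, a `DatetimeIndex` is used instead of a period index, and `min_periods=1` is chosen so that partially filled windows still produce a value. The names `per_second` and `rolling` are only for illustration.

```python
import numpy as np
import pandas as pd

data = [(1415970014, 1415970030, 13.0),
        (1415970033, 1415970048, 11.75),
        (1415970048, 1415970053, 3.2)]

# One row per second over the whole range; uncovered seconds stay NaN.
index = pd.date_range(pd.to_datetime(data[0][0], unit='s'),
                      pd.to_datetime(data[-1][1] - 1, unit='s'),
                      freq='s')
per_second = pd.Series(np.nan, index=index)

for start_ts, end_ts, value in data:
    start = pd.to_datetime(start_ts, unit='s')
    last = pd.to_datetime(end_ts - 1, unit='s')
    # Label-based slicing with .loc includes both endpoints.
    per_second.loc[start:last] = value

# A window of 90 rows covers 90 seconds because the index has one row per second.
# NaN rows are skipped by mean(), so gaps do not count toward the average.
rolling = per_second.rolling(window=90, min_periods=1).mean()
print(rolling.tail())
```

Because every second of an interval becomes its own row, longer intervals automatically contribute more rows to each window, which is one way of weighting by duration. The gaps can still be found afterwards with `per_second.isna()`.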
Answer 0 (score: 2)
First, package up the data
In [26]: df = DataFrame(ts)
In [27]: df.columns=['start','end']
Your values list is longer than the timestamps list (from what's shown), so truncate it
In [28]: df['value'] = values[:len(df)]
In [29]: df
Out[29]:
start end value
0 1415969999 1415970014 8.00
1 1415970014 1415970030 13.00
2 1415970030 1415970045 11.75
3 1415970045 1415970060 7.00
4 1415970060 1415970075 8.50
5 1415970075 1415970090 16.00
6 1415970090 1415970105 16.00
7 1415970105 1415970120 6.50
8 1415970120 1415970135 4.00
9 1415970135 1415970150 8.25
10 1415970150 1415970165 5.50
11 1415970165 1415970181 1.00
12 1415970181 1415970286 0.00
13 1415970286 1415970301 0.50
14 1415970301 1415970316 0.50
Make the timestamps actual datetimes
In [30]: df['start'] = pd.to_datetime(df['start'],unit='s')
In [31]: df['end'] = pd.to_datetime(df['end'],unit='s')
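With real datetimes in place, interval lengths and gaps can be read straight off the frame. The following is only a sketch: it uses the three-row data from the question's edit (which does contain a gap), and the column names `duration` and `gap_before` are made up for illustration.

```python
import pandas as pd

ts = [(1415970014, 1415970030), (1415970033, 1415970048), (1415970048, 1415970053)]
df = pd.DataFrame(ts, columns=['start', 'end'])
df['start'] = pd.to_datetime(df['start'], unit='s')
df['end'] = pd.to_datetime(df['end'], unit='s')

# Length of each interval in seconds (a candidate weight).
df['duration'] = (df['end'] - df['start']).dt.total_seconds()

# Seconds between the previous interval's end and this interval's start.
# The first row has no predecessor, so its gap is set to 0.
df['gap_before'] = (df['start'] - df['end'].shift()).dt.total_seconds().fillna(0)

print(df[['duration', 'gap_before']])
```

Any row with `gap_before > 0` starts after a hole in the data.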
It sounds like you want to resample everything into 90-second windows.
In [32]: df.groupby(pd.Grouper(key='start',freq='90s'))['value'].mean()
Out[32]:
start
2014-11-14 12:58:30 8.000
2014-11-14 13:00:00 11.250
2014-11-14 13:01:30 6.875
2014-11-14 13:03:00 0.000
2014-11-14 13:04:30 0.500
Freq: 90S, Name: value, dtype: float64
I'm not sure what you mean by weighting the data. Please provide another example of the output you expect.
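If "weighting" means that each value should count in proportion to the length of its interval, one possible reading can be sketched as below. This is an assumption about the intent, not something the question confirms. The sketch uses only the first six intervals, and the column names `duration` and `weighted` are hypothetical.

```python
import pandas as pd

ts = [(1415969999, 1415970014), (1415970014, 1415970030), (1415970030, 1415970045),
      (1415970045, 1415970060), (1415970060, 1415970075), (1415970075, 1415970090)]
values = [8.0, 13.0, 11.75, 7.0, 8.5, 16.0]

df = pd.DataFrame(ts, columns=['start', 'end'])
df['value'] = values
df['duration'] = df['end'] - df['start']        # interval length in seconds, used as the weight
df['weighted'] = df['value'] * df['duration']
df['start'] = pd.to_datetime(df['start'], unit='s')

# Per 90-second bin: sum(value * duration) / sum(duration).
sums = df.groupby(pd.Grouper(key='start', freq='90s'))[['weighted', 'duration']].sum()
weighted_mean = sums['weighted'] / sums['duration']
print(weighted_mean)
```

Note that each interval is assigned entirely to the bin that contains its start time; an interval that straddles a bin boundary is not split between the two bins here.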