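I have a large time-series dataset in a .csv file. The file has two columns:
values: these are the sample values.
dttm_utc: these are the timestamps at which the samples were collected.
I imported the data into pandas with pd.read_csv(..., parse_dates=["dttm_utc"]). When I print the first 50 rows of the dttm_utc column, they look like this: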
0 2012-01-01 00:05:00
1 2012-01-01 00:10:00
2 2012-01-01 00:15:00
3 2012-01-01 00:20:00
4 2012-01-01 00:25:00
5 2012-01-01 00:30:00
6 2012-01-01 00:35:00
7 2012-01-01 00:40:00
8 2012-01-01 00:45:00
9 2012-01-01 00:50:00
10 2012-01-01 00:55:00
11 2012-01-01 01:00:00
12 2012-01-01 01:05:00
13 2012-01-01 01:10:00
14 2012-01-01 01:15:00
15 2012-01-01 01:20:00
16 2012-01-01 01:25:00
17 2012-01-01 01:30:00
18 2012-01-01 01:35:00
19 2012-01-01 01:40:00
20 2012-01-01 01:45:00
21 2012-01-01 01:50:00
22 2012-01-01 01:55:00
23 2012-01-01 02:00:00
24 2012-01-01 02:05:00
25 2012-01-01 02:10:00
26 2012-01-01 02:15:00
27 2012-01-01 02:20:00
28 2012-01-01 02:25:00
29 2012-01-01 02:30:00
30 2012-01-01 02:35:00
31 2012-01-01 02:40:00
32 2012-01-01 02:45:00
33 2012-01-01 02:50:00
34 2012-01-01 02:55:00
35 2012-01-01 03:00:00
36 2012-01-01 03:05:00
37 2012-01-01 03:10:00
38 2012-01-01 03:15:00
39 2012-01-01 03:20:00
40 2012-01-01 03:25:00
41 2012-01-01 03:30:00
42 2012-01-01 03:35:00
43 2012-01-01 03:40:00
44 2012-01-01 03:45:00
45 2012-01-01 03:50:00
46 2012-01-01 03:55:00
47 2012-01-01 04:00:00
48 2012-01-01 04:05:00
49 2012-01-01 04:10:00
Name: dttm_utc, dtype: datetime64[ns]
Now, what I want to achieve is:
At the moment a sample is taken every 5 minutes. If that changes to, say, every 2 or 10 minutes, I want my solution to keep working.
Answer 0: (score: 2)
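Your sample data is a Series, but your question asks about the sum and mean of row values, so it is not clear to me what you are trying to sum, and there is no sample data for it.
I think what you are interested in is resampling, but that requires the datetime column (dttm_utc) to be in the index (or to be passed through the on= argument of DataFrame.resample).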
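import pandas as pd

# pd.date_range replaces the removed pd.DatetimeIndex(start=..., periods=...) constructor
s = pd.Series(pd.date_range(start='2012-01-01 00:05:00', periods=50,
                            freq=pd.offsets.Minute(n=5)), name='dttm_utc')
s.reset_index().set_index('dttm_utc').resample(pd.offsets.Hour()).agg(['sum', 'mean'])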
This gives you the following, but the columns are a MultiIndex, which makes things a bit more complicated.
index
sum mean
dttm_utc
2012-01-01 00:00:00 55 5.0
2012-01-01 01:00:00 198 16.5
2012-01-01 02:00:00 342 28.5
2012-01-01 03:00:00 486 40.5
2012-01-01 04:00:00 144 48.0
If you want to drop the MultiIndex (the multi-level columns), you can do this:
new_s = s.reset_index().set_index('dttm_utc').resample(pd.offsets.Hour()).agg(['sum', 'mean'])
# drop the outer column level ('index'), keeping only 'sum' and 'mean'
new_s.columns = new_s.columns.droplevel(level=0)
sum mean
dttm_utc
2012-01-01 00:00:00 55 5.0
2012-01-01 01:00:00 198 16.5
2012-01-01 02:00:00 342 28.5
2012-01-01 03:00:00 486 40.5
2012-01-01 04:00:00 144 48.0
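As a possible variation on the above, selecting a single column before resampling should avoid the multi-level columns altogether, since aggregating a Series with a list of functions returns a frame with flat column names. The sketch below is only an illustration: the frame is a made-up stand-in for the question's CSV, and the numbers in the values column (0 to 49) are an assumption chosen to match the output shown above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the CSV in the question: 50 samples, 5 minutes apart.
# The 'values' data is invented for illustration (0..49).
df = pd.DataFrame({
    "dttm_utc": pd.date_range("2012-01-01 00:05:00", periods=50,
                              freq=pd.offsets.Minute(5)),
    "values": np.arange(50),
})

# Put the timestamps in the index, pick one column, then resample to hours.
# Because a single Series is aggregated, the result has flat columns.
hourly = (
    df.set_index("dttm_utc")["values"]
      .resample(pd.offsets.Hour())
      .agg(["sum", "mean"])
)
print(hourly)
```

With real data you would replace the made-up frame with the result of pd.read_csv(..., parse_dates=["dttm_utc"]).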
Answer 1: (score: 1)
import numpy as np
import pandas as pd

# imitation of the dataset, with random samples in column 'data1'
df = pd.DataFrame({'dttm_utc': pd.date_range('1/1/2012', periods=50, freq=pd.offsets.Minute(n=5))})
df['data1'] = np.random.randint(0, 500, len(df))
In [308]:df
Out[308]:
data1
dttm_utc
2012-01-01 00:00:00 379
2012-01-01 00:05:00 387
2012-01-01 00:10:00 241
2012-01-01 00:15:00 197
...
# set column 'dttm_utc' as DatetimeIndex for downsampling to hours
In [309]: df.set_index('dttm_utc', inplace=True)
# from here on, the same approach as in Jarad's answer
In [310]: df.resample('h').agg(['sum', 'mean'])
Out[310]:
data1
sum mean
dttm_utc
2012-01-01 00:00:00 3007 250.583333
2012-01-01 01:00:00 2832 236.000000
2012-01-01 02:00:00 3177 264.750000
2012-01-01 03:00:00 3376 281.333333
2012-01-01 04:00:00 402 201.000000
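On the question's requirement that the solution keep working if the sampling interval changes: resampling groups rows by the timestamps themselves, so the same call should work for 2-, 5- or 10-minute data without modification. Here is a small sketch to check that. Both helper functions and the all-ones data are hypothetical, introduced only for this illustration.

```python
import pandas as pd

def hourly_stats(df):
    """Hypothetical helper: hourly sum and mean of the 'values' column."""
    return (
        df.set_index("dttm_utc")["values"]
          .resample(pd.offsets.Hour())
          .agg(["sum", "mean"])
    )

def make_samples(minutes):
    """Build two hours of made-up data sampled every `minutes` minutes, all values 1."""
    stamps = pd.date_range("2012-01-01 00:00:00", "2012-01-01 01:59:00",
                           freq=pd.offsets.Minute(minutes))
    return pd.DataFrame({"dttm_utc": stamps, "values": 1})

# Since every value is 1, each hourly sum equals the number of samples in that hour.
for minutes in (2, 5, 10):
    print(minutes, hourly_stats(make_samples(minutes))["sum"].tolist())
```

Only the number of rows that fall into each hourly bin changes with the interval; the resampling code itself stays the same.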