A newer, more general version of this question was posted as "pandas groupby: TOP 3 values in each group and store in DataFrame", and it received a solution there.
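In this example I create a DataFrame df containing some random data at 5-minute intervals.
I want to create a DataFrame gdf (grouped df) that lists the 3 highest values for each hour.
That is, from this series of values: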
VAL
TIME
2017-12-08 00:00:00 29
2017-12-08 00:05:00 56
2017-12-08 00:10:00 82
2017-12-08 00:15:00 13
2017-12-08 00:20:00 35
2017-12-08 00:25:00 53
2017-12-08 00:30:00 25
2017-12-08 00:35:00 23
2017-12-08 00:40:00 21
2017-12-08 00:45:00 12
2017-12-08 00:50:00 15
2017-12-08 00:55:00 9
2017-12-08 01:00:00 13
2017-12-08 01:05:00 87
2017-12-08 01:10:00 9
2017-12-08 01:15:00 63
2017-12-08 01:20:00 62
2017-12-08 01:25:00 52
2017-12-08 01:30:00 43
2017-12-08 01:35:00 77
2017-12-08 01:40:00 95
2017-12-08 01:45:00 79
2017-12-08 01:50:00 77
2017-12-08 01:55:00 5
2017-12-08 02:00:00 78
2017-12-08 02:05:00 41
2017-12-08 02:10:00 10
2017-12-08 02:15:00 10
2017-12-08 02:20:00 88
I am very close to a solution, but I cannot find the right syntax for the last step. What I have so far (largest3) is:
                                         VAL
TIME                TIME
2017-12-08 00:00:00 2017-12-08 00:10:00   82
                    2017-12-08 00:05:00   56
                    2017-12-08 00:25:00   53
2017-12-08 01:00:00 2017-12-08 01:40:00   95
                    2017-12-08 01:05:00   87
                    2017-12-08 01:45:00   79
2017-12-08 02:00:00 2017-12-08 02:20:00   88
                    2017-12-08 02:00:00   78
                    2017-12-08 02:05:00   41
From this I would like to obtain gdf (the time at which each maximum occurred does not matter):
VAL1 VAL2 VAL3
TIME
2017-12-08 00:00:00 82 56 53
2017-12-08 01:00:00 95 87 79
2017-12-08 02:00:00 88 78 41
Here is the code:
import pandas as pd
from datetime import datetime, timedelta
import numpy as np
# test data
df = pd.DataFrame()
date_ref = datetime(2017,12,8,0,0,0)
days = pd.date_range(date_ref, date_ref + timedelta(0.1), freq='5min')
np.random.seed(seed=1111)
data1 = np.random.randint(1, high=100, size=len(days))
df = pd.DataFrame({'TIME': days, 'VAL': data1})
df = df.set_index('TIME')
print(df)
print("----")
# groupby
group1 = df.groupby(pd.Grouper(freq='1H'))
largest3 = pd.DataFrame(group1['VAL'].nlargest(3))
print(largest3)
gdf = pd.DataFrame()
# ???? <-------------------
Thanks in advance.
Answer 0 (score: 4)
Note: this solution only works if every group has at least 3 rows.
Try the following:
In [59]: x = (df.groupby(pd.Grouper(freq='H'))['VAL']
                .apply(lambda x: x.nlargest(3))
                .reset_index(level=1, drop=True)
                .to_frame('VAL'))
In [60]: x
Out[60]:
VAL
TIME
2017-12-08 00:00:00 82
2017-12-08 00:00:00 56
2017-12-08 00:00:00 53
2017-12-08 01:00:00 95
2017-12-08 01:00:00 87
2017-12-08 01:00:00 79
2017-12-08 02:00:00 88
2017-12-08 02:00:00 78
2017-12-08 02:00:00 41
In [61]: x.set_index(np.arange(len(x)) % 3, append=True)['VAL'].unstack().add_prefix('VAL')
Out[61]:
VAL0 VAL1 VAL2
TIME
2017-12-08 00:00:00 82 56 53
2017-12-08 01:00:00 95 87 79
2017-12-08 02:00:00 88 78 41
Some explanation:
In [94]: x.set_index(np.arange(len(x)) % 3, append=True)
Out[94]:
                       VAL
TIME
2017-12-08 00:00:00 0   82
                    1   56
                    2   53
2017-12-08 01:00:00 0   95
                    1   87
                    2   79
2017-12-08 02:00:00 0   88
                    1   78
                    2   41
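To make the step above a little more concrete, here is a small sketch of where the extra 0, 1, 2 level comes from and why it is needed. The frame x is rebuilt by hand from the values shown in Out[60]; the names hours, pos and indexed are only illustrative and do not appear in the original answer.

```python
import numpy as np
import pandas as pd

# Rebuild the intermediate frame x by hand (values copied from Out[60]).
hours = pd.to_datetime(["2017-12-08 00:00", "2017-12-08 01:00", "2017-12-08 02:00"])
x = pd.DataFrame(
    {"VAL": [82, 56, 53, 95, 87, 79, 88, 78, 41]},
    index=hours.repeat(3).rename("TIME"),
)

# Position of each row inside its block of three: 0, 1, 2, 0, 1, 2, ...
pos = np.arange(len(x)) % 3
print(pos)  # [0 1 2 0 1 2 0 1 2]

# TIME alone repeats three times per hour, so it is not a unique index.
# Appending pos as a second level makes every (TIME, pos) pair unique,
# which is what a later unstack() needs in order to pivot pos into columns.
indexed = x.set_index(pos, append=True)
print(x.index.is_unique, indexed.index.is_unique)  # False True
```

This also shows why the approach assumes exactly three rows per hour: the modulo only counts rows, so a shorter group would shift the positions of every group after it.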
In [95]: x.set_index(np.arange(len(x)) % 3, append=True)['VAL'].unstack()
Out[95]:
0 1 2
TIME
2017-12-08 00:00:00 82 56 53
2017-12-08 01:00:00 95 87 79
2017-12-08 02:00:00 88 78 41
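The note at the top of this answer says the approach assumes every hour has at least 3 rows. As a rough sketch of one possible way around that limitation (an alternative, not part of the original answer), the rank can be computed per hour with cumcount after sorting, so hours with fewer rows simply end up with NaN in the missing columns. The sample data below is made up for illustration, the names s, hour, rank and top are hypothetical, and '60min' is used in place of 'H' only because newer pandas releases deprecate the uppercase alias.

```python
import pandas as pd

# Made-up sample: hour 00 has four rows, hour 01 has only two.
idx = pd.to_datetime([
    "2017-12-08 00:00", "2017-12-08 00:05", "2017-12-08 00:10", "2017-12-08 00:15",
    "2017-12-08 01:00", "2017-12-08 01:05",
])
df = pd.DataFrame({"VAL": [29, 56, 82, 13, 87, 9]}, index=idx)
df.index.name = "TIME"

# Sort all values from largest to smallest, then label each row with its hour.
s = df["VAL"].sort_values(ascending=False)
hour = s.index.floor("60min")

# Within each hour, number the rows 1, 2, 3, ... in descending order of VAL.
rank = s.groupby(hour).cumcount() + 1

# Keep ranks 1..3 and pivot the rank into columns VAL1, VAL2, VAL3.
top = pd.DataFrame({"TIME": hour, "rank": rank.to_numpy(), "VAL": s.to_numpy()})
top = top[top["rank"] <= 3]
gdf = top.pivot(index="TIME", columns="rank", values="VAL").add_prefix("VAL")
print(gdf)
```

Here the columns come out as VAL1 to VAL3, matching the layout asked for in the question. Because the second hour has only two rows, its VAL3 entry is NaN and the columns are stored as floats.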