Pandas groupby: 3 largest values per period across multiple columns

Time: 2017-12-15 09:31:27

Tags: python pandas dataframe group-by pandas-groupby

I have this data:

                     val1  val2  val3
dt                                   
2017-12-15 00:00:00    81    90    79
2017-12-15 00:01:00    67    85    80
2017-12-15 00:02:00     4    41    37
2017-12-15 00:03:00    61    68    29
2017-12-15 00:04:00    49     6    56
2017-12-15 00:05:00    94    13    93
2017-12-15 00:06:00    91     3    75
2017-12-15 00:07:00    94    81     7
2017-12-15 00:08:00    55    59    33
2017-12-15 00:09:00    97    89    26
2017-12-15 00:10:00    17    75    88
2017-12-15 00:11:00    39    40    96
2017-12-15 00:12:00    61    20    70
2017-12-15 00:13:00    62    31    93
2017-12-15 00:14:00     7    26    29

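I want to find the 3 largest values in every 5-minute period. The largest values can be in any column (val1, val2, val3), so the search must cover all 15 values available in each 5-minute window.

So far I can only find the largest values within a single column. Is it possible to search for nlargest across multiple columns?

Here is the code that generates the data and finds the largest values for val1: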

import pandas as pd
import numpy as np
from datetime import datetime, timedelta

date_ref = datetime(2017, 12, 15, 0,0,0)
dtime = pd.date_range(date_ref, freq='1min', periods=15)

np.random.seed(seed=1115)
data1 = np.random.randint(1, high=100, size=len(dtime))
data2 = np.random.randint(1, high=100, size=len(dtime))
data3 = np.random.randint(1, high=100, size=len(dtime))

df = pd.DataFrame({'dt': dtime, 'val1': data1, 'val2': data2, 'val3': data3})
df.set_index('dt', inplace=True)

print(df)

group = df.groupby(pd.Grouper(freq='5min'))

max_only_for_val1 = (pd.DataFrame(
        group["val1"]
        .nlargest(3))
        .reset_index(level=1, drop=True)
    )

print(max_only_for_val1)

This is the output:

                     val1
dt                       
2017-12-15 00:00:00    81
2017-12-15 00:00:00    67
2017-12-15 00:00:00    61
2017-12-15 00:05:00    97
2017-12-15 00:05:00    94
2017-12-15 00:05:00    94
2017-12-15 00:10:00    62
2017-12-15 00:10:00    61
2017-12-15 00:10:00    39

1 answer:

Answer 0 (score: 1)

Since it does not matter which column a value comes from, let's reshape your data into a single value column.

# drop(columns=...) replaces the positional axis argument, which pandas 2.0 no longer accepts
df = df.reset_index().melt('dt').drop(columns='variable')
df.head(10)

                   dt  value
0 2017-12-15 00:00:00     81
1 2017-12-15 00:01:00     67
2 2017-12-15 00:02:00      4
3 2017-12-15 00:03:00     61
4 2017-12-15 00:04:00     49
5 2017-12-15 00:05:00     94
6 2017-12-15 00:06:00     91
7 2017-12-15 00:07:00     94
8 2017-12-15 00:08:00     55
9 2017-12-15 00:09:00     97
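
To make the reshaping step concrete, here is a minimal sketch of what melt does on a tiny made-up frame (the names small and long_df and the values are illustrative assumptions, not the question's data). Each value column is stacked under the next, so every timestamp appears once per original column.

```python
import pandas as pd

# Tiny hypothetical frame: two timestamps, two value columns
small = pd.DataFrame(
    {"val1": [1, 2], "val2": [3, 4]},
    index=pd.DatetimeIndex(["2017-12-15 00:00", "2017-12-15 00:01"], name="dt"),
)

# Move the index into a column, then unpivot the value columns into long form
long_df = small.reset_index().melt("dt")

print(list(long_df.columns))      # ['dt', 'variable', 'value']
print(long_df["value"].tolist())  # [1, 2, 3, 4]
```

The variable column records which original column each value came from; it is dropped in the answer because the source column does not matter here.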

Now, call groupby + apply -

def get_max3(x):
    return x.sort_values(ascending=False).head(3)

df = df.groupby(pd.Grouper(key='dt', freq='5min'))['value']\
       .apply(get_max3)\
       .reset_index(0)\
       .reset_index(drop=True)

                   dt  value
0 2017-12-15 00:00:00     90
1 2017-12-15 00:00:00     85
2 2017-12-15 00:00:00     81
3 2017-12-15 00:05:00     97
4 2017-12-15 00:05:00     94
5 2017-12-15 00:05:00     94
6 2017-12-15 00:10:00     96
7 2017-12-15 00:10:00     93
8 2017-12-15 00:10:00     88
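
As a possible alternative to a custom function, the nlargest method that the question already uses on one column could be applied to the melted value column instead. This is a sketch under assumptions: the frame below is small made-up data (not the question's random values), and the names wide, long_df and top3 are illustrative.

```python
import pandas as pd

# Hypothetical data: 10 one-minute rows, so there are two 5-minute windows
idx = pd.date_range("2017-12-15", freq="1min", periods=10, name="dt")
wide = pd.DataFrame(
    {"val1": range(0, 10), "val2": range(10, 20), "val3": range(20, 30)},
    index=idx,
)

# Long form: one row per (timestamp, value), whatever the source column was
long_df = wide.reset_index().melt("dt")

top3 = (
    long_df.groupby(pd.Grouper(key="dt", freq="5min"))["value"]
           .nlargest(3)                     # 3 largest values per 5-minute window
           .reset_index(level=1, drop=True) # drop the original row labels
           .reset_index()                   # turn the window start into a column
)

print(top3["value"].tolist())  # [24, 23, 22, 29, 28, 27]
```

In this made-up frame all of the top values happen to come from val3; with real data they can come from any column, since the melt step pools them before grouping.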

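An alternative definition of get_max3 using numpy.sort -

def get_max3(x):
    # np.sort is ascending; [:-4:-1] walks back from the end and keeps the 3 largest
    return np.sort(x.values)[:-4:-1]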
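
The reversed slice in the NumPy variant is easy to get wrong, so here is a small sketch of how two similar-looking slices behave on an ascending array (the array values and the variable names are made up for illustration).

```python
import numpy as np

values = np.array([5, 1, 9, 3, 7])
ordered = np.sort(values)   # ascending: [1 3 5 7 9]

# Start at the last element, step backwards, stop before index -4
top3 = ordered[:-4:-1]      # [9 7 5]

# Start at index -4 and step backwards to the beginning
rest = ordered[-4::-1]      # [3 1]

print(top3.tolist())  # [9, 7, 5]
print(rest.tolist())  # [3, 1]
```

Only the first form returns the three largest values in descending order; the second returns everything except those three.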