Question

I have the following DataFrame:

date,       industry, symbol, roc
25-02-2015, Health,   abc,    200
25-02-2015, Health,   xyz,    150
25-02-2015, Mining,   tyr,    45
25-02-2015, Mining,   ujk,    70
26-02-2015, Health,   abc,    60
26-02-2015, Health,   xyz,    310
26-02-2015, Mining,   tyr,    65
26-02-2015, Mining,   ujk,    23

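I need to determine the average 'roc', max 'roc', and min 'roc', as well as how many symbols exist, for each date + industry. In other words, I need to group by date and industry and then compute the mean, max, min, and so on.

So far I am doing the following, which works but seems very slow and inefficient: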

sector_df = primary_df.groupby(['date', 'industry'], sort=True).mean()
tmp_max_df = primary_df.groupby(['date', 'industry'], sort=True).max()
tmp_min_df = primary_df.groupby(['date', 'industry'], sort=True).min()
tmp_count_df = primary_df.groupby(['date', 'industry'], sort=True).count()
sector_df['max_roc'] = tmp_max_df['roc']
sector_df['min_roc'] = tmp_min_df['roc']
sector_df['count'] = tmp_count_df['roc']
sector_df.reset_index(inplace=True)
sector_df.set_index(['date', 'industry'], inplace=True)

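The code above works and produces a DataFrame indexed by date + industry, showing the min/max 'roc' for each date + industry, as well as how many symbols exist for each date + industry.

I am basically performing a complete groupby several times (to determine the mean, max, min, and count of 'roc'). This is very slow because it does the same thing over and over.

Is there a way to do the groupby just once, then perform mean, max, etc. on that object and assign the results to sector_df?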

Answer 1

You want to perform the aggregation using agg:

In [72]:

df.groupby(['date','industry']).agg([pd.Series.mean, pd.Series.max, pd.Series.min, pd.Series.count])
Out[72]:
                       roc                
                      mean  max  min count
date       industry                       
2015-02-25 Health    175.0  200  150     2
           Mining     57.5   70   45     2
2015-02-26 Health    185.0  310   60     2
           Mining     44.0   65   23     2

This lets you pass an iterable (a list in this case) of the functions to apply.

Edit

To access an individual result, you need to pass a tuple for each axis:
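As a possible alternative to passing the pd.Series methods, named aggregation (available in pandas 0.25 and later) accepts string function names and yields flat column names in a single pass. A minimal sketch, assuming the sample data from the question; the output column names mean_roc, max_roc, min_roc, and count are chosen here for illustration only:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    [['25-02-2015', 'Health', 'abc', 200],
     ['25-02-2015', 'Health', 'xyz', 150],
     ['25-02-2015', 'Mining', 'tyr', 45],
     ['25-02-2015', 'Mining', 'ujk', 70],
     ['26-02-2015', 'Health', 'abc', 60],
     ['26-02-2015', 'Health', 'xyz', 310],
     ['26-02-2015', 'Mining', 'tyr', 65],
     ['26-02-2015', 'Mining', 'ujk', 23]],
    columns=['date', 'industry', 'symbol', 'roc'])

# One groupby, several aggregations of 'roc'.
# Each keyword is an output column name (illustrative); each value is
# a (source column, aggregation function) pair.
sector_df = df.groupby(['date', 'industry']).agg(
    mean_roc=('roc', 'mean'),
    max_roc=('roc', 'max'),
    min_roc=('roc', 'min'),
    count=('roc', 'count'),
)
print(sector_df)
```

The result is indexed by date + industry, so no separate assignment of max/min/count columns is needed.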

In [78]:

gp = df.groupby(['date','industry']).agg([pd.Series.mean, pd.Series.max, pd.Series.min, pd.Series.count])
gp.loc[('2015-02-25','Health'),('roc','mean')]
Out[78]:
175.0
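The tuple-per-axis lookup can be sketched end to end as follows. This is an illustration under a few assumptions: the dates are parsed with pd.to_datetime (the question stores them as day-month-year strings), only the 'roc' column is aggregated, and the names health_mean and flat are introduced here for the example. It also shows one way to flatten the two-level column labels if plain column names are preferred.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    [['25-02-2015', 'Health', 'abc', 200],
     ['25-02-2015', 'Health', 'xyz', 150],
     ['25-02-2015', 'Mining', 'tyr', 45],
     ['25-02-2015', 'Mining', 'ujk', 70],
     ['26-02-2015', 'Health', 'abc', 60],
     ['26-02-2015', 'Health', 'xyz', 310],
     ['26-02-2015', 'Mining', 'tyr', 65],
     ['26-02-2015', 'Mining', 'ujk', 23]],
    columns=['date', 'industry', 'symbol', 'roc'])

# Assumption: parse the day-month-year strings into real timestamps.
df['date'] = pd.to_datetime(df['date'], format='%d-%m-%Y')

# The double brackets keep the two-level ('roc', <function>) column labels.
gp = df.groupby(['date', 'industry'])[['roc']].agg(['mean', 'max', 'min', 'count'])

# Row key: a (date, industry) tuple. Column key: a (column, function) tuple.
health_mean = gp.loc[(pd.Timestamp('2015-02-25'), 'Health'), ('roc', 'mean')]
print(health_mean)

# Optional: flatten the column labels to 'roc_mean', 'roc_max', ...
flat = gp.copy()
flat.columns = ['_'.join(col) for col in gp.columns]
print(flat)
```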

Answer 2

You can save the groupby part to a variable, as shown below:

import pandas as pd

primary_df = pd.DataFrame([['25-02-2015', 'Health', 'abc', 200],
                   ['25-02-2015', 'Health', 'xyz', 150],
                   ['25-02-2015', 'Mining',  'tyr', 45],
                   ['25-02-2015', 'Mining', 'ujk', 70], 
                   ['26-02-2015', 'Health', 'abc', 60],
                   ['26-02-2015', 'Health', 'xyz', 310],
                   ['26-02-2015', 'Mining',  'tyr', 65],
                   ['26-02-2015', 'Mining', 'ujk', 23]], 
                  columns='date industry symbol roc'.split())

grouped = primary_df.groupby(['date', 'industry'], sort=True)
sector_df = grouped.mean()
tmp_max_df = grouped.max()
tmp_min_df = grouped.min()
tmp_count_df = grouped.count()

sector_df['max_roc'] = tmp_max_df['roc']
sector_df['min_roc'] = tmp_min_df['roc']
sector_df['count'] = tmp_count_df['roc']
sector_df.reset_index(inplace=True)
sector_df.set_index(['date', 'industry'], inplace=True)
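A slightly tighter variant of the same idea, offered as a sketch rather than as the answer's own method: select the 'roc' column from the saved groupby object once, then assemble the result from its aggregations. Selecting 'roc' first also avoids averaging the non-numeric symbol column, which newer pandas releases may refuse to do. The name grouped_roc is introduced here for illustration.

```python
import pandas as pd

# Sample data copied from the question.
primary_df = pd.DataFrame(
    [['25-02-2015', 'Health', 'abc', 200],
     ['25-02-2015', 'Health', 'xyz', 150],
     ['25-02-2015', 'Mining', 'tyr', 45],
     ['25-02-2015', 'Mining', 'ujk', 70],
     ['26-02-2015', 'Health', 'abc', 60],
     ['26-02-2015', 'Health', 'xyz', 310],
     ['26-02-2015', 'Mining', 'tyr', 65],
     ['26-02-2015', 'Mining', 'ujk', 23]],
    columns='date industry symbol roc'.split())

# Build the grouping once and pick the numeric column to aggregate.
grouped_roc = primary_df.groupby(['date', 'industry'], sort=True)['roc']

# Every aggregation reuses the same grouping and shares the same
# (date, industry) index, so the Series line up as columns.
sector_df = pd.DataFrame({
    'roc': grouped_roc.mean(),
    'max_roc': grouped_roc.max(),
    'min_roc': grouped_roc.min(),
    'count': grouped_roc.count(),
})
print(sector_df)
```

Because the result is already indexed by date + industry, the reset_index / set_index round trip is not needed in this variant.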

Assigning the result of a pandas groupby
