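Calculate weighted average using pandas / dataframe

Time: 2014-10-05 18:36:05

Tags: python numpy pandas

I have the following table. I want to calculate a weighted average grouped by each date, based on the formula below. I can do this using some standard conventional code, but assuming this data is in a pandas DataFrame, is there an easier way to achieve this than through iteration?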

Date        ID      wt      value   w_avg
01/01/2012  100     0.50    60      0.791666667
01/01/2012  101     0.75    80
01/01/2012  102     1.00    100
01/02/2012  201     0.50    100     0.722222222
01/02/2012  202     1.00    80
  

01/01/2012 w_avg = 0.5 * (60 / sum(60, 80, 100)) + 0.75 * (80 / sum(60, 80, 100)) + 1.0 * (100 / sum(60, 80, 100))

01/02/2012 w_avg = 0.5 * (100 / sum(100, 80)) + 1.0 * (80 / sum(100, 80))
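
For reference, the formula above amounts to sum(wt * value) / sum(value) within each date. A minimal plain-Python sketch of that arithmetic, i.e. the kind of iterative code the question hopes to avoid (the `rows` list and variable names are illustrative, copied from the table):

```python
from collections import defaultdict

# Rows copied from the table above: (Date, ID, wt, value)
rows = [
    ("01/01/2012", 100, 0.50, 60),
    ("01/01/2012", 101, 0.75, 80),
    ("01/01/2012", 102, 1.00, 100),
    ("01/02/2012", 201, 0.50, 100),
    ("01/02/2012", 202, 1.00, 80),
]

weighted_sums = defaultdict(float)  # sum of wt * value per date
value_sums = defaultdict(float)     # sum of value per date
for date, _id, wt, value in rows:
    weighted_sums[date] += wt * value
    value_sums[date] += value

# Weighted average per date: sum(wt * value) / sum(value)
w_avg = {date: weighted_sums[date] / value_sums[date] for date in weighted_sums}
print(w_avg)
```

The two values should agree with the w_avg column in the table.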

5 answers:

Answer 0: (score: 19)

I think I would do this with two groupbys.

First, calculate each row's contribution to the "weighted average":

In [11]: g = df.groupby('Date')

In [12]: df.value / g.value.transform("sum") * df.wt
Out[12]:
0    0.125000
1    0.250000
2    0.416667
3    0.277778
4    0.444444
dtype: float64

If you set this as a column, you can group by over it:

In [13]: df['wa'] = df.value / g.value.transform("sum") * df.wt

Now the sum over this column is the desired result:

In [14]: g.wa.sum()
Out[14]:
Date
01/01/2012    0.791667
01/02/2012    0.722222
Name: wa, dtype: float64

Or, if you want the result repeated on every row:

In [15]: g.wa.transform("sum")
Out[15]:
0    0.791667
1    0.791667
2    0.791667
3    0.722222
4    0.722222
Name: wa, dtype: float64
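
The transcript above assumes `df` already exists. A self-contained sketch of the same steps, assuming the table from the question is loaded with `Date` as an ordinary column (the DataFrame construction is an assumption, not part of the answer):

```python
import pandas as pd

# Sample data copied from the question's table.
df = pd.DataFrame({
    "Date": ["01/01/2012"] * 3 + ["01/02/2012"] * 2,
    "ID": [100, 101, 102, 201, 202],
    "wt": [0.50, 0.75, 1.00, 0.50, 1.00],
    "value": [60, 80, 100, 100, 80],
})

g = df.groupby("Date")
# Each row's share of its group's total value, scaled by the row's weight.
df["wa"] = df["value"] / g["value"].transform("sum") * df["wt"]
# Summing those per-row contributions within each date gives the weighted average.
result = df.groupby("Date")["wa"].sum()
print(result)
```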

Answer 1: (score: 15)

Let's first create the example pandas DataFrame:

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: index = pd.Index(['01/01/2012','01/01/2012','01/01/2012','01/02/2012','01/02/2012'], name='Date')

In [4]: df = pd.DataFrame({'ID':[100,101,102,201,202],'wt':[.5,.75,1,.5,1],'value':[60,80,100,100,80]},index=index)

Then the average of 'wt', weighted by 'value' and grouped by the index, is obtained as:

In [5]: df.groupby(df.index).apply(lambda x: np.average(x.wt, weights=x.value))
Out[5]: 
Date
01/01/2012    0.791667
01/02/2012    0.722222
dtype: float64

Alternatively, the same computation can be wrapped in a function.
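
A function along those lines could be sketched as follows. The name `grouped_weighted_avg` and its parameters are illustrative assumptions; the idea is to divide the grouped sum of `values * weights` by the grouped sum of `weights`, which avoids calling a Python lambda once per group:

```python
import pandas as pd

def grouped_weighted_avg(values, weights, by):
    """Average of `values` weighted by `weights` within each group of `by`.

    Hypothetical helper: all three arguments are aligned, equal-length
    Series or Index objects.
    """
    return (values * weights).groupby(by).sum() / weights.groupby(by).sum()

index = pd.Index(
    ["01/01/2012", "01/01/2012", "01/01/2012", "01/02/2012", "01/02/2012"],
    name="Date",
)
df = pd.DataFrame(
    {"ID": [100, 101, 102, 201, 202],
     "wt": [0.5, 0.75, 1, 0.5, 1],
     "value": [60, 80, 100, 100, 80]},
    index=index,
)

# As in the question, 'wt' is averaged and 'value' supplies the weights.
result = grouped_weighted_avg(values=df.wt, weights=df.value, by=df.index)
print(result)
```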

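Answer 2: (score: 6)

I saved the table in a .csv file:

import numpy as np
import pandas as pd

df = pd.read_csv('book1.csv')

grouped = df.groupby('Date')
# Average of 'wt' weighted by 'value' within each group
g_wavg = lambda x: np.average(x.wt, weights=x.value)
grouped.apply(g_wavg)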

Answer 3: (score: 4)

I feel the approach from the following question is an elegant solution to this problem: Pandas DataFrame aggregate function using multiple columns
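
An aggregate that uses more than one column per group might look like the sketch below. This is not the answer's own code; `wavg` is a hypothetical helper, and the column names come from the question's table:

```python
import pandas as pd

# Sample data copied from the question's table.
df = pd.DataFrame({
    "Date": ["01/01/2012"] * 3 + ["01/02/2012"] * 2,
    "ID": [100, 101, 102, 201, 202],
    "wt": [0.50, 0.75, 1.00, 0.50, 1.00],
    "value": [60, 80, 100, 100, 80],
})

def wavg(group):
    """Weighted average of 'wt' using 'value' as the weights (hypothetical helper)."""
    return (group["wt"] * group["value"]).sum() / group["value"].sum()

# Select only the columns the function needs, then apply it once per date.
result = df.groupby("Date")[["wt", "value"]].apply(wavg)
print(result)
```

Because `apply` hands each group to the function as a DataFrame, the function can read any number of columns, which a plain column-wise `agg` cannot do.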

Answer 4: (score: 0)

If speed matters to you, vectorization is essential. So, based on the answer by Andy Hayden, here is a solution that uses only pandas native functions:

def weighted_mean(df, values, weights, groupby):
    df = df.copy()
    grouped = df.groupby(groupby)
    df['weighted_average'] = df[values] / grouped[weights].transform('sum') * df[weights]
    return grouped['weighted_average'].sum(min_count=1)  # min_count is required for Grouper objects

In comparison, using a custom lambda function takes less code but is slower:

import numpy as np
def weighted_mean_by_lambda(df, values, weights, groupby):
    return df.groupby(groupby).apply(lambda x: np.average(x[values], weights=x[weights]))

Speed test:

import time
import numpy as np
import pandas as pd

n = 100000000

df = pd.DataFrame({
    'values': np.random.uniform(0, 1, size=n), 
    'weights': np.random.randint(0, 5, size=n),
    'groupby': np.random.randint(0, 10000, size=n), 
})

time1 = time.time()
weighted_mean(df, 'values', 'weights', 'groupby')
print('Time for `weighted_mean`:', time.time() - time1)

time2 = time.time()
weighted_mean_by_lambda(df, 'values', 'weights', 'groupby')
print('Time for `weighted_mean_by_lambda`:', time.time() - time2)
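
Before timing the two approaches, it may be worth confirming that they agree. A small sketch under stated assumptions: the sample size, seed, and group count are arbitrary, weights are drawn from 1 to 4 so that no group has a zero weight sum, and the lambda variant selects the two needed columns explicitly before `apply`:

```python
import numpy as np
import pandas as pd

def weighted_mean(df, values, weights, groupby):
    # Vectorized: per-row contribution, then a grouped sum.
    df = df.copy()
    grouped = df.groupby(groupby)
    df["weighted_average"] = df[values] / grouped[weights].transform("sum") * df[weights]
    return grouped["weighted_average"].sum(min_count=1)

def weighted_mean_by_lambda(df, values, weights, groupby):
    # One Python-level np.average call per group.
    return df.groupby(groupby)[[values, weights]].apply(
        lambda x: np.average(x[values], weights=x[weights])
    )

rng = np.random.default_rng(0)
n = 10_000  # deliberately small so the check runs quickly
df = pd.DataFrame({
    "values": rng.uniform(0, 1, size=n),
    "weights": rng.integers(1, 5, size=n),   # 1..4, never zero
    "groupby": rng.integers(0, 100, size=n),
})

fast = weighted_mean(df, "values", "weights", "groupby")
slow = weighted_mean_by_lambda(df, "values", "weights", "groupby")
print(np.allclose(fast.to_numpy(), slow.to_numpy()))
```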