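I have the table below. I want to calculate a weighted average grouped by each date, based on the formula below. I can do this with some standard conventional code, but assuming the data is in a pandas DataFrame, is there an easier way to achieve this than iterating?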
Date        ID   wt    value  w_avg
01/01/2012  100  0.50  60     0.791666667
01/01/2012  101  0.75  80
01/01/2012  102  1.00  100
01/02/2012  201  0.50  100    0.722222222
01/02/2012  202  1.00  80
01/01/2012 w_avg = 0.5 * (60 / sum(60, 80, 100)) + 0.75 * (80 / sum(60, 80, 100)) + 1.0 * (100 / sum(60, 80, 100))
01/02/2012 w_avg = 0.5 * (100 / sum(100, 80)) + 1.0 * (80 / sum(100, 80))
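To make the formula concrete, here is a minimal sketch that loads the table into a DataFrame and checks the first date by hand. The variable names `first` and `w_avg_first` are illustrative assumptions, not part of the question.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "Date": ["01/01/2012"] * 3 + ["01/02/2012"] * 2,
    "ID": [100, 101, 102, 201, 202],
    "wt": [0.50, 0.75, 1.00, 0.50, 1.00],
    "value": [60, 80, 100, 100, 80],
})

# Hand check for one date: each wt is multiplied by that row's
# share of the date's total value, and the products are summed.
first = df[df["Date"] == "01/01/2012"]
w_avg_first = (first["wt"] * first["value"] / first["value"].sum()).sum()
print(round(w_avg_first, 9))
```

This is (30 + 60 + 100) / 240, which matches the w_avg shown in the table for 01/01/2012.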
Answer 0 (score: 19)
I think I would do this with two groupbys.
First, calculate each row's contribution to the "weighted average":
In [11]: g = df.groupby('Date')
In [12]: df.value / g.value.transform("sum") * df.wt
Out[12]:
0 0.125000
1 0.250000
2 0.416667
3 0.277778
4 0.444444
dtype: float64
If you set this as a column, you can group by it:
In [13]: df['wa'] = df.value / g.value.transform("sum") * df.wt
Now the sum of this column within each group is the desired result:
In [14]: g.wa.sum()
Out[14]:
Date
01/01/2012 0.791667
01/02/2012 0.722222
Name: wa, dtype: float64
Or, to broadcast the result back onto every row:
In [15]: g.wa.transform("sum")
Out[15]:
0 0.791667
1 0.791667
2 0.791667
3 0.722222
4 0.722222
Name: wa, dtype: float64
Answer 1 (score: 15)
Let's first create the example pandas DataFrame:
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: index = pd.Index(['01/01/2012','01/01/2012','01/01/2012','01/02/2012','01/02/2012'], name='Date')
In [4]: df = pd.DataFrame({'ID':[100,101,102,201,202],'wt':[.5,.75,1,.5,1],'value':[60,80,100,100,80]},index=index)
Then the average of 'wt', weighted by 'value' and grouped by the index, is obtained as:
In [5]: df.groupby(df.index).apply(lambda x: np.average(x.wt, weights=x.value))
Out[5]:
Date
01/01/2012 0.791667
01/02/2012 0.722222
dtype: float64
Alternatively, one can also define a function that performs the same computation.
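A function-based version of this computation might look like the sketch below. The helper name `grouped_weighted_avg` and its parameters are assumptions introduced for illustration. It divides the grouped sum of `values * weights` by the grouped sum of `weights`, which is the same quantity `np.average` returns per group.

```python
import pandas as pd

def grouped_weighted_avg(values, weights, by):
    """Weighted average of `values` within each group of `by` (illustrative helper)."""
    return (values * weights).groupby(by).sum() / weights.groupby(by).sum()

# Same sample data as above, indexed by date.
index = pd.Index(
    ["01/01/2012", "01/01/2012", "01/01/2012", "01/02/2012", "01/02/2012"],
    name="Date",
)
df = pd.DataFrame(
    {"ID": [100, 101, 102, 201, 202],
     "wt": [0.5, 0.75, 1, 0.5, 1],
     "value": [60, 80, 100, 100, 80]},
    index=index,
)

# Average of wt, weighted by value, per date.
result = grouped_weighted_avg(values=df.wt, weights=df.value, by=df.index)
print(result)
```

Because it uses only grouped sums, this form avoids calling a Python function once per group.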
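Answer 2 (score: 6)
I saved the table in a .csv file:
import numpy as np
import pandas as pd
df = pd.read_csv('book1.csv')
grouped = df.groupby('Date')
# Average of wt within each group, weighted by value
g_wavg = lambda x: np.average(x.wt, weights=x.value)
grouped.apply(g_wavg)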
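To try this approach without a file on disk, here is a sketch that reads the same table from an in-memory string. The CSV layout (a header row with Date, ID, wt, value) is an assumption about what 'book1.csv' contains. Selecting the two needed columns before `apply` keeps the grouping column out of the function's input.

```python
import io

import numpy as np
import pandas as pd

# In-memory stand-in for 'book1.csv'; the column layout is assumed.
csv_text = """Date,ID,wt,value
01/01/2012,100,0.50,60
01/01/2012,101,0.75,80
01/01/2012,102,1.00,100
01/02/2012,201,0.50,100
01/02/2012,202,1.00,80
"""

df = pd.read_csv(io.StringIO(csv_text))

# Per-date average of wt, weighted by value.
w_avg = df.groupby("Date")[["wt", "value"]].apply(
    lambda g: np.average(g["wt"], weights=g["value"])
)
print(w_avg)
```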
Answer 3 (score: 4)
I find the approach described in "Pandas DataFrame aggregate function using multiple columns" to be an elegant solution to this problem.
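As an illustration of an aggregate function that draws on several columns per group, here is a minimal sketch. It is not the code from the linked question: the helper name `summarize` and the extra `total_value` output are assumptions, included only to show that such a function can return more than one result per group.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["01/01/2012"] * 3 + ["01/02/2012"] * 2,
    "wt": [0.50, 0.75, 1.00, 0.50, 1.00],
    "value": [60, 80, 100, 100, 80],
})

def summarize(group):
    """Return several per-group aggregates at once (illustrative helper)."""
    total_value = group["value"].sum()
    return pd.Series({
        # Average of wt, weighted by value.
        "w_avg": (group["wt"] * group["value"]).sum() / total_value,
        "total_value": total_value,
    })

# One row per date, one column per aggregate.
summary = df.groupby("Date")[["wt", "value"]].apply(summarize)
print(summary)
```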
Answer 4 (score: 0)
If speed matters to you, vectorization is essential. So, building on the answer by Andy Hayden, here is a solution that uses only pandas native functions:
def weighted_mean(df, values, weights, groupby):
    df = df.copy()
    grouped = df.groupby(groupby)
    df['weighted_average'] = df[values] / grouped[weights].transform('sum') * df[weights]
    return grouped['weighted_average'].sum(min_count=1)  # min_count is required for Grouper objects
In comparison, using a custom lambda function takes less code but is slower:
import numpy as np
def weighted_mean_by_lambda(df, values, weights, groupby):
    return df.groupby(groupby).apply(lambda x: np.average(x[values], weights=x[weights]))
Speed test:
import time
import numpy as np
import pandas as pd
n = 100000000
df = pd.DataFrame({
    'values': np.random.uniform(0, 1, size=n),
    'weights': np.random.randint(0, 5, size=n),
    'groupby': np.random.randint(0, 10000, size=n),
})
time1 = time.time()
weighted_mean(df, 'values', 'weights', 'groupby')
print('Time for `weighted_mean`:', time.time() - time1)
time2 = time.time()
weighted_mean_by_lambda(df, 'values', 'weights', 'groupby')
print('Time for `weighted_mean_by_lambda`:', time.time() - time2)
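A lighter comparison that fits in modest memory can be sketched as follows. It is a restatement under assumptions, not the answer's benchmark. The data size, the seed, and the use of `timeit` are choices made here, and both functions are rewritten slightly: the helper column is created before grouping, and the needed columns are selected before `apply`. Weights are drawn from 1 to 4 so that no group's weights sum to zero.

```python
import timeit

import numpy as np
import pandas as pd

def weighted_mean(df, values, weights, groupby):
    # Vectorized: per-row contribution first, then one grouped sum.
    df = df.copy()
    total = df.groupby(groupby)[weights].transform("sum")
    df["weighted_average"] = df[values] / total * df[weights]
    return df.groupby(groupby)["weighted_average"].sum(min_count=1)

def weighted_mean_by_lambda(df, values, weights, groupby):
    # Calls np.average once per group through a Python lambda.
    return df.groupby(groupby)[[values, weights]].apply(
        lambda x: np.average(x[values], weights=x[weights])
    )

rng = np.random.default_rng(0)
n = 100_000  # far smaller than the original test, to keep memory use low
small = pd.DataFrame({
    "values": rng.uniform(0, 1, size=n),
    "weights": rng.integers(1, 5, size=n),
    "groupby": rng.integers(0, 100, size=n),
})

fast = weighted_mean(small, "values", "weights", "groupby")
slow = weighted_mean_by_lambda(small, "values", "weights", "groupby")

# Both approaches should agree up to floating-point error.
agree = np.allclose(fast.to_numpy(), slow.to_numpy())
print("Results agree:", agree)

t_fast = timeit.timeit(
    lambda: weighted_mean(small, "values", "weights", "groupby"), number=3)
t_slow = timeit.timeit(
    lambda: weighted_mean_by_lambda(small, "values", "weights", "groupby"), number=3)
print("weighted_mean:", t_fast, "weighted_mean_by_lambda:", t_slow)
```

Checking that the two results agree before timing them guards against comparing the speed of two functions that compute different things. The measured times depend on the machine and on the number of groups.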