Pandas cumulative weighted average

Time: 2018-03-30 11:46:29

Tags: python pandas numpy

df:
   val       wt  
1  100       2
2  300       3
3  200       5

Required df:

   val       wt  cum_wt_avg
1  100       2     100
2  300       3     220
3  200       5     210

Formula:

cum_wt_avg[i] = cum_sum(val * wt)[i] / cum_sum(wt)[i]
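As a sanity check on the formula, here is a minimal sketch in plain Python that reproduces the required column from the sample data. It only uses the standard library; the variable names are illustrative and not part of the question.

```python
from itertools import accumulate

val = [100, 300, 200]
wt = [2, 3, 5]

# Running sum of val * wt: 200, 1100, 2100
cum_weighted = list(accumulate(v * w for v, w in zip(val, wt)))
# Running sum of wt: 2, 5, 10
cum_wt = list(accumulate(wt))

# Element-wise ratio of the two running sums
cum_wt_avg = [s / w for s, w in zip(cum_weighted, cum_wt)]
print(cum_wt_avg)  # [100.0, 220.0, 210.0]
```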

Is there a simple way to do this in pandas or numpy? Something like this:

 df["cum_wt_avg"] = pd.cum_mean(value=df.val, weight=df.wt)

2 answers:

Answer 0: (score: 0)

I think it is best to avoid loops in pandas.

First multiply the two columns with mul, take the cumsum, and divide by the cumsum of column wt:

df["cum_wt_avg"] = df['val'].mul(df['wt']).cumsum().div(df['wt'].cumsum())
print (df)
   val  wt  cum_wt_avg
1  100   2       100.0
2  300   3       220.0
3  200   5       210.0

To improve performance, work on the underlying numpy arrays with numpy.cumsum:

import numpy as np

a = df['val'].values
b = df['wt'].values
df["cum_wt_avg"] = np.cumsum(a * b) / np.cumsum(b)
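The numpy version can be wrapped in a small reusable helper. The following is a sketch under assumptions, not part of the original answer: the function name cumulative_weighted_average is hypothetical, and the inputs are cast to float so that integer columns do not overflow on large products. If the running weight sum is zero at some position, the division yields nan or inf there.

```python
import numpy as np
import pandas as pd


def cumulative_weighted_average(values, weights):
    """Running weighted average: cumsum(values * weights) / cumsum(weights)."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return np.cumsum(values * weights) / np.cumsum(weights)


# Sample data from the question
df = pd.DataFrame({"val": [100, 300, 200], "wt": [2, 3, 5]}, index=[1, 2, 3])
df["cum_wt_avg"] = cumulative_weighted_average(df["val"].to_numpy(), df["wt"].to_numpy())
print(df)
```

Passing plain arrays rather than Series sidesteps index alignment, which is also where most of the speedup over the pure pandas chain comes from.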

Timings:

import numpy as np
import pandas as pd
from numba import jit

df = pd.concat([df]*1000)

# jpp solution (the res argument is unused; it is kept so the timing call below still works)
@jit(nopython=True)
def cum_wavg(arr, res):
    return np.cumsum(arr[:, 0] * arr[:, 1])/ np.cumsum(arr[:, 1])

def jez1(df):
    a = df['val'].values
    b = df['wt'].values
    return np.cumsum(a * b) / np.cumsum(b)

print (jez1(df))

In [184]: %timeit cum_wavg(df.values, res=np.zeros(len(df.index)))
65.5 µs ± 27.1 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [185]: %timeit df['val'].mul(df['wt']).cumsum().div(df['wt'].cumsum())
362 µs ± 6.26 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [186]: %timeit (jez1(df))
63.8 µs ± 491 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Answer 1: (score: 0)

Here is one way using numpy.

import numpy as np

def cum_wavg(arr):
    return [np.average(arr[:i+1, 0], weights=arr[:i+1, 1]) for i in range(arr.shape[0])]

df['cum_wavg'] = cum_wavg(df.values)
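Note that the list comprehension recomputes the weighted average over every prefix, so the total work grows quadratically with the number of rows. A short sketch, using the sample data from the question, that checks it agrees with the single-pass cumsum formulation:

```python
import numpy as np

# Columns: val, wt (sample data from the question)
arr = np.array([[100, 2], [300, 3], [200, 5]], dtype=float)

# Prefix-by-prefix version: one np.average call per row, O(n^2) work overall
by_prefix = [np.average(arr[:i + 1, 0], weights=arr[:i + 1, 1])
             for i in range(arr.shape[0])]

# Single-pass version: two cumulative sums, O(n) work
by_cumsum = np.cumsum(arr[:, 0] * arr[:, 1]) / np.cumsum(arr[:, 1])

print(np.allclose(by_prefix, by_cumsum))  # True
```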

For better performance, you can use numba:

import numpy as np
import pandas as pd
from numba import jit

df = pd.concat([df]*1000)

@jit(nopython=True)
def cum_wavg(arr, res):
    return np.cumsum(arr[:, 0] * arr[:, 1])/ np.cumsum(arr[:, 1])

%timeit cum_wavg(df.values, res=np.zeros(len(df.index)))           # 92.9 µs
%timeit df['val'].mul(df['wt']).cumsum().div(df['wt'].cumsum())    # 549 µs
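%timeit is an IPython magic, so the lines above will not run in a plain Python script. A rough equivalent with the standard timeit module might look like the sketch below; the helper names with_pandas and with_numpy are illustrative, numba is left out to keep the example dependency-free, and the measured numbers will differ by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({"val": [100, 300, 200], "wt": [2, 3, 5]})
big = pd.concat([df] * 1000, ignore_index=True)  # 3000 rows, as in the timings


def with_pandas(frame):
    return frame["val"].mul(frame["wt"]).cumsum().div(frame["wt"].cumsum())


def with_numpy(frame):
    a = frame["val"].to_numpy()
    b = frame["wt"].to_numpy()
    return np.cumsum(a * b) / np.cumsum(b)


for name, func in [("pandas", with_pandas), ("numpy", with_numpy)]:
    # Best of 5 repeats, 200 calls each, reported per call
    best = min(timeit.repeat(lambda: func(big), number=200, repeat=5)) / 200
    print(f"{name}: {best * 1e6:.1f} µs per call")
```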