Weighted average of rows in a pandas DataFrame

Time: 2016-06-13 13:28:33

Tags: python pandas

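I would like to make the following more efficient:

For data collected by "name", "date", "time" and another indicator variable "id", I want to compute the daily weighted average of the "value" column for each "id", using the "weights" column as the averaging weights. An example of the raw data is below: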

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["A", "A", "A", "A", "A", "A", "B", "B", "B", "B"],
    "date": ["06/24/2014", "06/24/2014", "06/24/2014", "06/24/2014", "06/25/2014",
             "06/25/2014", "06/25/2014", "06/24/2014", "06/24/2014", "06/25/2014"],
    "time": ['13:01:08', '13:46:53', '13:47:13', '13:49:11', '13:51:09',
             '14:35:03', '15:35:00', '16:17:26', '16:17:26', '16:17:26'],
    "id": ["B", "B", "S", "S", "S", "B", "S", "B", "S", "S"],
    "value": [100.0, 98.0, 102.0, 80.0, 10.0, 200.0, 99.5, 10.0, 9.8, 10.0],
    "weights": [20835000.0, 3960000.0, 3960000.0, 3955000.0, 3960000.0,
                5000000.0, 2000000.0, 6850.0, 162997.79999999999, 5000.0],
})

After applying the function, the data should have only the columns "name", "id" and "w_avg".

I wrote the following function using groupby:

df1 = df.groupby(['name','date','id'], as_index=False).apply(lambda x: np.average(x['value'], weights=x['weights'])).unstack()

The output I get from this is as follows:

id                        B          S
name date                             
A    06/24/2014   99.680581  91.006949
     06/25/2014  200.000000  10.000000
B    06/24/2014   10.000000   9.800000
     06/25/2014         NaN  99.276808
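
As a sanity check on the table above, the weighted average for a single group can be reproduced by hand. The sketch below is illustrative only: it copies the two rows for name "A", date "06/24/2014", id "B" from the sample data, and the variable names (`values`, `weights`, `manual`, `via_numpy`) are made up for this example.

```python
import numpy as np

# Rows for name "A", date "06/24/2014", id "B", copied from the sample data
values = np.array([100.0, 98.0])
weights = np.array([20835000.0, 3960000.0])

# A weighted average is sum(value * weight) / sum(weight)
manual = (values * weights).sum() / weights.sum()

# np.average with weights= computes the same quantity
via_numpy = np.average(values, weights=weights)

print(round(manual, 6))   # 99.680581, the A / 06/24/2014 / B cell in the table
print(np.isclose(manual, via_numpy))
```

The same formula is what makes a fully vectorized version possible: both the numerator and the denominator are plain grouped sums.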

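Now, for each "name" and "date", I want the difference between the id "B" and id "S" values as a new "diff" column.

To do that, I created a new DataFrame. To extract the index values, I did the following: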

name,date = zip(*list(df1.index.values))

df2 = pd.DataFrame({'name':name, 'date':date, 'B':list(df1['B']), 'S':list(df1['S'])})
df2['diff'] = df2['B'] - df2['S']

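Can you suggest a more concise way to do this? I also need it to be fast, since I am working with millions of rows. Is groupby the best approach?

Thanks,

1 Answer:

Answer 0 (score: 1):

I think you can use reset_index and then subtract: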

df3 = df1.reset_index()

df3['diff'] = df3['B'] - df3['S']
print (df3)

id name        date           B          S        diff
0     A  06/24/2014   99.680581  91.006949    8.673632
1     A  06/25/2014  200.000000  10.000000  190.000000
2     B  06/24/2014   10.000000   9.800000    0.200000
3     B  06/25/2014         NaN  99.276808         NaN
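
For readers who want to try the reset_index step in isolation, here is a minimal sketch. It assumes the unstacked result has the shape shown in the question, and rebuilds it by hand from the printed values rather than recomputing it; the names `wide` and `flat` are placeholders for this example only.

```python
import numpy as np
import pandas as pd

# Rebuild the unstacked result by hand (values copied from the question's output)
index = pd.MultiIndex.from_tuples(
    [("A", "06/24/2014"), ("A", "06/25/2014"), ("B", "06/24/2014"), ("B", "06/25/2014")],
    names=["name", "date"],
)
wide = pd.DataFrame(
    {"B": [99.680581, 200.0, 10.0, np.nan], "S": [91.006949, 10.0, 9.8, 99.276808]},
    index=index,
)
wide.columns.name = "id"

# reset_index moves the index levels "name" and "date" into ordinary columns
flat = wide.reset_index()

# Column arithmetic is element-wise; a NaN in either operand gives NaN in the result
flat["diff"] = flat["B"] - flat["S"]
print(flat)
```

This replaces the zip/list round trip from the question, since the index levels become columns directly.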

EDIT:

With len(df)=100k, your groupby/apply solution is slightly faster than dividing two grouped sums in a single expression:

df = pd.concat([df]*10000).reset_index(drop=True)

In [114]: %timeit (df.groupby(['name','date','id'], as_index=False).apply(lambda x: np.average(x.value, weights=x.weights)))
10 loops, best of 3: 34.6 ms per loop

In [115]: %timeit ((df.value * df.weights).groupby([df.name,df.date,df.id]).sum() /  df.weights.groupby([df.name,df.date,df.id]).sum())
10 loops, best of 3: 38.4 ms per loop    

But the fastest solution is to multiply first and reuse a single groupby object:

weighted = df.assign(value=df.value * df.weights)  # weighted copy; df itself is not overwritten
g = weighted.groupby(['name','date','id'])
print (g['value'].sum() / g['weights'].sum())

In [125]: %timeit (a(df))
10 loops, best of 3: 20 ms per loop
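
Before trusting the faster version, it may be worth confirming that it returns the same numbers as the groupby/apply version. The following is a sketch under stated assumptions: it uses only four rows from the question's sample data, and the names `keys`, `by_apply`, `weighted`, `vw` and `by_sums` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# A few rows taken from the question's sample data (name "A", one date)
df = pd.DataFrame({
    "name": ["A", "A", "A", "A"],
    "date": ["06/24/2014"] * 4,
    "id": ["B", "B", "S", "S"],
    "value": [100.0, 98.0, 102.0, 80.0],
    "weights": [20835000.0, 3960000.0, 3960000.0, 3955000.0],
})
keys = ["name", "date", "id"]

# Per-group np.average: one Python-level call per group
by_apply = df.groupby(keys)[["value", "weights"]].apply(
    lambda x: np.average(x["value"], weights=x["weights"])
)

# Vectorized: weight once into a helper column, then divide two grouped sums
weighted = df.assign(vw=df["value"] * df["weights"])
g = weighted.groupby(keys)
by_sums = g["vw"].sum() / g["weights"].sum()

# Both results are ordered by the sorted group keys, so the values line up
print(np.allclose(by_apply.to_numpy(), by_sums.to_numpy()))
```

Writing the product into a helper column (`vw`) instead of overwriting "value" keeps the input frame reusable for further checks.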

Test code:

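def a(df):
    # Weight the values on a copy so the caller's df['value'] is not overwritten
    # (otherwise every repeated call, e.g. under %timeit, multiplies by the weights again).
    df = df.assign(value=df.value * df.weights)
    g = df.groupby(['name','date','id'])
    return (g['value'].sum() / g['weights'].sum())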

print (a(df))   

EDIT1:

Comparing this solution with your original code:

In [132]: %timeit (orig(df5))
10 loops, best of 3: 37.4 ms per loop

In [133]: %timeit (a(df))
10 loops, best of 3: 22.7 ms per loop

Test code:

df = pd.concat([df]*10000).reset_index(drop=True)
df5 = df.copy()

def orig(df):

    df1 = df.groupby(['name','date','id'], as_index=False).apply(lambda x: np.average(x['value'], weights=x['weights'])).unstack()   
    name,date = zip(*list(df1.index.values))

    df2 = pd.DataFrame({'name':name, 'date':date, 'B':list(df1['B']), 'S':list(df1['S'])})
    df2['diff'] = df2['B'] - df2['S']
    df2 = df2[['name','date','B','S','diff']]
    return df2

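def a(df):
    # Weight the values on a copy so the caller's df['value'] is not overwritten
    df = df.assign(value=df.value * df.weights)
    g = df.groupby(['name','date','id'])
    # One weighted average per (name, date, id); unstack turns the id values into columns B and S
    df2 = (g['value'].sum() / g['weights'].sum()).unstack().reset_index()
    df2['diff'] = df2['B'] - df2['S']
    return df2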

print (orig(df5))    
print (a(df))
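
The %timeit lines above need IPython. If you want to repeat the comparison in a plain script, something like the following should work. It is only a sketch: the data is a four-row subset of the sample repeated 1000 times, the helper names `with_apply` and `with_sums` are made up here, and the timings it prints will depend on your machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Four rows from the question's sample data, repeated to get a larger frame
base = pd.DataFrame({
    "name": ["A", "A", "A", "A"],
    "date": ["06/24/2014"] * 4,
    "id": ["B", "B", "S", "S"],
    "value": [100.0, 98.0, 102.0, 80.0],
    "weights": [20835000.0, 3960000.0, 3960000.0, 3955000.0],
})
big = pd.concat([base] * 1000, ignore_index=True)
keys = ["name", "date", "id"]


def with_apply(df):
    # np.average called once per group
    return df.groupby(keys)[["value", "weights"]].apply(
        lambda x: np.average(x["value"], weights=x["weights"])
    )


def with_sums(df):
    # Weighted copy, then two grouped sums; the input frame is not modified
    g = df.assign(value=df["value"] * df["weights"]).groupby(keys)
    return g["value"].sum() / g["weights"].sum()


for fn in (with_apply, with_sums):
    # Best of 3 repeats, 5 calls each, reported per call
    best = min(timeit.repeat(lambda: fn(big), number=5, repeat=3)) / 5
    print(f"{fn.__name__}: {best * 1000:.2f} ms per call")
```

With only a handful of distinct groups, as here and in the benchmark above, the per-group overhead of apply is small; the gap may look different on data with many groups, so it is worth timing on a realistic sample.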