Weighted average with multiple weights and groups in Python

Date: 2019-10-11 09:49:33

Tags: python pandas dataframe

I'm a Python beginner trying to improve my code, so I'd appreciate suggestions on how to make the code below more efficient.

I have the following dataset:

petdata = {
    'animal' : ['dog', 'cat', 'fish'],
    'male_1' : [0.57, 0.72, 0.62],
    'female_1' : [0.43, 0.28, 0.38],
    'age_01_1': [0.10,0.16,0.15],
    'age_15_1':[0.17,0.29,0.26],
    'age_510_1':[0.15,0.19,0.19],
    'age_1015_1':[0.18,0.16,0.17],
    'age_1520_1':[0.20,0.11,0.12],
    'age_20+_1':[0.20,0.09,0.10],
    'male_2' : [0.57, 0.72, 0.62],
    'female_2' : [0.43, 0.28, 0.38],
    'age_01_2': [0.10,0.16,0.15],
    'age_15_2':[0.17,0.29,0.26],
    'age_510_2':[0.15,0.19,0.19],
    'age_1015_2':[0.18,0.16,0.17],
    'age_1520_2':[0.20,0.11,0.12],
    'age_20+_2':[0.20,0.09,0.10],
    'weight_1': [10,20,30],
    'weight_2':[40,50,60]
}

import numpy as np
import pandas as pd

df = pd.DataFrame(petdata)

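I want to compute the weighted average across the animals in the dataset, using weight_1 for every variable ending in "_1" and weight_2 for every variable ending in "_2".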
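In symbols, and assuming the intended quantity is the ordinary weighted mean over animals, this can be written as follows. The notation is introduced here only for illustration: i indexes the animals, c is a variable such as male, and g is the group suffix.

```latex
\bar{x}_{c,g} = \frac{\sum_{i} w_{g,i}\, x_{c,g,i}}{\sum_{i} w_{g,i}},
\qquad g \in \{1, 2\}
% Worked example for male_1 with weight_1 = (10, 20, 30):
% (10 \cdot 0.57 + 20 \cdot 0.72 + 30 \cdot 0.62) / 60 = 38.7 / 60 = 0.645
```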

This is how I currently do it:

df['male_wav_1']=np.nansum((df['male_1']*df['weight_1'])/df['weight_1'].sum())
df['female_wav_1']=np.nansum((df['female_1']*df['weight_1'])/df['weight_1'].sum())


df['male_wav_2']=np.nansum((df['male_2']*df['weight_2'])/df['weight_2'].sum())
df['female_wav_2']=np.nansum((df['female_2']*df['weight_2'])/df['weight_2'].sum())

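I repeat this for every column in my dataframe (i.e. age_01_1_wav, age_15_1_wav ...). I realise this is not very tidy, so could someone give me some advice on how to improve the process?

I have tried to:

  • reshape the data from wide to long
  • define a function for the weighted average

but I did not succeed with either. The reshaping itself is not the problem, I can do that, but it is not clear to me how to apply different weights to different groups in the data.

Thank you very much for your help.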

2 Answers:

Answer 0: (score: 1)

First, I assume the "animal" column is meant to be your index, so to make this look like a table I set it as the index:

import pandas as pd
import numpy as np
petdata = {
    # All of your data ^ above
}

df = pd.DataFrame(petdata)  # Creates the DF from your dictionary
df.set_index('animal',inplace=True) # Sets the 'animal' column as the index

I start by splitting your DataFrame into two parts, df_1 and df_2:

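# Uses a list comprehension to collect the column names that end with a given
# suffix, and uses that list to get a sub-DataFrame for each group.
# endswith() is used instead of a substring test because '_1' also occurs
# inside names such as 'age_15_2', and '_2' inside 'age_20+_1'.
# .copy() gives independent frames, so adding a row later does not warn.
df_1 = df[[name for name in df.columns if name.endswith('_1')]].copy()
df_2 = df[[name for name in df.columns if name.endswith('_2')]].copy()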
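One thing worth checking when selecting columns by name: a substring test such as '_1' in name also matches names like age_15_2, whereas str.endswith only matches the trailing group suffix. A minimal sketch, using a shortened, illustrative list of the column names from the question:

```python
# Shortened, illustrative subset of the question's column names.
columns = ['male_1', 'age_15_1', 'age_20+_1',
           'male_2', 'age_15_2', 'age_20+_2',
           'weight_1', 'weight_2']

# Substring test: matches '_1' anywhere in the name.
substring_1 = [c for c in columns if '_1' in c]

# Suffix test: matches only names that end with '_1'.
suffix_1 = [c for c in columns if c.endswith('_1')]

print(substring_1)
print(suffix_1)
```

The substring version picks up age_15_2 because "_15" contains "_1", so the suffix test is the safer choice for this naming scheme.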

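Rather than adding a new Series (column) to the DataFrame for each Series that already exists, I would add a single new row holding the weighted average (wav) of each column. Since the new row is not an animal it won't look as tidy, but the label "wav" will appear in the animal index.

Use list comprehensions and the equation you are already using to build two lists of weighted averages: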

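# Each group uses its own sub-DataFrame and its own weight column
# (df_1 has no 'weight_2' column, so the second line must use df_2).
wav_1 = [np.nansum(df_1[col]*df_1['weight_1'])/np.nansum(df_1['weight_1']) for col in df_1.columns]
wav_2 = [np.nansum(df_2[col]*df_2['weight_2'])/np.nansum(df_2['weight_2']) for col in df_2.columns]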

Then append this data to the two DataFrames under the new "wav" label:

df_1.loc['wav'] = wav_1
df_2.loc['wav'] = wav_2

Note that the cell at row "wav", column "weight_x" contains junk data: it is the weighted average of your weights themselves.
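One way to avoid that leftover weight cell, sketched here as an alternative rather than as this answer's own method, is to leave the weight column out of the selection and let pandas do the multiplication column-wise. The helper name weighted_means and the shortened dataset are assumptions made for illustration.

```python
import pandas as pd

# Shortened version of the question's data (assumption: same layout).
df = pd.DataFrame({
    'animal': ['dog', 'cat', 'fish'],
    'male_1': [0.57, 0.72, 0.62],
    'female_1': [0.43, 0.28, 0.38],
    'male_2': [0.57, 0.72, 0.62],
    'female_2': [0.43, 0.28, 0.38],
    'weight_1': [10, 20, 30],
    'weight_2': [40, 50, 60],
}).set_index('animal')


def weighted_means(frame, suffix):
    """Weighted mean of every non-weight column ending in `suffix`."""
    weight = frame[f'weight{suffix}']
    value_cols = [c for c in frame.columns
                  if c.endswith(suffix) and not c.startswith('weight')]
    # Multiply each value column by the weights row-wise, then sum per column.
    # DataFrame.sum skips NaN by default, similar to np.nansum.
    return frame[value_cols].mul(weight, axis=0).sum() / weight.sum()


# One Series holding the weighted mean of every value column in both groups.
wav = pd.concat([weighted_means(df, s) for s in ('_1', '_2')])
print(wav)
```

As in the original formula, the denominator is the sum of all weights, so a missing value lowers the result instead of being re-normalised away.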

Welcome to Python! Hope this helps.
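For the wide-to-long route mentioned in the question, a possible sketch is to melt the frame, split each column name at its last underscore into a variable and a group, and merge the weights back onto the values by animal and group. The names long_df, variable and group, and the shortened dataset, are assumptions introduced for illustration.

```python
import pandas as pd

# Shortened version of the question's data (assumption: same layout).
df = pd.DataFrame({
    'animal': ['dog', 'cat', 'fish'],
    'male_1': [0.57, 0.72, 0.62],
    'female_1': [0.43, 0.28, 0.38],
    'male_2': [0.57, 0.72, 0.62],
    'female_2': [0.43, 0.28, 0.38],
    'weight_1': [10, 20, 30],
    'weight_2': [40, 50, 60],
})

# Wide to long: one row per (animal, original column).
long_df = df.melt(id_vars='animal', var_name='column', value_name='value')

# 'male_1' -> ('male', '1'); splitting at the LAST underscore keeps
# names such as 'age_01_1' intact as ('age_01', '1').
long_df[['variable', 'group']] = long_df['column'].str.rsplit('_', n=1, expand=True)

# Separate the weight rows from the value rows.
is_weight = long_df['variable'] == 'weight'
weights = (long_df.loc[is_weight, ['animal', 'group', 'value']]
           .rename(columns={'value': 'weight'}))
values = long_df.loc[~is_weight, ['animal', 'variable', 'group', 'value']]

# Attach each group's weight to its values, then aggregate per variable/group.
merged = values.merge(weights, on=['animal', 'group'])
merged['weighted'] = merged['value'] * merged['weight']
sums = merged.groupby(['variable', 'group'])[['weighted', 'weight']].sum()
wav = sums['weighted'] / sums['weight']
print(wav)
```

The result is a Series indexed by (variable, group), so each group is automatically divided by the total of its own weight column.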

Answer 1: (score: 0)

You can use Python's zip() function to do some quick calculations.

petdata = {
    'animal' : ['dog', 'cat', 'fish'],
    'male_1' : [0.57, 0.72, 0.62],
    'age_20+_2':[0.20,0.09,0.10],
    'weight_1': [10,20,30],
    'weight_2':[40,50,60]
}
weight_1 = petdata.get('weight_1')
male_1 = petdata.get('male_1')
for sales, costs in zip(weight_1, male_1):
    profit =sales * costs / sales
    print(f'Total profit: {profit}')

Total profit: 0.57
Total profit: 0.72
Total profit: 0.62
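Note that the loop above multiplies and divides by the same weight, so it prints the original male_1 values rather than a weighted average. A minimal sketch of how zip() could instead give the weighted mean of one column, assuming the same formula as in the question (sum of weight times value, divided by the sum of the weights):

```python
weights = [10, 20, 30]          # weight_1 from the question
values = [0.57, 0.72, 0.62]     # male_1 from the question

# Pair each weight with its value, accumulate the products,
# then divide once by the total weight.
weighted_sum = sum(w * v for w, v in zip(weights, values))
weighted_mean = weighted_sum / sum(weights)

print(f'Weighted mean of male_1: {weighted_mean:.3f}')
```

This handles one column at a time; for many columns and two weight sets, the pandas approaches in the other answer scale better.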