Multiplying individual values in a DataFrame separately

Time: 2017-05-28 00:51:59

Tags: python pandas dataframe

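I'm trying to build a simple weighting tool for age and gender values. The script first looks at how gender is distributed, compares that with the desired distribution, and updates the weight column accordingly (for example, from 1 to 1.12). It then looks at how age is distributed (taking the newly assigned weights into account), compares that with the desired distribution, and updates the weight column again.

It works fine on the first pass but fails on the second. I know why, but I don't know how to fix it: it appears to pick up the first value it sees and apply it across the board, when I really need it to evaluate each cell individually.

Take this example: in the first pass, female cells get 0.98 and male cells get 1.02. It then moves on to the age weights and looks for age value 1. Suppose the first row it sees with age = 1 is female. It will then multiply the age = 1 weight by 0.98 for every cell with age = 1, even though males with age value 1 should really be multiplied by 1.02.

Below is a fully functional version of my script on a subset of the data. How can I get it to evaluate each cell individually?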

import pandas as pd

weightdict = {'Gender' : {1 : .49, 2 : .51}, 'Age' : {1 : 0.08, 2 : .27, 3 : .31, 4 : .34}}
weightframe = pd.DataFrame.from_dict(weightdict,orient='columns')
df = {'Gender': {0: 1, 1: 2, 2: 2, 3: 1, 4: 2, 5: 1, 6: 2, 7: 1, 8: 1, 9: 2},
      'Age': {0: 3, 1: 3, 2: 2, 3: 1, 4: 4, 5: 2, 6: 3, 7: 4, 8: 4, 9: 3}}
df = pd.DataFrame.from_dict(df,orient='columns')
df.loc[:,'Weight'] = 1 #add a dummy weight column

def getaverage(column):
    average = df.groupby(column)['Weight'].sum()/df['Weight'].sum() #find distribution within dataset for each value
    average = weightframe[column].div(average) #find what % the value is still under/overrepresented
    average = average.reset_index()
    average = average.rename(columns={'index' : 'variable',0 : column})
    return average

def multiply(x):
    value = df.loc[df[column]==x,column].iloc[0] #get value from table to evaluate
    weight = df.loc[df[column]==value,'Weight'].iloc[0] #get the value's currently assigned weight
    newvalue = average.loc[average['variable']==value, column].iloc[0] #get the value's degree of over/underrepresentation
    return newvalue*weight #multiply the new weight by the old weight

for column in list(df)[0:2]:
    average = getaverage(column) #get set of averages for column
    df['Weight'] = df[column].apply((lambda x : multiply(x)))
    print(df)

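1 answer:

Answer 0 (score: 0)

I'm having trouble understanding exactly what your final output should contain, but I think I understand enough to offer some help.

Starting from your df and weights, I get a table that looks like this (I've left out the dummy weight column for now):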

import pandas as pd

weightdict = {
    'Gender': {1: 0.49, 2: 0.51},
    'Age': {1: 0.08, 2: 0.27, 3: 0.31, 4: 0.34},
}
weightframe = pd.DataFrame.from_dict(weightdict, orient='columns')

df = {
    'Gender': {0: 1, 1: 2, 2: 2, 3: 1, 4: 2, 5: 1, 6: 2, 7: 1, 8: 1, 9: 2},
    'Age': {0: 3, 1: 3, 2: 2, 3: 1, 4: 4, 5: 2, 6: 3, 7: 4, 8: 4, 9: 3},
}
df = pd.DataFrame.from_dict(df, orient='columns')

df

Out[]: 
   Age  Gender
0    3       1
1    3       2
2    2       2
3    1       1
4    4       2
5    2       1
6    3       2
7    4       1
8    4       1
9    3       2

You can then loop over the Age and Gender columns and create a new column for each one holding its weight:

for col in ['Age', 'Gender']:
    # create a dict (printed for clarity) with the calculated weight 
    # for each entry to feed to the following `map` statements
    dist = {idx: count/len(df) for idx, count in df[col].value_counts().items()}
    print(col)
    print(dist)
    df[col + 'Weight'] = df[col].map(weightdict[col]) / df[col].map(dist)

df

-----
Age
{3: 0.40000000000000002, 4: 0.29999999999999999, 
 2: 0.20000000000000001, 1: 0.10000000000000001}
Gender
{2: 0.5, 1: 0.5} 

The resulting df looks like this:

Out[]: 
   Age  Gender  AgeWeight  GenderWeight
0    3       1   0.775000          0.98
1    3       2   0.775000          1.02
2    2       2   1.350000          1.02
3    1       1   0.800000          0.98
4    4       2   1.133333          1.02
5    2       1   1.350000          0.98
6    3       2   0.775000          1.02
7    4       1   1.133333          0.98
8    4       1   1.133333          0.98
9    3       2   0.775000          1.02
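
As a possible variation on the loop above, pandas can compute the observed shares directly with `value_counts(normalize=True)`, and a Series can be passed to `map` in place of a dict. The following is a minimal sketch that rebuilds the same sample data; the names `observed`, `target` and `ratio` are illustrative and not part of the original answer.

```python
import pandas as pd

weightdict = {
    'Gender': {1: 0.49, 2: 0.51},
    'Age': {1: 0.08, 2: 0.27, 3: 0.31, 4: 0.34},
}
df = pd.DataFrame({
    'Gender': [1, 2, 2, 1, 2, 1, 2, 1, 1, 2],
    'Age': [3, 3, 2, 1, 4, 2, 3, 4, 4, 3],
})

for col in ['Age', 'Gender']:
    # Share of each category value actually present in the data
    observed = df[col].value_counts(normalize=True)
    # Desired share of each category value
    target = pd.Series(weightdict[col])
    # Division aligns on the category value (the index of both Series)
    ratio = target / observed
    # Look up each row's own category, so every row gets its own factor
    df[col + 'Weight'] = df[col].map(ratio)

print(df)
```

Because `map` looks up every row by its own category value, no row inherits a factor from some other row.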

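You can then perform the calculation across each row. I'm not sure this is the actual calculation you want, but you should be able to adapt the approach.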

df['Weighted'] = df.AgeWeight * df.GenderWeight

df

Out[16]: 
   Age  Gender  AgeWeight  GenderWeight  Weighted
0    3       1   0.775000          0.98  0.759500
1    3       2   0.775000          1.02  0.790500
2    2       2   1.350000          1.02  1.377000
3    1       1   0.800000          0.98  0.784000
4    4       2   1.133333          1.02  1.156000
5    2       1   1.350000          0.98  1.323000
6    3       2   0.775000          1.02  0.790500
7    4       1   1.133333          0.98  1.110667
8    4       1   1.133333          0.98  1.110667
9    3       2   0.775000          1.02  0.790500
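
The question describes a sequential adjustment, where the age step should take the gender-adjusted weights into account. The two weight columns above are computed independently instead, so here is a minimal sketch of the sequential variant, assuming that is the intended calculation. The key point is that `map` scales each row's own current weight, rather than taking the first matching row's weight with `.iloc[0]`. The names `targets`, `current` and `factor` are illustrative.

```python
import pandas as pd

targets = {
    'Gender': {1: 0.49, 2: 0.51},
    'Age': {1: 0.08, 2: 0.27, 3: 0.31, 4: 0.34},
}
df = pd.DataFrame({
    'Gender': [1, 2, 2, 1, 2, 1, 2, 1, 1, 2],
    'Age': [3, 3, 2, 1, 4, 2, 3, 4, 4, 3],
})
df['Weight'] = 1.0  # every row starts with a neutral weight

for col in ['Gender', 'Age']:
    # Weighted share of each category under the weights assigned so far
    current = df.groupby(col)['Weight'].sum() / df['Weight'].sum()
    # How far each category is from its target share
    factor = pd.Series(targets[col]) / current
    # Each row keeps its own weight and is scaled by its category's factor
    df['Weight'] = df['Weight'] * df[col].map(factor)

# Weighted age shares after the final step
age_share = df.groupby('Age')['Weight'].sum() / df['Weight'].sum()
print(df)
print(age_share)
```

After a single pass the weighted age shares match their targets, while the gender shares may drift slightly. Repeating the loop a few times (an approach often called raking or iterative proportional fitting) should bring both closer.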

Clean up df:

del df['AgeWeight']
del df['GenderWeight']

df

Out[]: 
   Age  Gender  Weighted
0    3       1  0.759500
1    3       2  0.790500
2    2       2  1.377000
3    1       1  0.784000
4    4       2  1.156000
5    2       1  1.323000
6    3       2  0.790500
7    4       1  1.110667
8    4       1  1.110667
9    3       2  0.790500
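
To judge whether a weighting scheme does what is wanted, one option is to compare the weighted share of each category against its target. The following is a minimal sketch that rebuilds the `Weighted` column from the sample data; `weighted_share` is a hypothetical helper, not a pandas function.

```python
import pandas as pd

targets = {
    'Gender': {1: 0.49, 2: 0.51},
    'Age': {1: 0.08, 2: 0.27, 3: 0.31, 4: 0.34},
}
df = pd.DataFrame({
    'Gender': [1, 2, 2, 1, 2, 1, 2, 1, 1, 2],
    'Age': [3, 3, 2, 1, 4, 2, 3, 4, 4, 3],
})

# Rebuild the product of the independent per-column weights
df['Weighted'] = 1.0
for col in ['Age', 'Gender']:
    observed = df[col].value_counts(normalize=True)
    df['Weighted'] = df['Weighted'] * df[col].map(pd.Series(targets[col]) / observed)


def weighted_share(frame, col, weight_col):
    """Hypothetical helper: share of the total weight held by each category."""
    totals = frame.groupby(col)[weight_col].sum()
    return totals / totals.sum()


for col in ['Age', 'Gender']:
    # Building the frame from two Series aligns them on the category value
    check = pd.DataFrame({
        'target': pd.Series(targets[col]),
        'achieved': weighted_share(df, col, 'Weighted'),
    })
    print(col)
    print(check)
```

If the achieved shares differ noticeably from the targets, the simple product of per-column weights may not be the calculation you want.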