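I'm trying to build a simple weighting tool for age and gender values. The script first looks at how gender is distributed, compares that with the desired distribution, and updates the weight column accordingly (for example, from 1 to 1.12). It then looks at how age is distributed (taking the newly assigned weights into account), compares that with the desired distribution, and updates the weight column again.
It works fine in the first round, but it fails in the second step. I know why, but I don't know how to fix it. It seems to pick up the first value it sees and apply it across the board, whereas I really need it to evaluate each cell individually.
Take this example: in the first round, female cells get 0.98 and male cells get 1.02. Then it computes the age weights and looks for age value 1. Suppose the first row it sees with age = 1 is female. Then, for every cell with age = 1, it multiplies the age = 1 weight by 0.98, even though males with age value 1 should really be multiplied by 1.02.
Below is a complete working version of my script on a subset of the data. How can I get it to evaluate each cell individually?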
import pandas as pd

weightdict = {'Gender' : {1 : .49, 2 : .51}, 'Age' : {1 : 0.08, 2 : .27, 3 : .31, 4 : .34}}
weightframe = pd.DataFrame.from_dict(weightdict,orient='columns')
df = {'Gender': {0: 1, 1: 2, 2: 2, 3: 1, 4: 2, 5: 1, 6: 2, 7: 1, 8: 1, 9: 2},
      'Age': {0: 3, 1: 3, 2: 2, 3: 1, 4: 4, 5: 2, 6: 3, 7: 4, 8: 4, 9: 3}}
df = pd.DataFrame.from_dict(df,orient='columns')
df.loc[:,'Weight'] = 1 #add a dummy weight column

def getaverage(column):
    average = df.groupby(column)['Weight'].sum()/df['Weight'].sum() #find distribution within dataset for each value
    average = weightframe[column].div(average) #find what % the value is still under/overrepresented
    average = average.reset_index()
    average = average.rename(columns={'index' : 'variable',0 : column})
    return average

def multiply(x):
    value = df.loc[df[column]==x,column].iloc[0] #get value from table to evaluate
    weight = df.loc[df[column]==value,'Weight'].iloc[0] #get the value's currently assigned weight (first matching row only)
    newvalue = average.loc[average['variable']==value, column].iloc[0] #get the value's degree of over/underrepresentation
    return newvalue*weight #multiply the new weight by the old weight

for column in list(df)[0:2]:
    average = getaverage(column) #get set of averages for column
    df['Weight'] = df[column].apply((lambda x : multiply(x)))
print(df)
Answer 0 (score: 0)
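I'm having trouble working out exactly what your final output should contain, but I think I understand enough to offer some help.
Starting from your df and weights, I get a table that looks like this (I've left out the dummy weight column for now):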
weightdict = {
'Gender': {1: 0.49, 2: 0.51},
'Age': {1: 0.08, 2: 0.27, 3 : .31, 4 : .34}
}
weightframe = pd.DataFrame.from_dict(weightdict,orient='columns')
df = {
'Gender': {0: 1, 1: 2, 2: 2, 3: 1, 4: 2, 5: 1, 6: 2, 7: 1, 8: 1, 9: 2},
'Age': {0: 3, 1: 3, 2: 2, 3: 1, 4: 4, 5: 2, 6: 3, 7: 4, 8: 4, 9: 3}
}
df = pd.DataFrame.from_dict(df,orient='columns')
df
Out[]:
Age Gender
0 3 1
1 3 2
2 2 2
3 1 1
4 4 2
5 2 1
6 3 2
7 4 1
8 4 1
9 3 2
You can then loop over the Age and Gender columns and create a new column holding the weight for each:
for col in ['Age', 'Gender']:
    # create a dict (printed for clarity) with the observed share
    # of each value to feed to the following `map` statements
    dist = {idx: count/len(df) for idx, count in df[col].value_counts().items()}
    print(col)
    print(dist)
    # target share divided by observed share, looked up row by row
    df[col + 'Weight'] = df[col].map(weightdict[col]) / df[col].map(dist)
df
-----
Age
{3: 0.40000000000000002, 4: 0.29999999999999999,
2: 0.20000000000000001, 1: 0.10000000000000001}
Gender
{2: 0.5, 1: 0.5}
The resulting df looks like this:
Out[]:
Age Gender AgeWeight GenderWeight
0 3 1 0.775000 0.98
1 3 2 0.775000 1.02
2 2 2 1.350000 1.02
3 1 1 0.800000 0.98
4 4 2 1.133333 1.02
5 2 1 1.350000 0.98
6 3 2 0.775000 1.02
7 4 1 1.133333 0.98
8 4 1 1.133333 0.98
9 3 2 0.775000 1.02
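As a quick sanity check on the table above, the weighted share of each value should come out equal to its target. The following is only a sketch: it rebuilds the same df from lists, uses `value_counts(normalize=True)` in place of the dict comprehension, and the names `gender_share` and `age_share` are introduced here for illustration.

```python
import pandas as pd

weightdict = {
    'Gender': {1: 0.49, 2: 0.51},
    'Age': {1: 0.08, 2: 0.27, 3: 0.31, 4: 0.34},
}
df = pd.DataFrame({
    'Gender': [1, 2, 2, 1, 2, 1, 2, 1, 1, 2],
    'Age': [3, 3, 2, 1, 4, 2, 3, 4, 4, 3],
})

for col in ['Age', 'Gender']:
    # observed share of each value, as a Series indexed by the value
    dist = df[col].value_counts(normalize=True)
    # target share / observed share, looked up per row
    df[col + 'Weight'] = df[col].map(weightdict[col]) / df[col].map(dist)

# weighted share of a value = sum of its weights / sum of all weights
gender_share = df.groupby('Gender')['GenderWeight'].sum() / df['GenderWeight'].sum()
age_share = df.groupby('Age')['AgeWeight'].sum() / df['AgeWeight'].sum()

print(gender_share.round(4).to_dict())
print(age_share.round(4).to_dict())
```

Both printed dicts should match the corresponding entries of `weightdict`, which confirms that each weight column on its own reproduces its target distribution.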
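You can then do the calculation across each row. I'm not sure this is the exact calculation you want, but you should be able to adapt the approach.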
df['Weighted'] = df.AgeWeight * df.GenderWeight
df
Out[16]:
Age Gender AgeWeight GenderWeight Weighted
0 3 1 0.775000 0.98 0.759500
1 3 2 0.775000 1.02 0.790500
2 2 2 1.350000 1.02 1.377000
3 1 1 0.800000 0.98 0.784000
4 4 2 1.133333 1.02 1.156000
5 2 1 1.350000 0.98 1.323000
6 3 2 0.775000 1.02 0.790500
7 4 1 1.133333 0.98 1.110667
8 4 1 1.133333 0.98 1.110667
9 3 2 0.775000 1.02 0.790500
Clean up the df:
del df['AgeWeight']
del df['GenderWeight']
df
Out[12]:
Age Gender Weighted
0 3 1 0.759500
1 3 2 0.790500
2 2 2 1.377000
3 1 1 0.784000
4 4 2 1.156000
5 2 1 1.323000
6 3 2 0.790500
7 4 1 1.110667
8 4 1 1.110667
9 3 2 0.790500
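If you do want the second step to take the weights from the first step into account, as the question describes, one way to adapt this is to keep a single `Weight` column and update it row by row. This is a sketch under assumptions rather than a definitive implementation: it swaps the `value_counts` lookup for `groupby(...).transform('sum')`, which returns each group's current weight total aligned to every row, and the names `current` and `target` are illustrative.

```python
import pandas as pd

weightdict = {
    'Gender': {1: 0.49, 2: 0.51},
    'Age': {1: 0.08, 2: 0.27, 3: 0.31, 4: 0.34},
}
df = pd.DataFrame({
    'Gender': [1, 2, 2, 1, 2, 1, 2, 1, 1, 2],
    'Age': [3, 3, 2, 1, 4, 2, 3, 4, 4, 3],
})
df['Weight'] = 1.0  # dummy starting weight

for col in ['Gender', 'Age']:
    # current weighted share of each row's category, one value per row
    current = df.groupby(col)['Weight'].transform('sum') / df['Weight'].sum()
    # desired share of each row's category, one value per row
    target = df[col].map(weightdict[col])
    # every row keeps its own previous weight and is scaled individually
    df['Weight'] = df['Weight'] * target / current

print(df)
```

Because each row multiplies its own existing weight, a man and a woman of the same age end up with different final weights. Note that adjusting for Age can pull the Gender shares slightly away from their targets again; repeating the loop a few times (often called raking or iterative proportional fitting) is the usual way to bring both closer at once.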