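I have a pandas DataFrame with two categorical variables (city and colour in my example), a column of percentages, and a column of weights. I want to build a crosstab of city and colour that shows, for each combination of the two, the weighted average of perc.
I have managed to do it with the code below: I first create a column holding weight x perc, then build one crosstab of the sum of (weight x perc) and another of the sum of the weights, and finally divide the first by the second.
It works, but is there a faster or more elegant way? Thanks!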
import pandas as pd
import numpy as np
# Build a small reproducible example frame
np.random.seed(123)
df = pd.DataFrame()
myrows = 10
df['weight'] = np.random.rand(myrows) * 100
np.random.seed(321)
df['perc'] = np.random.rand(myrows)
df['weight x perc'] = df['weight'] * df['perc']
df['colour'] = np.where(df['perc'] < 0.5, 'red', 'yellow')
np.random.seed(555)
df['city'] = np.where(np.random.rand(myrows) < 0.5, 'NY', 'LA')
# Numerator: sum of weight x perc per cell (margins=True adds the "All" totals)
num = pd.crosstab(df['city'], df['colour'], values=df['weight x perc'], aggfunc='sum', margins=True)
# Denominator: sum of weights per cell
den = pd.crosstab(df['city'], df['colour'], values=df['weight'], aggfunc='sum', margins=True)
# Weighted average = numerator / denominator
out = num / den
print(out)
Answer 0 (score: 3)
This uses groupby with apply() and NumPy's weighted average function, np.average.
df.groupby(['colour','city']).apply(lambda x: np.average(x.perc, weights=x.weight)).unstack(level=0)
This gives:
colour       red    yellow
city
LA      0.173870  0.865636
NY      0.077912  0.687400
However, this does not give me the totals (the "All" margins).
The totals can be produced with:
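Since the question also asks about speed, here is a minimal sketch of the same cell values computed without apply(). It relies on the fact that a weighted mean is sum(weight * perc) / sum(weight), so two grouped sums and one division are enough. The temporary column name `wp` and the variables `by_sums` and `by_apply` are names chosen for this illustration only.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question (same seeds).
np.random.seed(123)
df = pd.DataFrame()
myrows = 10
df['weight'] = np.random.rand(myrows) * 100
np.random.seed(321)
df['perc'] = np.random.rand(myrows)
df['colour'] = np.where(df['perc'] < 0.5, 'red', 'yellow')
np.random.seed(555)
df['city'] = np.where(np.random.rand(myrows) < 0.5, 'NY', 'LA')

# Weighted mean per group from plain sums: sum(weight * perc) / sum(weight).
# 'wp' is a temporary helper column (a name assumed for this sketch).
sums = (df.assign(wp=df['weight'] * df['perc'])
          .groupby(['city', 'colour'])[['wp', 'weight']]
          .sum())
by_sums = (sums['wp'] / sums['weight']).unstack('colour')

# The apply()/np.average version, for comparison.
by_apply = (df.groupby(['city', 'colour'])[['perc', 'weight']]
              .apply(lambda g: np.average(g['perc'], weights=g['weight']))
              .unstack('colour'))

print(by_sums)
print(np.allclose(by_sums.values, by_apply.values, equal_nan=True))
```

On a frame this small the difference is negligible; the sum-based form mainly pays off when there are many groups, because it avoids calling a Python function once per group.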
df.groupby(['colour']).apply(lambda x: np.average(x.perc, weights=x.weight))
df.groupby(['city']).apply(lambda x: np.average(x.perc, weights=x.weight))
Of course, these are still not combined into a single frame.
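One way the pieces could be combined into a single frame is sketched below. `weighted_crosstab` and its parameters are hypothetical names made up for this example, not a pandas API; the sketch assumes every group has a non-zero total weight, since np.average raises an error otherwise.

```python
import numpy as np
import pandas as pd

def weighted_crosstab(data, row, col, value, weight, margin_name='All'):
    """Hypothetical helper: crosstab of weighted means with row/column totals."""
    def wavg(g):
        return np.average(g[value], weights=g[weight])

    cols = [value, weight]
    # Cell values: one weighted mean per (row, col) combination.
    table = data.groupby([row, col])[cols].apply(wavg).unstack(col)
    # Right-hand margin: weighted mean per row category.
    table[margin_name] = data.groupby(row)[cols].apply(wavg)
    # Bottom margin: weighted mean per column category, plus the grand total.
    bottom = data.groupby(col)[cols].apply(wavg)
    bottom[margin_name] = wavg(data)
    table.loc[margin_name] = bottom
    return table

# Rebuild the example frame from the question (same seeds).
np.random.seed(123)
df = pd.DataFrame()
myrows = 10
df['weight'] = np.random.rand(myrows) * 100
np.random.seed(321)
df['perc'] = np.random.rand(myrows)
df['colour'] = np.where(df['perc'] < 0.5, 'red', 'yellow')
np.random.seed(555)
df['city'] = np.where(np.random.rand(myrows) < 0.5, 'NY', 'LA')

out = weighted_crosstab(df, 'city', 'colour', 'perc', 'weight')
print(out)
```

The layout is meant to mirror what the question's num/den crosstabs produce with margins=True, with the margin label passed in as `margin_name`.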
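Answer 1 (score: 1)
I had the same problem, and I have now found a solution. I am a beginner at programming; my code runs, but it could be improved. I hope it helps someone. Thanks, everyone, for your help.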
import pandas as pd
import numpy as np
np.random.seed(123)
df=pd.DataFrame()
myrows=10
df['weight'] = np.random.rand(myrows)*100
np.random.seed(321)
df['perc']=np.random.rand(myrows)
#df['weight x perc']=df['weight']*df['perc']
df['colour']=np.where( df['perc']<0.5, 'red','yellow')
np.random.seed(555)
df['city']=np.where( np.random.rand(myrows) <0.5,'NY','LA' )
df.head()
And building the table:
grouped = df.groupby(['colour','city'])
ci = df.groupby('city')
co =df.groupby('colour')
def wavg(group):
    d = group['perc']
    w = group['weight']
    return (d * w).sum() / w.sum()
# Add one column holding each city's weighted average across colours
result = pd.concat([grouped.apply(wavg).unstack(level=0), ci.apply(wavg).rename('weighted average')], axis=1)
# unstack() leaves city as the index; move it into a regular column
result.index.name = 'city'
result.reset_index(inplace=True)
# Add a bottom row with each colour's weighted average and the overall weighted average.
# co.ndim is the dimensionality of the grouped frame (2), not the number of colours,
# so take one value per colour directly; they come in the same sorted order as the unstacked columns.
result.loc['WAvg'] = ['weighted average'] + list(co.apply(wavg).round(1)) + [((df['perc'] * df['weight']).sum() / df['weight'].sum()).round(1)]
result
See the result: