我是数据科学的新手,我想计算每个标题中的男女人数。
我尝试了以下代码:
'''
newdf = pd.DataFrame()
newdf[ 'Title' ] = full[ 'Name' ].map( lambda name: name.split( ',' )
[1].split( '.' )[0].strip() )
newdf['Age'] = full['Age']
newdf['Sex'] = full['Sex']
newdf.dropna(axis = 0,inplace=True)
print(newdf.head())
我得到的是:
Title Age Sex
0 Mr 22.0 male
1 Mrs 38.0 female
2 Miss 26.0 female
3 Mrs 35.0 female
4 Mr 35.0 male
然后我正在尝试添加#male,#female列
df = pd.DataFrame()
df = newdf[['Age','Title']].groupby('Title').mean().sort_values(by='Age',ascending=False)
df['#People'] = newdf['Title'].value_counts()
df['Male'] = newdf['Title'].sum(newdf['Sex']=='male')
df['Female'] = newdf['Title'].sum(newdf['Sex']=='female')
我收到的错误消息: TypeError:“系列”对象是可变的,因此无法进行散列
我希望有四列:标题,年龄(平均),#人,#male,#female。所以我想知道这些#people中有多少男性和女性
P.S没有这些行:
df['Male'] = newdf['Title'].sum(newdf['Sex']=='male')
df['Female'] = newdf['Title'].sum(newdf['Sex']=='female')
一切正常,我得到:
Age #People
Title
Capt 70.000000 1
Col 54.000000 4
Sir 49.000000 1
Major 48.500000 2
Lady 48.000000 1
Dr 43.571429 7
....
但没有#male,#female。
答案 0 :(得分:1)
将GroupBy.agg
与mean
一起用于汇总size
,对于新列,请在crosstab
的前面加上DataFrame.join
:
df1 = (df.groupby('Title')['Age']
.agg([('Age','mean'),('#People','size')])
.sort_values(by='Age',ascending=False))
df2 = pd.crosstab(df['Title'], df['Sex']).add_suffix('_avg')
df = df1.join(df2)
print (df)
Age #People female_avg male_avg
Title
Mrs 36.5 2 2 0
Mr 28.5 2 0 2
Miss 26.0 1 1 0