I have a data set, available here, that loads as a DataFrame like this:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user', sep='|')
df.head()
user_id age gender occupation zip_code
1 24 M technician 85711
2 53 F other 94043
3 23 M writer 32067
4 24 M technician 43537
5 33 F other 15213
What I want to find out is the ratio of males to females within each occupation.
I have used the expression below, but it is not the best approach.
df.groupby(['occupation', 'gender']).agg({'gender':'count'}).div(df.groupby('occupation').agg('count'), level='occupation')['gender']*100
This gives a result like:
occupation gender
administrator F 45.569620
M 54.430380
artist F 46.428571
M 53.571429
The format of the result above is quite different from what I want, which is something like this (demo):
occupation M:F
programmer 2:3
farmer 7:2
Can someone tell me how to create my own aggregation function?
Answer 0 (score: 1)
Actually, pandas has a built-in value_counts(normalize=True) for computing normalized value counts. You can then massage the numbers:
new_df = (df.groupby('occupation')['gender']
.value_counts(normalize=True) # this gives normalized counts: 0.45
.unstack('gender', fill_value=0)
.mul(100) # convert the share to a percentage
.round() # round to the nearest whole percent
.astype(int) # drop the trailing .0 (safe after rounding; truncating 0.29 * 100 would give 28)
.astype(str) # turn to string
)
new_df['F:M'] = new_df['F'] + ':' + new_df['M']
new_df.head()
Output:
gender F M F:M
occupation
administrator 46 54 46:54
artist 46 54 46:54
doctor 0 100 0:100
educator 27 73 27:73
engineer 3 97 3:97
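As a possible alternative, the same percentage table can be sketched with pd.crosstab and normalize="index". This swaps in a different technique from the groupby/value_counts chain above, and the small in-memory sample below is hypothetical, used only so the snippet runs without downloading the file.

```python
import pandas as pd

# Hypothetical sample with the same column names as the u.user data set.
sample = pd.DataFrame({
    "occupation": ["technician", "technician", "technician", "technician",
                   "other", "other", "doctor"],
    "gender": ["M", "M", "M", "F", "F", "M", "M"],
})

# Row-normalised cross-tabulation: each occupation's shares sum to 1.
shares = pd.crosstab(sample["occupation"], sample["gender"], normalize="index")

# Scale to percent, round to whole numbers, then convert to int.
percent = shares.mul(100).round().astype(int)

# Make sure both columns exist even if one gender never appears in the data.
percent = percent.reindex(columns=["F", "M"], fill_value=0)

# Build the "F:M" label from the two integer columns.
percent["F:M"] = percent["F"].astype(str) + ":" + percent["M"].astype(str)
print(percent)
```

On the real data, replace sample with the df loaded from the CSV; occupations with no women, such as doctor, show up as 0:100.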
Answer 1 (score: 0)
Does this work for you?
from fractions import Fraction
df_g = df.groupby(['occupation', 'gender']).count().user_id/df.groupby(['occupation']).count().user_id
df_g = df_g.reset_index()
df_g['ratio'] = df_g['user_id'].apply(lambda x: str(Fraction(x).limit_denominator()).replace('/',':'))
Output:
occupation gender user_id ratio
0 administrator F 0.455696 36:79
1 administrator M 0.544304 43:79
2 artist F 0.464286 13:28
3 artist M 0.535714 15:28
4 doctor M 1.000000 1
5 educator F 0.273684 26:95
6 educator M 0.726316 69:95
7 engineer F 0.029851 2:67
8 engineer M 0.970149 65:67
9 entertainment F 0.111111 1:9
10 entertainment M 0.888889 8:9
11 executive F 0.093750 3:32
12 executive M 0.906250 29:32
13 healthcare F 0.687500 11:16
14 healthcare M 0.312500 5:16
15 homemaker F 0.857143 6:7
16 homemaker M 0.142857 1:7
17 lawyer F 0.166667 1:6
18 lawyer M 0.833333 5:6
19 librarian F 0.568627 29:51
20 librarian M 0.431373 22:51
21 marketing F 0.384615 5:13
22 marketing M 0.615385 8:13
23 none F 0.444444 4:9
24 none M 0.555556 5:9
25 other F 0.342857 12:35
26 other M 0.657143 23:35
27 programmer F 0.090909 1:11
28 programmer M 0.909091 10:11
29 retired F 0.071429 1:14
30 retired M 0.928571 13:14
31 salesman F 0.250000 1:4
32 salesman M 0.750000 3:4
33 scientist F 0.096774 3:31
34 scientist M 0.903226 28:31
35 student F 0.306122 15:49
36 student M 0.693878 34:49
37 technician F 0.037037 1:27
38 technician M 0.962963 26:27
39 writer F 0.422222 19:45
40 writer M 0.577778 26:45
Answer 2 (score: 0)
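It is actually quite easy. Each group produced by groupby is a DataFrame (a subset of the initial DataFrame), so you can use apply with your own function to process that sub-DataFrame. You can add a print statement inside compute_gender_ratio to see what df is.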
import pandas as pd
data = pd.read_csv(
    'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user',
    sep='|')

def compute_gender_ratio(df):
    gender_count = df['gender'].value_counts()
    return f"{gender_count.get('M', 0)}:{gender_count.get('F', 0)}"

result = data.groupby('occupation').apply(compute_gender_ratio)
result_df = result.to_frame(name='M:F')
result_df
which gives:
M:F
occupation
administrator 43:36
artist 15:13
doctor 7:0
educator 69:26
engineer 65:2
entertainment 16:2
executive 29:3
healthcare 5:11
homemaker 1:6
lawyer 10:2
librarian 22:29
marketing 16:10
none 5:4
other 69:36
programmer 60:6
retired 13:1
salesman 9:3
scientist 28:3
student 136:60
technician 26:1
writer 26:19
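The counts above are not reduced to lowest terms, whereas the format requested in the question (for example 2:3) looks like a simplified ratio. A minimal sketch of a custom aggregation function that divides both counts by their greatest common divisor follows. The helper name reduced_gender_ratio and the in-memory sample are hypothetical, chosen only for illustration.

```python
from math import gcd

import pandas as pd

# Hypothetical sample; on the real data, use the DataFrame loaded from the CSV.
sample = pd.DataFrame({
    "occupation": ["programmer"] * 5 + ["farmer"] * 9 + ["doctor"] * 3,
    "gender": ["M", "M", "F", "F", "F"] + ["M"] * 7 + ["F"] * 2 + ["M"] * 3,
})


def reduced_gender_ratio(genders: pd.Series) -> str:
    """Return one group's M:F ratio in lowest terms (hypothetical helper)."""
    males = int((genders == "M").sum())
    females = int((genders == "F").sum())
    # gcd(0, 0) is 0, so fall back to 1 to avoid dividing by zero.
    divisor = gcd(males, females) or 1
    return f"{males // divisor}:{females // divisor}"


# agg calls the function once per occupation with that group's gender column.
ratios = (
    sample.groupby("occupation")["gender"]
    .agg(reduced_gender_ratio)
    .rename("M:F")
)
print(ratios)
```

Selecting the gender column before aggregating means the function receives a Series rather than the whole sub-DataFrame, so it only has to count values. A group with no women reduces to 1:0 here; whether that is the desired display is a judgment call.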