Pandas: how to apply a function to each subgroup

Time: 2017-11-12 17:44:50

Tags: python pandas pandas-groupby pandas-apply

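I have a simple dataframe with columns for nationality, occupation, and age. The nationalities European, American, and Asian are coded as 0, 1, and 2.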

For each occupation, I want to find the percentage of each nationality. For example: 67% of doctors are European and 33% are Asian.

import pandas as pd
import numpy as np
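# create the dataframe: random nationality codes (0-2) and random ages (24-69)
nationality = np.random.randint(low=0, high=3, size=(10, 1))
age = np.random.randint(low=24, high=70, size=(10, 1))
df = pd.DataFrame(np.concatenate((nationality, age), axis=1))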
df.columns=['nationality','age']
df['occupation']=['teacher']*2+['engineer']*3+['doctor']*3+['lawyer']*2


  nationality   age occupation
0   0   65  teacher
1   0   31  teacher
2   0   30  engineer
3   2   63  engineer
4   0   28  engineer
5   1   27  doctor
6   0   52  doctor
7   0   60  doctor
8   0   33  lawyer
9   0   38  lawyer

df.groupby(['occupation','nationality']).count()

def iseuropean(x):
    if x==0:
        return 1
    else:
        return 0
def isamerican(x):
    if x==1:
        return 1
    else:
        return 0
def isasian(x):
    if x==2:
        return 1
    else:
        return 0

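Using groupby I can get the counts, but I want to apply a function to each occupation group that determines the percentages. However, I can't figure out how.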

Any help is greatly appreciated.

1 Answer:

Answer 0: (score: 2)

I think you're looking for the percentage of each nationality within each occupation:

In [11]: c = df.groupby(['occupation','nationality'])["age"].count().rename("count")

In [12]: c
Out[12]:
occupation  nationality
doctor      0              2
            1              1
engineer    0              2
            2              1
lawyer      0              2
teacher     0              2
Name: count, dtype: int64

In [13]: c / c.sum()  # proportion of each, maybe not very useful
Out[13]:
occupation  nationality
doctor      0              0.2
            1              0.1
engineer    0              0.2
            2              0.1
lawyer      0              0.2
teacher     0              0.2
Name: count, dtype: float64

In [14]: c / c.groupby(level=0).sum()  # proportion within each occupation
Out[14]:
occupation  nationality
doctor      0              0.666667
            1              0.333333
engineer    0              0.666667
            2              0.333333
lawyer      0              1.000000
teacher     0              1.000000
Name: count, dtype: float64
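
Since the question asks specifically about applying a function to each occupation group, here is a minimal sketch of that route. It assumes a fixed copy of the sample data shown in the question (the original is random), and the helper name nationality_share is made up for illustration. It should give the same proportions as In [14]:

```python
import pandas as pd

# Fixed copy of the sample rows printed in the question (assumption: the
# original frame is random, so the values are hard-coded here).
df = pd.DataFrame({
    "nationality": [0, 0, 0, 2, 0, 1, 0, 0, 0, 0],
    "age": [65, 31, 30, 63, 28, 27, 52, 60, 33, 38],
    "occupation": ["teacher"] * 2 + ["engineer"] * 3 + ["doctor"] * 3 + ["lawyer"] * 2,
})


def nationality_share(codes):
    """Fraction of each nationality code within a single occupation group."""
    return codes.value_counts(normalize=True)


# apply() calls nationality_share once per occupation group and stacks
# the per-group results under an (occupation, nationality) index.
shares = df.groupby("occupation")["nationality"].apply(nationality_share)

# The same proportions without a custom function.
shortcut = df.groupby("occupation")["nationality"].value_counts(normalize=True)

print(shares)
```

The apply version is the more general pattern, because any function that takes a group's values can be dropped in. For plain proportions the value_counts(normalize=True) shortcut is probably enough.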

As an aside, you may want to use categorical codes rather than the is_XXX functions:

In [21]: pd.Categorical.from_codes(df.nationality, ["european", "american", "asian"])
Out[21]:
[european, european, european, asian, european, american, european, european, european, european]
Categories (3, object): [european, american, asian]

In [22]: df.nationality = pd.Categorical.from_codes(df.nationality, ["european", "american", "asian"])

In [23]: df
Out[23]:
  nationality  age occupation
0    european   65    teacher
1    european   31    teacher
2    european   30   engineer
3       asian   63   engineer
4    european   28   engineer
5    american   27     doctor
6    european   52     doctor
7    european   60     doctor
8    european   33     lawyer
9    european   38     lawyer
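
With labelled nationalities, the percentages can also be laid out as one row per occupation. The following is a sketch under the same assumption of a hard-coded copy of the sample data. It uses pd.crosstab with normalize="index", which is a different technique from the groupby division above, and the variable name percent is only illustrative:

```python
import pandas as pd

# Fixed copy of the sample rows printed in the question (assumption).
df = pd.DataFrame({
    "nationality": [0, 0, 0, 2, 0, 1, 0, 0, 0, 0],
    "age": [65, 31, 30, 63, 28, 27, 52, 60, 33, 38],
    "occupation": ["teacher"] * 2 + ["engineer"] * 3 + ["doctor"] * 3 + ["lawyer"] * 2,
})

# Replace the integer codes 0, 1, 2 with readable labels, as in In [22].
labels = ["european", "american", "asian"]
df["nationality"] = pd.Categorical.from_codes(df["nationality"], categories=labels)

# normalize="index" divides each row by its row total, so every occupation
# sums to 1; multiplying by 100 turns the fractions into percentages.
percent = pd.crosstab(df["occupation"], df["nationality"], normalize="index") * 100

print(percent.round(1))
```

Each row of percent then reads directly as a statement of the kind the question asks for, for example the share of doctors who are European.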