I have a data frame like this:
gender  doctor name
female  A
male    B
male    A
female  C
female  B
How can I count the gender values for each doctor name?
Answer 0: (score: 3)
I think you need groupby with aggregation by GroupBy.size:
#store the result under a new name so the original df stays available
counts = df.groupby(['doctor name','gender']).size()
print (counts)
doctor name  gender
A            female    1
             male      1
B            female    1
             male      1
C            female    1
dtype: int64
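For anyone who wants to reproduce the result above, here is a minimal self-contained sketch. It rebuilds the frame from the question and also flattens the result into ordinary columns; the column name `count` is an arbitrary choice for this sketch, not something from the question.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'gender': ['female', 'male', 'male', 'female', 'female'],
    'doctor name': ['A', 'B', 'A', 'C', 'B'],
})

# Count the rows for every (doctor name, gender) pair that occurs.
counts = df.groupby(['doctor name', 'gender']).size()

# Optional: turn the MultiIndex into regular columns.
# 'count' is a hypothetical column name chosen for this sketch.
flat = counts.reset_index(name='count')
print(flat)
```

The flat form is often easier to merge or export than the MultiIndex Series.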
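The output is the same as that of SeriesGroupBy.value_counts, except that value_counts also sorts the counts within each group in descending order: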
#again use a new name instead of overwriting df
counts = df.groupby(['doctor name']).gender.value_counts()
print (counts)
doctor name  gender
A            female    1
             male      1
B            female    1
             male      1
C            female    1
Name: gender, dtype: int64
Timings:
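With the sample data every count is 1, so the ordering difference between the two approaches is not visible. The sketch below adds one hypothetical extra row (male, A), which is not in the question, purely to make the difference show up.

```python
import pandas as pd

# Rows from the question, plus one hypothetical extra row (male, A)
# added only so that the counts inside group A differ.
df = pd.DataFrame({
    'gender': ['female', 'male', 'male', 'female', 'female', 'male'],
    'doctor name': ['A', 'B', 'A', 'C', 'B', 'A'],
})

by_size = df.groupby(['doctor name', 'gender']).size()
by_value_counts = df.groupby('doctor name')['gender'].value_counts()

# Both approaches give the same count for every (doctor name, gender) pair.
same_counts = by_size.to_dict() == by_value_counts.to_dict()

# The order inside each doctor's group differs: size follows the sorted
# group keys, value_counts puts the largest count first.
order_size = list(by_size.loc['A'].index)
order_value_counts = list(by_value_counts.loc['A'].index)

print(same_counts)
print(order_size)
print(order_value_counts)
```

If a fixed order matters, calling sort_index() on either result should make them line up.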
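import numpy as np
import pandas as pd

#[1000000 rows x 2 columns]
np.random.seed(123)
N = 1000000
L1 = ['e', 'f','g','h', 'i', 'j']
L2 = ['female','male']
#note: the letters in L1 fill 'gender' and L2 fills 'doctor name',
#which is why the outputs below show female/male under doctor name
df = pd.DataFrame({'gender':np.random.choice(L1, N),
                   'doctor name': np.random.choice(L2, N)})
#print (df)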
In [43]: %timeit (df.groupby(['doctor name','gender']).size())
10 loops, best of 3: 141 ms per loop
In [44]: %timeit (df.groupby(['doctor name']).gender.value_counts())
1 loop, best of 3: 254 ms per loop
#output is sorted by index
print (df.groupby(['doctor name','gender']).size())
doctor name  gender
female       e         82944
             f         83422
             g         83706
             h         83200
             i         83004
             j         83521
male         e         83405
             f         83503
             g         82891
             h         83666
             i         83525
             j         83213
dtype: int64
#same counts, but sorted by count within each group
print (df.groupby(['doctor name']).gender.value_counts())
doctor name  gender
female       g         83706
             j         83521
             f         83422
             h         83200
             i         83004
             e         82944
male         h         83666
             i         83525
             f         83503
             e         83405
             j         83213
             g         82891
Name: gender, dtype: int64
If you need a crosstab-style result, the fastest solution is to add unstack:
df1 = df.groupby(['doctor name','gender']).size().unstack(level=0, fill_value=0)
print (df1)
doctor name  A  B  C
gender
female       1  1  1
male         1  1  0
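The role of fill_value=0 may not be obvious from the output alone. As a small sketch on the question's data: doctor C has no male rows, so that pair is missing from the grouped counts, and unstack has to put something in the empty cell.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'gender': ['female', 'male', 'male', 'female', 'female'],
    'doctor name': ['A', 'B', 'A', 'C', 'B'],
})

counts = df.groupby(['doctor name', 'gender']).size()

# Without fill_value the missing (C, male) pair becomes NaN.
without_fill = counts.unstack(level=0)

# With fill_value=0 the missing pair is reported as a count of zero.
with_fill = counts.unstack(level=0, fill_value=0)

print(without_fill)
print(with_fill)
```

The filled table holds the same numbers that pd.crosstab(df['gender'], df['doctor name']) produces for this data.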
df2 = df.groupby(['doctor name','gender']).size().unstack(fill_value=0)
print (df2)
gender       female  male
doctor name
A                 1     1
B                 1     1
C                 1     0
Timings2:
#used same df as in timings
In [64]: %timeit (df.groupby(['doctor name','gender']).size().unstack(level=0, fill_value=0))
10 loops, best of 3: 141 ms per loop
In [65]: %timeit (pd.crosstab(df.gender, df['doctor name']))
1 loop, best of 3: 215 ms per loop
In [66]: %timeit (pd.pivot_table(df, index='gender', columns='doctor name', aggfunc=len, fill_value=0))
1 loop, best of 3: 251 ms per loop
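The %timeit lines above only work inside IPython. Outside it, a rough comparison can be sketched with the standard timeit module. Everything specific here is an assumption of this sketch rather than part of the answer: the smaller row count, the doctor labels A to F, and the names `candidates` and `timings`. Absolute numbers will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data; N is smaller than the 1,000,000 rows used above
# so that the script finishes quickly.
rng = np.random.default_rng(123)
N = 100_000
df = pd.DataFrame({
    'gender': rng.choice(['female', 'male'], N),
    'doctor name': rng.choice(list('ABCDEF'), N),
})

# Two of the approaches timed in the answer, wrapped as callables.
candidates = {
    'groupby + size + unstack': lambda: (
        df.groupby(['doctor name', 'gender']).size()
          .unstack(level=0, fill_value=0)
    ),
    'crosstab': lambda: pd.crosstab(df['gender'], df['doctor name']),
}

# Best of 3 repeats, 3 calls per repeat, reported as seconds per call.
timings = {
    name: min(timeit.repeat(fn, number=3, repeat=3)) / 3
    for name, fn in candidates.items()
}

for name, seconds in timings.items():
    print(f'{name}: {seconds * 1000:.1f} ms per call')
```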
Answer 1: (score: 3)
Besides groupby, if you are looking for a crosstab-style DataFrame view, there are two straightforward ways to get it.
Using pd.crosstab
In [52]: pd.crosstab(df.gender, df.doctor)
Out[52]:
doctor  A  B  C
gender
female  1  1  1
male    1  1  0
Using pd.pivot_table
In [53]: pd.pivot_table(df, index='gender', columns='doctor', aggfunc=len, fill_value=0)
Out[53]:
doctor  A  B  C
gender
female  1  1  1
male    1  1  0
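The pivot_table call above passes no values column, so it relies on how pandas handles a frame that contains only the two key columns. A variant that makes the counting explicit is sketched below. The helper column `n` is hypothetical and introduced only for this example, and the column is named `doctor` as in this answer rather than `doctor name` as in the question.

```python
import pandas as pd

# Sample data from the question, with the column named 'doctor'
# to match the calls in this answer.
df = pd.DataFrame({
    'gender': ['female', 'male', 'male', 'female', 'female'],
    'doctor': ['A', 'B', 'A', 'C', 'B'],
})

# Hypothetical helper column: one unit per row, so summing it counts rows.
table = (
    df.assign(n=1)
      .pivot_table(index='gender', columns='doctor',
                   values='n', aggfunc='sum', fill_value=0)
)

# pd.crosstab should give the same numbers for this data.
reference = pd.crosstab(df['gender'], df['doctor'])
matches = table.values.tolist() == reference.values.tolist()

print(table)
print(matches)
```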