I have a pandas DataFrame like this:
+------+--------+-----+-------+
| Team | Gender | Age | Name |
+------+--------+-----+-------+
| A | M | 22 | Sam |
| A | F | 25 | Annie |
| B | M | 33 | Fred |
| B | M | 18 | James |
| A | M | 56 | Alan |
| B | F | 28 | Julie |
| A | M | 33 | Greg |
+------+--------+-----+-------+
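What I want to do is first group by Team and Gender (I can do that with df.groupby(['Team'], as_index=False)). Is there a way to sort the members of each group by age and add extra columns showing how many members are above any given member and how many are below?
For example, for team 'A':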
+------+--------+-----+-------+---------+---------+---------+---------+
| Team | Gender | Age | Name | M_Above | M_Below | F_Above | F_Below |
+------+--------+-----+-------+---------+---------+---------+---------+
| A | M | 22 | Sam | 0 | 2 | 0 | 1 |
| A | F | 25 | Annie | 1 | 2 | 0 | 0 |
| A | M | 33 | Greg | 1 | 1 | 1 | 0 |
| A | M | 56 | Alan | 2 | 0 | 1 | 0 |
+------+--------+-----+-------+---------+---------+---------+---------+
Answer 0 (score: 2)
import pandas as pd

df = pd.DataFrame({'Team': ['A', 'A', 'B', 'B', 'A', 'B', 'A'],
                   'Gender': ['M', 'F', 'M', 'M', 'M', 'F', 'M'],
                   'Age': [22, 25, 33, 18, 56, 28, 33],
                   'Name': ['Sam', 'Annie', 'Fred', 'James', 'Alan', 'Julie', 'Greg']}).sort_values(['Team', 'Age'])

for team, data in df.groupby('Team'):
    # count by label rather than by position, so the totals stay correct
    # even when females outnumber males in a team
    m_tot = (data['Gender'] == 'M').sum()  # number of males in current team
    f_tot = (data['Gender'] == 'F').sum()  # ditto (females)
    m_seen = 0  # males seen so far for current team
    f_seen = 0  # ditto (females)
    for idx, row in data.iterrows():
        (M_Above, M_Below, F_Above, F_Below) = (m_seen, m_tot - m_seen, f_seen, f_tot - f_seen)
        if row.Gender == 'M':
            m_seen += 1
            M_Below -= 1  # do not count the current row itself
        else:
            f_seen += 1
            F_Below -= 1  # do not count the current row itself
        df.loc[idx, 'M_Above'] = M_Above
        df.loc[idx, 'M_Below'] = M_Below
        df.loc[idx, 'F_Above'] = F_Above
        df.loc[idx, 'F_Below'] = F_Below
This results in:
   Age Gender Team  M_Above  M_Below  F_Above  F_Below
0   22      M    A      0.0      2.0      0.0      1.0
1   25      F    A      1.0      2.0      0.0      0.0
6   33      M    A      1.0      1.0      1.0      0.0
4   56      M    A      2.0      0.0      1.0      0.0
3   18      M    B      0.0      1.0      0.0      1.0
5   28      F    B      1.0      1.0      0.0      0.0
2   33      M    B      1.0      0.0      1.0      0.0
If you want the new columns to be int (as in the example), use:
for new_col in ['M_Above', 'M_Below', 'F_Above', 'F_Below']:
    df[new_col] = df[new_col].astype(int)
Which results in:
   Age Gender   Name Team  M_Above  M_Below  F_Above  F_Below
0   22      M    Sam    A        0        2        0        1
1   25      F  Annie    A        1        2        0        0
6   33      M   Greg    A        1        1        1        0
4   56      M   Alan    A        2        0        1        0
3   18      M  James    B        0        1        0        1
5   28      F  Julie    B        1        1        0        0
2   33      M   Fred    B        1        0        1        0
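The per-column cast above can also be written as a single call. This is a minimal sketch, assuming the four count columns already exist and contain no missing values; the two-row frame here is made up purely for illustration.

```python
import pandas as pd

# Hypothetical two-row frame standing in for the result of the loop above.
df = pd.DataFrame({
    'Name': ['Sam', 'Annie'],
    'M_Above': [0.0, 1.0],
    'M_Below': [2.0, 2.0],
    'F_Above': [0.0, 0.0],
    'F_Below': [1.0, 0.0],
})

new_cols = ['M_Above', 'M_Below', 'F_Above', 'F_Below']

# astype accepts a {column: dtype} mapping, so only the listed columns change.
df = df.astype({col: int for col in new_cols})
print(df.dtypes)
```

Columns not named in the mapping (Name here) keep their original dtype.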
Edit (runtime comparison):
Note that this solution is faster than the accepted solution, which used ix. Averaged over 1000 iterations, it was about 6 times faster in my runs (the sample timings below work out closer to 4x), which can matter for larger DataFrames. Run this to check (ix has since been removed from pandas, so the accepted solution is reproduced here with loc):
import pandas as pd
from time import time
import numpy as np

cols = ['M_Above', 'M_Below', 'F_Above', 'F_Below']

def f(x):
    # accepted solution: count genders above and below each row of one team
    for i, d in x.iterrows():
        above = x.loc[:i, 'Gender'].drop(i).value_counts().reindex(['M', 'F'])
        below = x.loc[i:, 'Gender'].drop(i).value_counts().reindex(['M', 'F'])
        x.loc[i, 'M_Above'] = above.loc['M']
        x.loc[i, 'M_Below'] = below.loc['M']
        x.loc[i, 'F_Above'] = above.loc['F']
        x.loc[i, 'F_Below'] = below.loc['F']
    return x

df = pd.DataFrame({'Team': ['A', 'A', 'B', 'B', 'A', 'B', 'A'],
                   'Gender': ['M', 'F', 'M', 'M', 'M', 'F', 'M'],
                   'Age': [22, 25, 33, 18, 56, 28, 33],
                   'Name': ['Sam', 'Annie', 'Fred', 'James', 'Alan', 'Julie', 'Greg']}).sort_values(['Team', 'Age'])

times = []   # alternative solution
times2 = []  # accepted solution

for i in range(1000):
    tic = time()
    for team, data in df.groupby('Team'):
        m_tot = (data['Gender'] == 'M').sum()  # number of males in current team
        f_tot = (data['Gender'] == 'F').sum()  # ditto (females)
        m_seen = 0  # males seen so far for current team
        f_seen = 0  # ditto (females)
        for idx, row in data.iterrows():
            (M_Above, M_Below, F_Above, F_Below) = (m_seen, m_tot - m_seen, f_seen, f_tot - f_seen)
            if row.Gender == 'M':
                m_seen += 1
                M_Below -= 1
            else:
                f_seen += 1
                F_Below -= 1
            df.loc[idx, 'M_Above'] = M_Above
            df.loc[idx, 'M_Below'] = M_Below
            df.loc[idx, 'F_Above'] = F_Above
            df.loc[idx, 'F_Below'] = F_Below
    toc = time()
    times.append(toc - tic)

for i in range(1000):
    tic = time()
    # group_keys=False keeps the original row index on recent pandas versions
    df1 = df.groupby('Team', sort=False, group_keys=False).apply(f).fillna(0)
    df1 = df1.astype(dict.fromkeys(cols, int))  # cast float to int
    toc = time()
    times2.append(toc - tic)

print(np.mean(times))
print(np.mean(times2))
Results:
0.0163134906292 # alternative solution
0.0622982912064 # accepted solution
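Since run time is the concern here, the same counts can also be computed without iterating over rows at all. The following is only a sketch of one possible vectorized approach (it is not taken from either answer): it assumes the frame is already sorted by Team and Age, and the names out, is_g, seen and total are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Team':   ['A', 'A', 'B', 'B', 'A', 'B', 'A'],
    'Gender': ['M', 'F', 'M', 'M', 'M', 'F', 'M'],
    'Age':    [22, 25, 33, 18, 56, 28, 33],
    'Name':   ['Sam', 'Annie', 'Fred', 'James', 'Alan', 'Julie', 'Greg'],
}).sort_values(['Team', 'Age'])

out = df.copy()
for g in ['M', 'F']:
    is_g = (out['Gender'] == g).astype(int)   # 1 where the row has gender g, else 0
    by_team = is_g.groupby(out['Team'])
    seen = by_team.cumsum()                   # gender-g members up to and including this row
    total = by_team.transform('sum')          # gender-g members in the whole team
    out[g + '_Above'] = seen - is_g           # exclude the row itself
    out[g + '_Below'] = total - seen          # gender-g members after this row

print(out)
```

The new columns come out as integers directly, so no cast is needed afterwards. Members with equal ages within a team are ordered by whatever the sort produced, as in the loop-based version.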
Answer 1 (score: 1)
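You can groupby on the Team column and apply a custom function f.
In the function f, for each row, first select the values above and below with loc (the original answer used ix, which has since been removed from pandas), then drop the current row and get the counts with value_counts. Some values may be missing, so reindex is needed before selecting with loc:
def f(x):
    for i, d in x.iterrows():
        # label-based slices: rows up to i and rows from i on, minus row i itself
        above = x.loc[:i, 'Gender'].drop(i).value_counts().reindex(['M', 'F'])
        below = x.loc[i:, 'Gender'].drop(i).value_counts().reindex(['M', 'F'])
        x.loc[i, 'M_Above'] = above.loc['M']
        x.loc[i, 'M_Below'] = below.loc['M']
        x.loc[i, 'F_Above'] = above.loc['F']
        x.loc[i, 'F_Below'] = below.loc['F']
    return x

# group_keys=False keeps the original row index on recent pandas versions
df1 = df.groupby('Team', sort=False, group_keys=False).apply(f).fillna(0)
# cast float to int
df1 = df1.astype(dict.fromkeys(['M_Above', 'M_Below', 'F_Above', 'F_Below'], int))
print(df1)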
   Age Gender   Name Team  M_Above  M_Below  F_Above  F_Below
0   22      M    Sam    A        0        2        0        1
1   25      F  Annie    A        1        2        0        0
6   33      M   Greg    A        1        1        1        0
4   56      M   Alan    A        2        0        1        0
3   18      M  James    B        0        1        0        1
5   28      F  Julie    B        1        1        0        0
2   33      M   Fred    B        1        0        1        0
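To illustrate why f needs the reindex step: value_counts only reports the values that actually occur, so a slice with no 'F' rows has no 'F' entry to look up. A small sketch with made-up data; passing fill_value=0 is a variation on the answer (which fills the gaps later with fillna(0)), not what the answer itself does.

```python
import pandas as pd

genders = pd.Series(['M', 'M'])      # a slice that happens to contain no 'F' rows
counts = genders.value_counts()      # has an entry for 'M' only

# reindex adds the missing label; fill_value=0 fills it with 0 instead of NaN
safe = counts.reindex(['M', 'F'], fill_value=0)
print(safe['M'], safe['F'])
```

With fill_value=0 the counts stay integers, which would make the later fillna(0) and the int cast unnecessary for these columns.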