Question

I have a pandas DataFrame that looks like this:

+------+--------+-----+-------+
| Team | Gender | Age | Name  |
+------+--------+-----+-------+
| A    | M      | 22  | Sam   |
| A    | F      | 25  | Annie |
| B    | M      | 33  | Fred  |
| B    | M      | 18  | James |
| A    | M      | 56  | Alan  |
| B    | F      | 28  | Julie |
| A    | M      | 33  | Greg  |
+------+--------+-----+-------+

What I want to do is first group by Team and Gender (the grouping itself I can already do with df.groupby(['Team'], as_index=False)).

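Is there a way to sort the members of each group by age and add extra columns showing, for each member, how many members are above and how many are below that member?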

For example, for team 'A':

+------+--------+-----+-------+---------+---------+---------+---------+
| Team | Gender | Age | Name  | M_Above | M_Below | F_Above | F_Below |
+------+--------+-----+-------+---------+---------+---------+---------+
| A    | M      | 22  | Sam   | 0       | 2       | 0       | 1       |
| A    | F      | 25  | Annie | 1       | 2       | 0       | 0       |
| A    | M      | 33  | Greg  | 1       | 1       | 1       | 0       |
| A    | M      | 56  | Alan  | 2       | 0       | 1       | 0       |
+------+--------+-----+-------+---------+---------+---------+---------+

Answer 1

import pandas as pd

df = pd.DataFrame({'Team':['A','A','B','B','A','B','A'], 'Gender':['M','F','M','M','M','F','M'],
                   'Age':[22,25,33,18,56,28,33], 'Name':['Sam','Annie','Fred','James','Alan','Julie','Greg']}).sort_values(['Team','Age'])

for idx, data in df.groupby('Team'):
    # look the totals up by label; positional [0]/[1] is only right when males outnumber females
    counts = data['Gender'].value_counts()
    m_tot = counts.get('M', 0)                  # number of males in current team
    f_tot = counts.get('F', 0)                  # ditto (females)
    m_seen = 0                                  # males seen so far for current team
    f_seen = 0                                  # ditto (females)

    for row in data.iterrows():
        (M_Above, M_below, F_Above, F_Below) = (m_seen, m_tot-m_seen, f_seen, f_tot-f_seen)
        if row[1].Gender == 'M':
            m_seen += 1
            M_below -= 1
        else:
            f_seen += 1
            F_Below -= 1

        df.loc[row[0],'M_Above'] = M_Above
        df.loc[row[0],'M_Below'] = M_below
        df.loc[row[0],'F_Above'] = F_Above
        df.loc[row[0],'F_Below'] = F_Below

Result:

   Age Gender Team  M_Above  M_Below  F_Above  F_Below
0   22      M    A      0.0      2.0      0.0      1.0
1   25      F    A      1.0      2.0      0.0      0.0
6   33      M    A      1.0      1.0      1.0      0.0
4   56      M    A      2.0      0.0      1.0      0.0
3   18      M    B      0.0      1.0      0.0      1.0
5   28      F    B      1.0      1.0      0.0      0.0
2   33      M    B      1.0      0.0      1.0      0.0

If you want the new columns to be int (as in your example), use:

for new_col in ['M_Above', 'M_Below', 'F_Above', 'F_Below']:
    df[new_col] = df[new_col].astype(int)

Which gives:

   Age Gender   Name Team  M_Above  M_Below  F_Above  F_Below
0   22      M    Sam    A        0        2        0        1
1   25      F  Annie    A        1        2        0        0
6   33      M   Greg    A        1        1        1        0
4   56      M   Alan    A        2        0        1        0
3   18      M  James    B        0        1        0        1
5   28      F  Julie    B        1        1        0        0
2   33      M   Fred    B        1        0        1        0
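As a loop-free variation on the same idea (not part of the original answer), the "seen so far" counters can be expressed as grouped cumulative sums. This is a minimal sketch that assumes the frame is already sorted by Team and Age and that Gender only takes the values 'M' and 'F'; the helper names is_g, seen and total are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Team':   ['A', 'A', 'B', 'B', 'A', 'B', 'A'],
    'Gender': ['M', 'F', 'M', 'M', 'M', 'F', 'M'],
    'Age':    [22, 25, 33, 18, 56, 28, 33],
    'Name':   ['Sam', 'Annie', 'Fred', 'James', 'Alan', 'Julie', 'Greg'],
}).sort_values(['Team', 'Age'])

for g in ['M', 'F']:
    # 1 where the row has gender g, 0 otherwise
    is_g = (df['Gender'] == g).astype(int)
    # running count of gender g within each team, current row included
    seen = is_g.groupby(df['Team']).cumsum()
    # total count of gender g in the row's team
    total = is_g.groupby(df['Team']).transform('sum')
    df[g + '_Above'] = seen - is_g    # members of gender g that come before this row
    df[g + '_Below'] = total - seen   # members of gender g that come after this row

print(df)
```

Because everything is computed with integer arithmetic, the new columns come out as int directly and no astype step is needed.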

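Edit: (runtime comparison)

Note that this solution is faster than the accepted solution (the groupby/apply approach that was originally written with ix). Averaged over 1000 iterations it came out about 6 times faster in my tests, although the sample run below shows a ratio closer to 4; this can matter on larger DataFrames. Run this to check: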

import pandas as pd
from time import time
import numpy as np

def f(x):
    # the accepted solution, with .loc in place of .ix (.ix was removed in pandas 1.0)
    for i,d in x.iterrows():
        above = x.loc[:i, 'Gender'].drop(i).value_counts().reindex(['M','F'])
        below = x.loc[i:, 'Gender'].drop(i).value_counts().reindex(['M','F'])
        x.loc[i,'M_Above'] = above.loc['M']
        x.loc[i,'M_Below'] = below.loc['M']
        x.loc[i,'F_Above'] = above.loc['F']
        x.loc[i,'F_Below'] = below.loc['F']
    return x

df = pd.DataFrame({'Team':['A','A','B','B','A','B','A'], 'Gender':['M','F','M','M','M','F','M'],
                   'Age':[22,25,33,18,56,28,33], 'Name':['Sam','Annie','Fred','James','Alan','Julie','Greg']}).sort_values(['Team','Age'])
times = []
times2 = []

for i in range(1000):
    tic = time()

    for idx, data in df.groupby('Team'):
        counts = data['Gender'].value_counts()
        m_tot = counts.get('M', 0)                  # number of males in current team
        f_tot = counts.get('F', 0)                  # ditto (females)
        m_seen = 0                                  # males seen so far for current team
        f_seen = 0                                  # ditto (females)

        for row in data.iterrows():
            (M_Above, M_below, F_Above, F_Below) = (m_seen, m_tot-m_seen, f_seen, f_tot-f_seen)
            if row[1].Gender == 'M':
                m_seen += 1
                M_below -= 1
            else:
                f_seen += 1
                F_Below -= 1

            df.loc[row[0],'M_Above'] = M_Above
            df.loc[row[0],'M_Below'] = M_below
            df.loc[row[0],'F_Above'] = F_Above
            df.loc[row[0],'F_Below'] = F_Below

    toc = time()
    times.append(toc-tic)

for i in range(1000):
    tic = time()

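    # group_keys=False keeps the original row index on recent pandas versions
    df1 = df.groupby('Team', sort=False, group_keys=False).apply(f).fillna(0)
    new_cols = ['M_Above', 'M_Below', 'F_Above', 'F_Below']
    df1[new_cols] = df1[new_cols].astype(int)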

    toc = time()
    times2.append(toc-tic)

print(np.mean(times))
print(np.mean(times2))

Results:

0.0163134906292  # alternative solution
0.0622982912064  # approved solution
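For anyone repeating this kind of comparison, the standard library's timeit module is a less noisy alternative to taking tic/toc differences by hand. The sketch below is only an illustration: running_counts is a hypothetical stand-in for whichever solution is being measured, and the repeat and number values are arbitrary.

```python
import timeit
import pandas as pd

df = pd.DataFrame({
    'Team':   ['A', 'A', 'B', 'B', 'A', 'B', 'A'],
    'Gender': ['M', 'F', 'M', 'M', 'M', 'F', 'M'],
    'Age':    [22, 25, 33, 18, 56, 28, 33],
    'Name':   ['Sam', 'Annie', 'Fred', 'James', 'Alan', 'Julie', 'Greg'],
}).sort_values(['Team', 'Age'])

def running_counts(frame):
    """Hypothetical stand-in for the solution being timed."""
    is_m = (frame['Gender'] == 'M').astype(int)
    return is_m.groupby(frame['Team']).cumsum()

number = 100
# five independent measurements, each covering `number` calls
runs = timeit.repeat(lambda: running_counts(df), repeat=5, number=number)
# the minimum is usually the most representative figure; higher values reflect interference
per_call = min(runs) / number
print(f"{per_call:.6f} s per call")
```

To compare two solutions, time each one the same way on a fresh copy of the data, since both answers here add columns to the frame they are given.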

Answer 2

You can group by the Team column and apply a custom function f.

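In the function f, for each row, first slice the values above and below the row by label (the answer originally used ix, which was removed in pandas 1.0, so loc is used here), then drop the row itself and get the counts with value_counts. Some values are missing, so a reindex is needed before selecting the counts with loc:

def f(x):
    for i,d in x.iterrows():
        above = x.loc[:i, 'Gender'].drop(i).value_counts().reindex(['M','F'])
        below = x.loc[i:, 'Gender'].drop(i).value_counts().reindex(['M','F'])
        x.loc[i,'M_Above'] = above.loc['M']
        x.loc[i,'M_Below'] = below.loc['M']
        x.loc[i,'F_Above'] = above.loc['F']
        x.loc[i,'F_Below'] = below.loc['F']
    return x

# group_keys=False keeps the original row index on recent pandas versions
df1 = df.groupby('Team', sort=False, group_keys=False).apply(f).fillna(0)
# cast float to int
new_cols = ['M_Above', 'M_Below', 'F_Above', 'F_Below']
df1[new_cols] = df1[new_cols].astype(int)
print(df1)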
   Age Gender   Name Team  M_Above  M_Below  F_Above  F_Below
0   22      M    Sam    A        0        2        0        1
1   25      F  Annie    A        1        2        0        0
6   33      M   Greg    A        1        1        1        0
4   56      M   Alan    A        2        0        1        0
3   18      M  James    B        0        1        0        1
5   28      F  Julie    B        1        1        0        0
2   33      M   Fred    B        1        0        1        0
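To see why the reindex step is needed, here is a small standalone illustration (the genders Series is made up for the example). When a slice contains no rows of one gender, value_counts simply has no entry for it, and reindex adds the missing label back.

```python
import pandas as pd

# a slice that happens to contain no 'F' rows
genders = pd.Series(['M', 'M', 'M'])

counts = genders.value_counts()                    # only has an entry for 'M'
full = counts.reindex(['M', 'F'])                  # 'F' is added with NaN
filled = counts.reindex(['M', 'F'], fill_value=0)  # 'F' is added with 0 instead

print(full)
print(filled)
```

The NaN introduced by the plain reindex is what turns the new columns into floats and why the answer calls fillna(0) and then casts back to int; passing fill_value=0 is one possible way to avoid the NaN at the source.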
