How to feed a list as input to the groupby function in a pandas DataFrame

Time: 2017-02-13 07:03:48

Tags: python-3.x pandas analytics

Suppose a subset of the dataset contains these two columns:

     attacker_king              attacker_commander
0   Joffrey/Tommen Baratheon    Jaime Lannister
1   Joffrey/Tommen Baratheon    Gregor Clegane
2   Joffrey/Tommen Baratheon    Jaime Lannister, Andros Brax
3   Robb Stark                  Roose Bolton, Wylis Manderly, Medger Cerwyn
4   Robb Stark                  Robb Stark, Brynden Tully
5   Robb Stark                  Robb Stark, Tytos Blackwood, Brynden Tully

My goal is to get the set of commanders that each king deployed, according to the dataset.

[x for x in battles['attacker_commander'].dropna().str.split(',').sum()]

The command above only gives me a flat list of commanders, split on commas. But if I instead use the following groupby expression,

battles[['attacker_commander','attacker_king']].groupby('attacker_king').sum()

I get this output:

attacker_king                      attacker_commander   
Balon/Euron Greyjoy         Victarion GreyjoyAsha GreyjoyTheon GreyjoyTheo...
Joffrey/Tommen Baratheon    Jaime LannisterGregor CleganeJaime Lannister, ...
Robb Stark                  Roose Bolton, Wylis Manderly, Medger Cerwyn, H...
Stannis Baratheon           Stannis Baratheon, Davos SeaworthStannis Barat...

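The problem with this approach is that when a row holds a single commander, concatenating it with the next row yields 'Victarion GreyjoyAsha Greyjoy' instead of 'Victarion Greyjoy, Asha Greyjoy'. So it seems sensible to take the list created by
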
[x for x in battles['attacker_commander'].dropna().str.split(',').sum()]

and feed it to groupby('attacker_king'). Or what approach would you suggest?
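The concatenation problem described in the question can be reproduced on a tiny frame. This is only a minimal sketch: the `demo` frame is hypothetical, built from the two commander names visible in the output above, and `dtype=object` is an assumption made so that `sum()` falls back to plain string addition.

```python
import pandas as pd

# Hypothetical two-row frame using names from the output in the question.
# dtype=object is an assumption so that sum() adds the Python strings directly.
demo = pd.DataFrame(
    {
        'attacker_king': ['Balon/Euron Greyjoy', 'Balon/Euron Greyjoy'],
        'attacker_commander': ['Victarion Greyjoy', 'Asha Greyjoy'],
    },
    dtype=object,
)

# sum() glues the strings together with no separator between rows.
summed = demo.groupby('attacker_king')['attacker_commander'].sum()

# Joining with an explicit separator keeps the names apart.
joined = demo.groupby('attacker_king')['attacker_commander'].apply(', '.join)

print(summed['Balon/Euron Greyjoy'])
print(joined['Balon/Euron Greyjoy'])
```

The difference is only the separator: `sum()` has none, while `', '.join` inserts one between rows.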

2 answers:

Answer 0 (score: 3)

I think you first need apply with the join function:

battles.groupby('attacker_king')['attacker_commander'].apply(','.join)

If you need to remove NaN values:

battles.groupby('attacker_king')['attacker_commander'].apply(lambda x: ','.join(x.dropna()))

Then split and use set to get the unique values:

# parentheses are needed so the chained call can continue on the next line
df = (battles.groupby('attacker_king')['attacker_commander']
             .apply(lambda x: list(set(','.join(x.dropna()).split(',')))))
print (df)
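One caveat: because the sample values are separated by ", " rather than ",", splitting on ',' leaves a leading space on some names, so ' Jaime Lannister' and 'Jaime Lannister' would count as two different commanders. A possible variant that strips whitespace before deduplicating is sketched below. The helper name `unique_commanders` is hypothetical, the frame is rebuilt from the six rows shown in the question, and sorting the result is just a choice to make the output deterministic.

```python
import pandas as pd

# The six sample rows from the question.
battles = pd.DataFrame({
    'attacker_king': ['Joffrey/Tommen Baratheon'] * 3 + ['Robb Stark'] * 3,
    'attacker_commander': [
        'Jaime Lannister',
        'Gregor Clegane',
        'Jaime Lannister, Andros Brax',
        'Roose Bolton, Wylis Manderly, Medger Cerwyn',
        'Robb Stark, Brynden Tully',
        'Robb Stark, Tytos Blackwood, Brynden Tully',
    ],
})

def unique_commanders(s):
    # hypothetical helper: join the group's rows, split on commas,
    # strip surrounding whitespace, drop empty pieces, deduplicate, sort
    names = ','.join(s.dropna()).split(',')
    return sorted({name.strip() for name in names if name.strip()})

result = battles.groupby('attacker_king')['attacker_commander'].apply(unique_commanders)
print(result)
```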

The best way to debug this is to use a custom function first, and then rewrite the code as a lambda:

def f(x):
    #Series by attacker_commander per group
    print (x)
    #first remove NaN
    print (x.dropna())
    #join by ,
    print (','.join(x.dropna()))
    #create list by split
    print (','.join(x.dropna()).split(','))
    #convert to set - unique values
    print (set(','.join(x.dropna()).split(',')))
    #set convert to list
    print (list(set(','.join(x.dropna()).split(','))))
    return list(set(','.join(x.dropna()).split(',')))

df = battles.groupby('attacker_king')['attacker_commander'].apply(f)
print (df)

Another workable solution is to first drop the rows with NaN in that column using DataFrame.dropna:

def f(x):
    return list(set(','.join(x).split(',')))

df = battles.dropna(subset=['attacker_commander']).groupby('attacker_king')['attacker_commander'].apply(f)
print (df)
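On pandas 0.25 or later, the same drop-NaN-then-split idea can also be written with Series.explode, which turns each list element into its own row before grouping. This is a sketch of an alternative technique, not the method used in this answer. The frame is rebuilt from the question's sample rows, and the last row with a missing commander is hypothetical, added only to exercise dropna.

```python
import pandas as pd

battles = pd.DataFrame({
    'attacker_king': ['Joffrey/Tommen Baratheon'] * 3 + ['Robb Stark'] * 4,
    'attacker_commander': [
        'Jaime Lannister',
        'Gregor Clegane',
        'Jaime Lannister, Andros Brax',
        'Roose Bolton, Wylis Manderly, Medger Cerwyn',
        'Robb Stark, Brynden Tully',
        'Robb Stark, Tytos Blackwood, Brynden Tully',
        None,  # hypothetical row with a missing commander
    ],
})

names = (
    battles.set_index('attacker_king')['attacker_commander']
           .dropna()          # remove rows without a commander
           .str.split(',')    # one list of names per battle
           .explode()         # one row per name, king stays in the index
           .str.strip()       # remove the space left after each comma
)

# unique() keeps the order of first appearance within each king
result = names.groupby(level=0).unique()
print(result)
```

Each value in `result` is an array rather than a list; wrap it in `list(...)` if a plain list is needed.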

Answer 1 (score: 1)

You want to join the strings per group, then split and find the unique values.

df.groupby(
    'attacker_king'
).attacker_commander.apply(','.join).str.split(',').apply(pd.unique)

attacker_king
Joffrey/Tommen Baratheon      [Jaime Lannister, Gregor Clegane,  Andros Brax]
Robb Stark                  [Roose Bolton,  Wylis Manderly,  Medger Cerwyn...
Name: attacker_commander, dtype: object
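Note the leading spaces in the output above (for example ' Andros Brax'): they come from splitting ", "-separated text on ','. Also, as far as I know, newer pandas versions emit a FutureWarning when pd.unique is handed a plain Python list, which is what each element is after str.split. A possible variant that avoids both is sketched below. The helper `ordered_unique` is hypothetical, and `df` is rebuilt here from the question's sample rows.

```python
import pandas as pd

df = pd.DataFrame({
    'attacker_king': ['Joffrey/Tommen Baratheon'] * 3 + ['Robb Stark'] * 3,
    'attacker_commander': [
        'Jaime Lannister',
        'Gregor Clegane',
        'Jaime Lannister, Andros Brax',
        'Roose Bolton, Wylis Manderly, Medger Cerwyn',
        'Robb Stark, Brynden Tully',
        'Robb Stark, Tytos Blackwood, Brynden Tully',
    ],
})

def ordered_unique(names):
    # hypothetical helper: dict keys keep insertion order (Python 3.7+),
    # so duplicates drop out while first-seen order is preserved
    return list(dict.fromkeys(name.strip() for name in names))

result = (
    df.groupby('attacker_king')['attacker_commander']
      .apply(','.join)        # one long string per king
      .str.split(',')         # back to a list of names
      .apply(ordered_unique)  # strip whitespace and deduplicate
)
print(result)
```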