Suppose a subset of the dataset contains these two columns:
   attacker_king             attacker_commander
0  Joffrey/Tommen Baratheon  Jaime Lannister
1  Joffrey/Tommen Baratheon  Gregor Clegane
2  Joffrey/Tommen Baratheon  Jaime Lannister, Andros Brax
3  Robb Stark                Roose Bolton, Wylis Manderly, Medger Cerwyn
4  Robb Stark                Robb Stark, Brynden Tully
5  Robb Stark                Robb Stark, Tytos Blackwood, Brynden Tully
My goal is to get the set of commanders that each king deploys, according to the dataset.
[x for x in battles['attacker_commander'].dropna().str.split(',').sum()]
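The command above only produces a flat list of the comma-separated commanders, with no link to the king. But if I instead group by king and sum,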
battles[['attacker_commander','attacker_king']].groupby('attacker_king').sum()
I get this output:
attacker_king             attacker_commander
Balon/Euron Greyjoy       Victarion GreyjoyAsha GreyjoyTheon GreyjoyTheo...
Joffrey/Tommen Baratheon  Jaime LannisterGregor CleganeJaime Lannister, ...
Robb Stark                Roose Bolton, Wylis Manderly, Medger Cerwyn, H...
Stannis Baratheon         Stannis Baratheon, Davos SeaworthStannis Barat...
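The problem with this approach is that when a row contains only one commander, adding it to the next row produces output like 'Victarion GreyjoyAsha Greyjoy' instead of 'Victarion Greyjoy,Asha Greyjoy'. So does it make sense to take the list created with
[x for x in battles['attacker_commander'].dropna().str.split(',').sum()]
and feed it to groupby('attacker_king'), or what approach would you suggest?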
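For reference, the concatenation problem can be reproduced on a small frame. This is only a sketch: the `battles` DataFrame below is a hypothetical sample built from the six rows shown at the top, not the full dataset.

```python
import pandas as pd

# Hypothetical sample built from the six rows shown in the question
battles = pd.DataFrame({
    "attacker_king": ["Joffrey/Tommen Baratheon"] * 3 + ["Robb Stark"] * 3,
    "attacker_commander": [
        "Jaime Lannister",
        "Gregor Clegane",
        "Jaime Lannister, Andros Brax",
        "Roose Bolton, Wylis Manderly, Medger Cerwyn",
        "Robb Stark, Brynden Tully",
        "Robb Stark, Tytos Blackwood, Brynden Tully",
    ],
})

# Summing strings per group glues them together with no separator
summed = battles.groupby("attacker_king")["attacker_commander"].sum()
print(summed["Joffrey/Tommen Baratheon"])
```

The names from adjacent rows run together because string addition inserts nothing between the operands.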
Answer 0 (score: 3)
I think you first need apply with the join function:
battles.groupby('attacker_king')['attacker_commander'].apply(','.join)
If you need to remove NaN values:
battles.groupby('attacker_king')['attacker_commander'].apply(lambda x: ','.join(x.dropna()))
Then split and use set to get the unique values:
# Parentheses let the method chain continue on the next line
df = (battles.groupby('attacker_king')['attacker_commander']
             .apply(lambda x: list(set(','.join(x.dropna()).split(',')))))
print(df)
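One caveat with the steps above: splitting on a bare ',' keeps the space that follows each comma, so 'Brynden Tully' and ' Brynden Tully' would count as different values. Below is a minimal sketch that also strips whitespace. The sample frame and the helper name `unique_commanders` are assumptions made for illustration, and the result is sorted only to make the output deterministic.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the NaN row is added here to exercise dropna()
battles = pd.DataFrame({
    "attacker_king": ["Joffrey/Tommen Baratheon"] * 3 + ["Robb Stark"] * 4,
    "attacker_commander": [
        "Jaime Lannister",
        "Gregor Clegane",
        "Jaime Lannister, Andros Brax",
        "Roose Bolton, Wylis Manderly, Medger Cerwyn",
        "Robb Stark, Brynden Tully",
        "Robb Stark, Tytos Blackwood, Brynden Tully",
        np.nan,
    ],
})

def unique_commanders(names):
    # Join the non-missing rows, split on commas, strip spaces, deduplicate
    joined = ",".join(names.dropna())
    return sorted({name.strip() for name in joined.split(",")})

result = battles.groupby("attacker_king")["attacker_commander"].apply(unique_commanders)
print(result)
```

If the order of first appearance matters, a set is not suitable because it does not preserve order.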
The best way to debug this is to write a custom function first and then rewrite the code as a lambda:
def f(x):
    # Series of attacker_commander values for one group
    print(x)
    # first remove NaN
    print(x.dropna())
    # join by ,
    print(','.join(x.dropna()))
    # create list by split
    print(','.join(x.dropna()).split(','))
    # convert to set - unique values
    print(set(','.join(x.dropna()).split(',')))
    # convert set back to list
    print(list(set(','.join(x.dropna()).split(','))))
    return list(set(','.join(x.dropna()).split(',')))

df = battles.groupby('attacker_king')['attacker_commander'].apply(f)
print(df)
However, another possible solution is to first drop the rows where the column is NaN with DataFrame.dropna:
def f(x):
    return list(set(','.join(x).split(',')))

df = battles.dropna(subset=['attacker_commander']).groupby('attacker_king')['attacker_commander'].apply(f)
print(df)
Answer 1 (score: 1)
You want to join the strings per group, then split and find the unique values.
df.groupby(
'attacker_king'
).attacker_commander.apply(','.join).str.split(',').apply(pd.unique)
attacker_king
Joffrey/Tommen Baratheon [Jaime Lannister, Gregor Clegane, Andros Brax]
Robb Stark [Roose Bolton, Wylis Manderly, Medger Cerwyn...
Name: attacker_commander, dtype: object
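As a variation on the same idea, the split can happen before grouping. The sketch below swaps in a different technique, Series.str.split followed by DataFrame.explode (pandas 0.25 or later), and assumes a small hypothetical `battles` sample. The regex pattern also absorbs the spaces around each comma, and unique() keeps the order of first appearance.

```python
import pandas as pd

# Hypothetical sample built from the rows shown in the question
battles = pd.DataFrame({
    "attacker_king": ["Joffrey/Tommen Baratheon"] * 3 + ["Robb Stark"] * 3,
    "attacker_commander": [
        "Jaime Lannister",
        "Gregor Clegane",
        "Jaime Lannister, Andros Brax",
        "Roose Bolton, Wylis Manderly, Medger Cerwyn",
        "Robb Stark, Brynden Tully",
        "Robb Stark, Tytos Blackwood, Brynden Tully",
    ],
})

exploded = (
    battles.dropna(subset=["attacker_commander"])
    # split each cell into a list, dropping spaces around the commas
    .assign(attacker_commander=lambda d: d["attacker_commander"].str.split(r"\s*,\s*"))
    # one row per (king, commander) pair
    .explode("attacker_commander")
)

unique_per_king = exploded.groupby("attacker_king")["attacker_commander"].unique()
print(unique_per_king)
```

Each value in the result is an array of names rather than a list; call list() on it if a plain list is needed.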