How do I extract values from different columns after a groupby in pandas?

Asked: 2019-06-18 20:49:23

Tags: python python-3.x pandas

I have the following input file in CSV format:

**Input**

ID,GroupID,Person,Parent
ID_001,A001,John Doe,Yes
ID_002,A001,Mary Jane,No
ID_003,A001,James Smith;John Doe,Yes
ID_004,B003,Nathan Drake,Yes
ID_005,B003,Troy Baker,No

The desired output is:

**Desired output**

ID,GroupID,Person
ID_001,A001,John Doe;Mary Jane;James Smith
ID_003,A001,John Doe;Mary Jane;James Smith
ID_004,B003,Nathan Drake;Troy Baker

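Basically, I want to group rows by GroupID and concatenate all the values in the Person column that belong to each group. Then, for each group, the output should contain only the rows whose Parent column is "Yes", showing the ID, the GroupID, and the concatenated Person values for that group.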

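I can already concatenate all the Person values for a given group and remove any duplicate names from the Person column in the output. This is what I have so far: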

import pandas as pd

inputcsv = 'path/to/input.csv'    # placeholder: path to the input csv file
outputcsv = 'path/to/output.csv'  # placeholder: path to the output csv file

colnames = ['ID', 'GroupID', 'Person', 'Parent']
df1 = pd.read_csv(inputcsv, names = colnames, header = None, skiprows = 1)

#First I do a groupby on GroupID, concatenate the values in the Person column, and finally remove the duplicate person values from the output before saving the df to a csv.

df2 = df1.groupby('GroupID')['Person'].apply(';'.join).str.split(';').apply(set).apply(';'.join).reset_index()

df2.to_csv(outputcsv, sep=',', index=False)

This produces the following output:

GroupID,Person
A001,John Doe;Mary Jane;James Smith
B003,Nathan Drake;Troy Baker

I can't figure out how to include the ID column, or how to keep every row of a group where Parent is "Yes" (as shown in the desired output above).

1 Answer:

Answer 0: (score: 1)

IIUC

# First, split each Person string into a list of names
df.Person = df.Person.str.split(';')

# transform writes the same per-group result to every row of that group; see
# https://stackoverflow.com/questions/27517425/apply-vs-transform-on-a-group-object
df['Person'] = df.groupby(['GroupID']).Person.transform(lambda x: ';'.join(set(sum(x, []))))

# Then keep only the rows where Parent is 'Yes'
df = df.loc[df.Parent.eq('Yes')]
df
Out[239]: 
       ID GroupID                          Person Parent
0  ID_001    A001  James Smith;John Doe;Mary Jane    Yes
2  ID_003    A001  James Smith;John Doe;Mary Jane    Yes
3  ID_004    B003         Troy Baker;Nathan Drake    Yes
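
Note that the names in the output above come out in a different order from the desired output, because a `set` does not keep insertion order. The following is a minimal, self-contained sketch of the same transform-then-filter idea that deduplicates in first-seen order and keeps only the three requested columns. It is not part of the original answer: the helper name `join_unique` is hypothetical, and the sample data is read from an in-memory string rather than from a file so the snippet can be run as-is.

```python
import io

import pandas as pd

# Sample data copied from the question, read from a string instead of a file.
csv_text = """ID,GroupID,Person,Parent
ID_001,A001,John Doe,Yes
ID_002,A001,Mary Jane,No
ID_003,A001,James Smith;John Doe,Yes
ID_004,B003,Nathan Drake,Yes
ID_005,B003,Troy Baker,No
"""
df = pd.read_csv(io.StringIO(csv_text))


def join_unique(people):
    """Hypothetical helper: join each name in a group once, in first-seen order."""
    # Flatten the ';'-separated cells of the group into a single list of names.
    names = [name for cell in people for name in cell.split(';')]
    # dict.fromkeys drops duplicates while preserving insertion order.
    return ';'.join(dict.fromkeys(names))


# transform broadcasts the per-group string back onto every row of the group,
# so the ID and Parent columns stay aligned with the original rows.
df['Person'] = df.groupby('GroupID')['Person'].transform(join_unique)

# Keep only the rows flagged as Parent == 'Yes', and only the requested columns.
result = df.loc[df['Parent'].eq('Yes'), ['ID', 'GroupID', 'Person']]

print(result.to_csv(index=False))
```

With the sample data this should reproduce the desired output from the question, including the order of the names. To write to a file instead of printing, pass a path to `result.to_csv(...)` as in the question's code.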