Question

I want to group-by three columns, and then find the mean of a fourth numerical column for all rows which are duplicated across the first three columns. I can achieve this with the following function:

df2 = df.groupby(['col1', 'col2', 'col3'], as_index=False)['col4'].mean()

The problem is that I also want a fifth column which will aggregate for all rows grouped by the groupby function, which I don't know how to do on top of the previous function. For example:

df 
index    col1        col2       col3       col4       col5
0        Week_1      James      John       1          when and why?
1        Week_1      James      John       3          How?
2        Week_2      James      John       2          Do you know when?
3        Week_2      Mark       Jim        3          What time?
4        Week_2      Andrew     Simon      1          How far is it?
5        Week_2      Andrew     Simon      2          Are you going?


CURRENT(with above function):
index    col1        col2       col3       col4
0        Week_1      James      John       2
1        Week_2      James      John       2
2        Week_2      Mark       Jim        3
3        Week_2      Andrew     Simon      1.5

DESIRED:
index    col1        col2       col3       col4       col5
0        Week_1      James      John       2          when and why?, How?
2        Week_2      James      John       2          Do you know when?
3        Week_2      Mark       Jim        3          What time?
4        Week_2      Andrew     Simon      1.5        How far is it?, Are you going?

I've tried here and here, however the .mean() function I'm using complicates the process. Any help would be appreciated. (If possible, I would like to specify a custom delimiter separating the strings of col5 when aggregated).

Answer 1

You can define for each column aggregate function:

df2=df.groupby(['col1','col2','col3'], as_index=False).agg({'col4':'mean', 'col5':','.join})
print (df2)
     col1    col2   col3  col4                           col5
0  Week_1   James   John   2.0             when and why?,How?
1  Week_2  Andrew  Simon   1.5  How far is it?,Are you going?
2  Week_2   James   John   2.0              Do you know when?
3  Week_2    Mark    Jim   3.0                     What time?

General solution is numeric columns aggregate by mean and other by join:

f = lambda x: x.mean() if np.issubdtype(x.dtype, np.number) else ', '.join(x)
df2 = df.groupby(['col1', 'col2', 'col3'], as_index=False).agg(f)
print (df2)

     col1    col2   col3  col4                            col5
0  Week_1   James   John   2.0             when and why?, How?
1  Week_2  Andrew  Simon   1.5  How far is it?, Are you going?
2  Week_2   James   John   2.0               Do you know when?
3  Week_2    Mark    Jim   3.0                      What time?

Answer 2

df = pd.DataFrame({
        'col1':['a','a','b','b'],
        'col2':[1,2,1,1],
        'col3':['str1','str2','str3','str4']
        })

result = df.groupby(['col1','col2'])['col3'].apply(lambda x:','.join(list(x)))

Python - Group-by multiple columns with .mean() and .agg()

2 个答案: