I want to group-by three columns, and then find the mean of a fourth numerical column for all rows which are duplicated across the first three columns. I can achieve this with the following function:
df2 = df.groupby(['col1', 'col2', 'col3'], as_index=False)['col4'].mean()
The problem is that I also want a fifth column which will aggregate for all rows grouped by the groupby function, which I don't know how to do on top of the previous function. For example:
df
index col1 col2 col3 col4 col5
0 Week_1 James John 1 when and why?
1 Week_1 James John 3 How?
2 Week_2 James John 2 Do you know when?
3 Week_2 Mark Jim 3 What time?
4 Week_2 Andrew Simon 1 How far is it?
5 Week_2 Andrew Simon 2 Are you going?
CURRENT(with above function):
index col1 col2 col3 col4
0 Week_1 James John 2
1 Week_2 James John 2
2 Week_2 Mark Jim 3
3 Week_2 Andrew Simon 1.5
DESIRED:
index col1 col2 col3 col4 col5
0 Week_1 James John 2 when and why?, How?
2 Week_2 James John 2 Do you know when?
3 Week_2 Mark Jim 3 What time?
4 Week_2 Andrew Simon 1.5 How far is it?, Are you going?
I've tried here and here, however the .mean() function I'm using complicates the process. Any help would be appreciated. (If possible, I would like to specify a custom delimiter separating the strings of col5 when aggregated).
答案 0 :(得分:2)
You can define for each column aggregate function:
df2=df.groupby(['col1','col2','col3'], as_index=False).agg({'col4':'mean', 'col5':','.join})
print (df2)
col1 col2 col3 col4 col5
0 Week_1 James John 2.0 when and why?,How?
1 Week_2 Andrew Simon 1.5 How far is it?,Are you going?
2 Week_2 James John 2.0 Do you know when?
3 Week_2 Mark Jim 3.0 What time?
General solution is numeric columns aggregate by mean
and other by join
:
f = lambda x: x.mean() if np.issubdtype(x.dtype, np.number) else ', '.join(x)
df2 = df.groupby(['col1', 'col2', 'col3'], as_index=False).agg(f)
print (df2)
col1 col2 col3 col4 col5
0 Week_1 James John 2.0 when and why?, How?
1 Week_2 Andrew Simon 1.5 How far is it?, Are you going?
2 Week_2 James John 2.0 Do you know when?
3 Week_2 Mark Jim 3.0 What time?
答案 1 :(得分:1)
df = pd.DataFrame({
'col1':['a','a','b','b'],
'col2':[1,2,1,1],
'col3':['str1','str2','str3','str4']
})
result = df.groupby(['col1','col2'])['col3'].apply(lambda x:','.join(list(x)))