Aggregating a list column in a DataFrame with a custom function

Time: 2017-06-16 12:24:54

Tags: python pandas

Task

I want to aggregate my DataFrame with a custom function

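import numpy as np
import pandas as pd

# np.nan is used instead of np.NaN, since the np.NaN alias was removed in NumPy 2.0
df = pd.DataFrame({'a': [1,1,1,2,2], 'b': [[(1,2,3),(4,5),(6,)],[(7,8),(9,10)],np.nan,[(11,12),(13,)],np.nan], 'c': [1,2,3,4,5]})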

   a                          b  c
0  1  [(1, 2, 3), (4, 5), (6,)]  1
1  1          [(7, 8), (9, 10)]  2
2  1                        NaN  3
3  2          [(11, 12), (13,)]  4
4  2                        NaN  5

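so that, within each group, the lists in column b are extended into a single list and column c is summed. The result should be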

pd.DataFrame({'a': [1,2], 'b': [[(1,2,3),(4,5),(6,),(7,8),(9,10)],[(11,12),(13,)]], 'c': [6,9]})

   a                                           b  c
0  1  [(1, 2, 3), (4, 5), (6,), (7, 8), (9, 10)]  6
1  2                           [(11, 12), (13,)]  9

Attempted solution

I went with
def mylistaggregator(l):
    return [item for sublist in l.tolist() for item in sublist]

df. \
    groupby('a', sort=False). \
    agg({'b': mylistaggregator,
         'c': 'sum'})

but this raises

TypeError: 'float' object is not iterable

and I am not sure how to fix it. I also experimented with a lambda, but got nowhere.

Additional information

Running

types = []
for i in df.b:
    types.append(str(type(i)))
np.unique(types)

on my actual dataset returns

array(["<class 'float'>", "<class 'list'>"], 
      dtype='<U15')
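
The two types reported above point at the cause: the NaN placeholders are plain Python floats, and the nested comprehension tries to iterate over every entry of the group. A minimal sketch that reproduces the error without pandas is shown here; `rows` and `flatten` are hypothetical names used only for illustration, and `float("nan")` stands in for `np.nan`.

```python
# float("nan") stands in for np.nan, which is also a plain Python float
nan = float("nan")

# hypothetical stand-in for the values of column b within one group
rows = [[(1, 2, 3), (4, 5)], nan, [(7, 8)]]

def flatten(values):
    # same nested comprehension as in the question, with no NaN handling
    return [item for sublist in values for item in sublist]

try:
    flatten(rows)
    error_message = None
except TypeError as exc:
    # iterating over the float entry is what fails
    error_message = str(exc)

print(error_message)
```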

1 Answer:

Answer 0: (score: 1)

You need to filter out the NaN values:

def mylistaggregator(l):
    # keep only the entries that are lists, then flatten them
    return [item for sublist in l.tolist() if isinstance(sublist, list) for item in sublist]

Or:

def mylistaggregator(l):
    # skip the float entries (NaN is a float), then flatten the rest
    return [item for subl in l.tolist() if not isinstance(subl, float) for item in subl]



df1 = df. \
    groupby('a', sort=False). \
    agg({'b': mylistaggregator,
         'c': 'sum'})

print(df1)
                                            b  c
a                                               
1  [(1, 2, 3), (4, 5), (6,), (7, 8), (9, 10)]  6
2                           [(11, 12), (13,)]  9
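
The two filters above differ slightly in what they let through. The sketch below applies the `isinstance(sublist, list)` variant to a plain Python list, under the assumption that the column could also contain other non-list placeholders such as `None`; `flatten_lists` and `group` are hypothetical names, not part of the original answer.

```python
def flatten_lists(values):
    """Concatenate the list entries of `values`, skipping anything that is not a list."""
    return [item for sublist in values if isinstance(sublist, list) for item in sublist]

# stand-in for the values of column b in group a == 1
group = [[(1, 2, 3), (4, 5), (6,)], [(7, 8), (9, 10)], float("nan")]
print(flatten_lists(group))

# None is skipped as well, whereas the `not isinstance(subl, float)` filter
# would try to iterate over it and raise a TypeError
print(flatten_lists([[(11, 12)], None]))
```

Checking for `list` is the stricter of the two choices: anything that is not a list is ignored, so unexpected placeholders do not break the aggregation.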

Another solution is to replace NaN with []:

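def mylistaggregator(l):
    return [item for sublist in l.tolist() for item in sublist]

# one empty list per row; recent pandas versions may reject a length-1 list with a longer index
s = pd.Series([[] for _ in df.index], index=df.index)
df['b'] = df['b'].combine_first(s)
#or
#df['b'] = df['b'].fillna(s)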

df1 = df. \
    groupby('a', sort=False). \
    agg({'b': mylistaggregator,
         'c': 'sum'})

print(df1)
                                            b  c
a                                               
1  [(1, 2, 3), (4, 5), (6,), (7, 8), (9, 10)]  6
2                           [(11, 12), (13,)]  9
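
If modifying column b is not desirable, the NaN entries can also be dropped inside the aggregator. The following is only a sketch of that idea, not part of the original answer: it combines `Series.dropna` with `itertools.chain.from_iterable`, and `chain_lists` is a hypothetical name.

```python
from itertools import chain

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, 1, 1, 2, 2],
    'b': [[(1, 2, 3), (4, 5), (6,)], [(7, 8), (9, 10)], np.nan, [(11, 12), (13,)], np.nan],
    'c': [1, 2, 3, 4, 5],
})

def chain_lists(series):
    # dropna() removes the NaN placeholders; chain.from_iterable joins the remaining lists
    return list(chain.from_iterable(series.dropna()))

# df itself is left unchanged; only the aggregated result is built
df1 = df.groupby('a', sort=False).agg({'b': chain_lists, 'c': 'sum'})
print(df1)
```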