I want to apply a custom aggregation to my DataFrame
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1,1,1,2,2], 'b': [[(1,2,3),(4,5),(6,)],[(7,8),(9,10)],np.nan,[(11,12),(13,)],np.nan], 'c': [1,2,3,4,5]})
   a                          b  c
0  1  [(1, 2, 3), (4, 5), (6,)]  1
1  1          [(7, 8), (9, 10)]  2
2  1                        NaN  3
3  2          [(11, 12), (13,)]  4
4  2                        NaN  5
so that the lists in column b are concatenated with one another within each group. The result should be
pd.DataFrame({'a': [1,2], 'b': [[(1,2,3),(4,5),(6,),(7,8),(9,10)],[(11,12),(13,)]], 'c': [6,9]})
   a                                           b  c
0  1  [(1, 2, 3), (4, 5), (6,), (7, 8), (9, 10)]  6
1  2                           [(11, 12), (13,)]  9
I tried
def mylistaggregator(l):
    return [item for sublist in l.tolist() for item in sublist]
df. \
    groupby('a', sort=False). \
    agg({'b': mylistaggregator,
         'c': 'sum'})
but this raises
TypeError: 'float' object is not iterable
and I am not sure how to fix it. I also tried tinkering with a lambda, but got nowhere.
Running
types = []
for i in df.b:
    types.append(str(type(i)))
np.unique(types)
on my actual dataset returns
array(["<class 'float'>", "<class 'list'>"],
dtype='<U15')
Answer 0 (score: 1)
You need to filter out the NaNs:
def mylistaggregator(l):
    # keep only real lists; NaN (a float) is skipped
    return [item for sublist in l.tolist() if isinstance(sublist, list) for item in sublist]
Or:
def mylistaggregator(l):
    # skip float entries (the NaNs) and flatten everything else
    return [item for subl in l.tolist() if not isinstance(subl, float) for item in subl]
df1 = df. \
    groupby('a', sort=False). \
    agg({'b': mylistaggregator,
         'c': 'sum'})
print(df1)
                                            b  c
a
1  [(1, 2, 3), (4, 5), (6,), (7, 8), (9, 10)]  6
2                           [(11, 12), (13,)]  9
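The filter works because a missing value in an object column is a float NaN, and the original comprehension tries to iterate over it. Here is a minimal pandas-free sketch of that logic for one group's values. The names `values` and `flatten_lists` are illustrative and do not come from the question:

```python
# One group's worth of column b, with a float NaN standing in for the missing row.
values = [[(1, 2, 3), (4, 5), (6,)], [(7, 8), (9, 10)], float("nan")]

def flatten_lists(items):
    # Keep only real lists; the float NaN is skipped instead of being iterated.
    return [item for sub in items if isinstance(sub, list) for item in sub]

error_message = None
try:
    # Without the filter, the NaN entry triggers the error from the question.
    [item for sub in values for item in sub]
except TypeError as exc:
    error_message = str(exc)

flat = flatten_lists(values)
print(error_message)
print(flat)
```

The `isinstance(sub, list)` check is the stricter of the two filters above: it would also skip any other non-list value, whereas the `not isinstance(subl, float)` variant only skips floats.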
Another solution is to replace the NaNs with []:
def mylistaggregator(l):
    return [item for sublist in l.tolist() for item in sublist]
# one empty list per row; a single-element [[]] may not be broadcast
# to the index length in recent pandas versions
s = pd.Series([[] for _ in range(len(df))], index=df.index)
df['b'] = df['b'].combine_first(s)
#or
#df['b'] = df['b'].fillna(s)
df1 = df. \
    groupby('a', sort=False). \
    agg({'b': mylistaggregator,
         'c': 'sum'})
print(df1)
                                            b  c
a
1  [(1, 2, 3), (4, 5), (6,), (7, 8), (9, 10)]  6
2                           [(11, 12), (13,)]  9
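A third variant, not part of the answer above, leaves the DataFrame untouched and drops the NaNs inside the aggregator itself. This is only a sketch, assuming the same example data as in the question. The helper name `flatten_non_null` is made up for illustration, and `itertools.chain.from_iterable` is swapped in for the nested list comprehension:

```python
from itertools import chain

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, 1, 1, 2, 2],
    'b': [[(1, 2, 3), (4, 5), (6,)], [(7, 8), (9, 10)], np.nan,
          [(11, 12), (13,)], np.nan],
    'c': [1, 2, 3, 4, 5],
})

def flatten_non_null(s):
    # s holds the column b values of one group;
    # dropna() removes the NaN rows before the remaining lists are chained.
    return list(chain.from_iterable(s.dropna()))

result = df.groupby('a', sort=False).agg({'b': flatten_non_null, 'c': 'sum'})
print(result)
```

Compared with replacing NaN by [], this keeps column b of the original DataFrame as it was, which may matter if the missing values are needed later.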