How do I concatenate the values of a list-type column across rows in pandas? For example:
Before:
1 a [a,b,c]
1 b [a,d]
After:
1 b [a,b,c,d]
I concatenated the lists column-wise (within each row) as follows:
df['all_poi'] = df['poi_part1'] + df['poi_part2']
Current output:
location_id city all_poi
6265981 Port Severn [Mount St. Louis Moonstone , Horseshoe Valley , Lake Muskoka]
6265981 Port Severn [Mount St. Louis Moonstone , Little Lake Park , Bamboo Spa , Lake Huron]
Expected output:
location_id city all_poi
6265981 Port Severn [Mount St. Louis Moonstone , Horseshoe Valley , Lake Muskoka, Little Lake Park , Bamboo Spa , Lake Huron]
Look at the all_poi values: they should be merged by location_id.
Answer 0 (score: 3)
You can build a set inside a custom function passed to GroupBy.agg:
f = lambda x: list(set(z for y in x for z in y))
df = df.groupby(['location_id', 'city'])['all_poi'].agg(f).reset_index()
print (df)
location_id city all_poi
0 Port Severn [Bamboo Spa, Mount St.Louis Moonstone, Lake Hu...
If order and performance matter, use a dict to remove the duplicates:
f = lambda x: list(dict.fromkeys([z for y in x for z in y]).keys())
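As a rough sketch of the order-preserving variant, the snippet below rebuilds the question's two rows by hand and applies dict.fromkeys inside GroupBy.agg. The DataFrame values are copied from the question; the helper name merge_unique is only illustrative and not part of the original answer.

```python
import pandas as pd

# Sample data rebuilt from the two rows shown in the question.
df = pd.DataFrame({
    "location_id": [6265981, 6265981],
    "city": ["Port Severn", "Port Severn"],
    "all_poi": [
        ["Mount St. Louis Moonstone", "Horseshoe Valley", "Lake Muskoka"],
        ["Mount St. Louis Moonstone", "Little Lake Park", "Bamboo Spa", "Lake Huron"],
    ],
})

def merge_unique(lists):
    # Flatten the group's lists, then drop duplicates;
    # dict keys keep first-seen order.
    return list(dict.fromkeys(item for sub in lists for item in sub))

out = df.groupby(["location_id", "city"])["all_poi"].agg(merge_unique).reset_index()
print(out["all_poi"].iloc[0])
```

Unlike the set-based version, the merged list here keeps the order in which each place first appears, so "Mount St. Louis Moonstone" stays first.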
Another idea is to use unique:
f = lambda x: pd.Series([z for y in x for z in y]).unique().tolist()
EDIT:
If there are more columns and you need the first value of each per group:
df.groupby('location_id').agg({'city': 'first', 'all_poi': f}).reset_index()
If you need other aggregation methods, such as sum, mean, or join:
df.groupby('location_id').agg({'city': 'first',
                               'all_poi': f,
                               'cols1': 'sum',
                               'vals': ','.join,
                               'vals1': lambda x: list(x)}).reset_index()
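To make the per-column dictionary concrete, here is a small sketch. The column names cols1, vals and vals1 come from the snippet above, but their values are invented for illustration, and f is assumed to be the order-preserving dict.fromkeys function from earlier in this answer.

```python
import pandas as pd

# Hypothetical frame: cols1 / vals / vals1 values are made up for this example.
df = pd.DataFrame({
    "location_id": [1, 1],
    "city": ["New York", "New York"],
    "all_poi": [["a", "b", "c"], ["a", "d"]],
    "cols1": [1, 2],
    "vals": ["x", "y"],
    "vals1": [10, 20],
})

# Order-preserving merge of the lists in each group.
f = lambda x: list(dict.fromkeys(z for y in x for z in y))

out = df.groupby("location_id").agg({
    "city": "first",               # keep the first city seen in the group
    "all_poi": f,                  # merge and de-duplicate the lists
    "cols1": "sum",                # numeric sum
    "vals": ",".join,              # join strings with commas
    "vals1": lambda x: list(x),    # collect the raw values into a list
}).reset_index()
print(out)
```

Each key in the dictionary picks a column and each value says how to collapse that column's rows within a group, so different columns can use different reducers in a single pass.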
Answer 1 (score: 0)
A simple sum():
res=df.groupby(["location_id"], as_index=False).agg({"city": "last", "all_poi": "sum"})
res["all_poi"]=res["all_poi"].map(set)
Output:
Before
location_id ... all_poi
0 6265981 ... [Mount St. Louis Moonstone, Horseshoe Valley, Lake Muskoka]
1 6265981 ... [Mount St. Louis Moonstone, Little Lake Park, Bamboo Spa, Lake Huron]
After:
location_id ... all_poi
0 6265981 ... {Horseshoe Valley, Lake Muskoka, Lake Huron, Bamboo Spa, Little Lake Park, Mount St. Louis Moonstone}
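Note that the result above holds sets, whose element order is arbitrary, rather than lists. If a list is wanted, one possible follow-up (a sketch that assumes alphabetical order is acceptable, which the question does not state) is to sort each set back into a list. The sample frame is rebuilt from the question's rows.

```python
import pandas as pd

# Sample data rebuilt from the two rows shown in the question.
df = pd.DataFrame({
    "location_id": [6265981, 6265981],
    "city": ["Port Severn", "Port Severn"],
    "all_poi": [
        ["Mount St. Louis Moonstone", "Horseshoe Valley", "Lake Muskoka"],
        ["Mount St. Louis Moonstone", "Little Lake Park", "Bamboo Spa", "Lake Huron"],
    ],
})

# "sum" on a column of lists concatenates them within each group.
res = df.groupby("location_id", as_index=False).agg({"city": "last", "all_poi": "sum"})
merged = res["all_poi"].iloc[0]  # still contains the duplicate entry

# De-duplicate via set, then sort to get a deterministic list.
res["all_poi"] = res["all_poi"].map(lambda items: sorted(set(items)))
print(res["all_poi"].iloc[0])
```

Summing lists creates a new list at every step, so for groups with many rows the flatten-and-deduplicate functions from the first answer are likely to scale better.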
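Answer 2 (score: 0)
The other answers look more compact, but you can also use sum with groupby to merge the lists, then create a set to eliminate the duplicates, and finally convert the set back to a list: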
import pandas as pd
df = pd.DataFrame([['1', 'New York', ['a', 'b', 'c']], ['1', 'New York', ['a', 'd']]],
                  columns=['location_id', 'city', 'all_poi'])
# sum concatenates the lists per group; set drops duplicates; list converts back
df.groupby('location_id')['all_poi'].sum().apply(set).apply(list)