pandas: merge lists from rows that share the same ID

Time: 2020-09-04 09:19:18

Tags: python pandas

How can I concatenate the rows of a list-type column in pandas? For example:

Before:

1  a  [a,b,c]  
1  b  [a,d] 

After:

1  b  [a,b,c,d]

I concatenated the lists at the column level as follows:

df['all_poi'] = df['poi_part1'] + df['poi_part2']

Current output:

location_id  city            all_poi
6265981     Port Severn     [Mount St. Louis Moonstone , Horseshoe Valley , Lake Muskoka]
6265981     Port Severn     [Mount St. Louis Moonstone ,  Little Lake Park , Bamboo Spa , Lake Huron]

Expected output:

location_id    city             all_poi
6265981     Port Severn     [Mount St. Louis Moonstone , Horseshoe Valley , Lake Muskoka, Little Lake Park , Bamboo Spa , Lake Huron]

Look at the all_poi values: they should be merged for rows that share the same location_id.

3 answers:

Answer 0 (score: 3)

You can build a set inside a custom function passed to GroupBy.agg:

f = lambda x: list(set(z for y in x for z in y))
df = df.groupby(['location_id', 'city'])['all_poi'].agg(f).reset_index()
print (df)
  location_id    city                                            all_poi
0        Port  Severn  [Bamboo Spa, Mount St.Louis Moonstone, Lake Hu...
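As a self-contained sketch of the set-based approach, the snippet below rebuilds the question's two rows by hand and groups them. The frame contents and the helper name merge_unique are assumptions made for illustration. A set has no defined order, so this version sorts the result to keep the output stable.

```python
import pandas as pd

# Sample rows modelled on the question's data (typed in by hand for illustration).
df = pd.DataFrame({
    "location_id": [6265981, 6265981],
    "city": ["Port Severn", "Port Severn"],
    "all_poi": [
        ["Mount St. Louis Moonstone", "Horseshoe Valley", "Lake Muskoka"],
        ["Mount St. Louis Moonstone", "Little Lake Park", "Bamboo Spa", "Lake Huron"],
    ],
})

def merge_unique(lists):
    # Flatten every list in the group, deduplicate with a set,
    # then sort because sets do not preserve order.
    return sorted({poi for sub in lists for poi in sub})

merged = (
    df.groupby(["location_id", "city"])["all_poi"]
      .agg(merge_unique)
      .reset_index()
)
print(merged)
```

Dropping the sorted() call and returning list(set(...)) matches the answer's lambda, at the cost of an arbitrary element order.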

If order and performance matter, use a dict to remove duplicates:

f = lambda x: list(dict.fromkeys([z for y in x for z in y]).keys())
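To illustrate why the dict variant keeps order, here is a small sketch on made-up data (the frame and the name merge_ordered are assumptions). dict.fromkeys keeps the first occurrence of each key in insertion order, so the merged list follows the order in which values first appear in the group.

```python
import pandas as pd

# Hypothetical data: two rows for id 1, one row with an internal duplicate for id 2.
df = pd.DataFrame({
    "location_id": [1, 1, 2],
    "all_poi": [["a", "b", "c"], ["a", "d"], ["x", "x", "y"]],
})

def merge_ordered(lists):
    # Flatten the group's lists, then let dict keys drop later duplicates
    # while keeping first-seen order.
    flat = [item for sub in lists for item in sub]
    return list(dict.fromkeys(flat))

ordered = df.groupby("location_id")["all_poi"].agg(merge_ordered)
print(ordered)
```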

Another idea is to use unique:

f = lambda x: pd.unique(pd.Series([z for y in x for z in y])).tolist()
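A hedged sketch of the unique-based variant follows. It wraps the flattened values in a Series before deduplicating, because recent pandas releases warn when pd.unique receives a plain Python list. The data and the helper name merge_with_unique are illustrative assumptions.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "location_id": [1, 1, 2],
    "all_poi": [["a", "b", "c"], ["a", "d"], ["x", "x", "y"]],
})

def merge_with_unique(lists):
    # Flatten the group's lists into an object-dtype Series,
    # then deduplicate in order of first appearance.
    flat = pd.Series([item for sub in lists for item in sub], dtype="object")
    return flat.unique().tolist()

deduped = df.groupby("location_id")["all_poi"].agg(merge_with_unique)
print(deduped)
```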

Edit:

If there are multiple columns and you need the first value of each group:

df.groupby('location_id').agg({'city': 'first', 'all_poi': f}).reset_index()

If you need other aggregation methods, such as sum, mean, or join:

df.groupby('location_id').agg({'city': 'first', 
                               'all_poi': f, 
                               'cols1':'sum', 
                               'vals': ','.join, 
                               'vals1': lambda x: list(x)}).reset_index()
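The dict-based call above can be tried end to end with the following sketch. The column names cols1 and vals come from the answer, while the rows, the second city label, and the helper merge_ordered are assumptions added for illustration.

```python
import pandas as pd

# Hypothetical rows; only the column names follow the answer.
df = pd.DataFrame({
    "location_id": [1, 1, 2],
    "city": ["Port Severn", "Port Severn", "Other City"],
    "all_poi": [["a", "b"], ["b", "c"], ["d"]],
    "cols1": [10, 5, 7],
    "vals": ["x", "y", "z"],
})

def merge_ordered(lists):
    # Order-preserving merge of the group's lists.
    return list(dict.fromkeys(item for sub in lists for item in sub))

result = (
    df.groupby("location_id")
      .agg({
          "city": "first",           # first city seen in each group
          "all_poi": merge_ordered,  # merged, deduplicated list
          "cols1": "sum",            # numeric total
          "vals": ",".join,          # comma-separated strings
      })
      .reset_index()
)
print(result)
```

Each key in the dict picks its own aggregation, so list merging can sit next to ordinary numeric and string reductions in one pass.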

Answer 1 (score: 0)

A simple sum():

res=df.groupby(["location_id"], as_index=False).agg({"city": "last", "all_poi": "sum"})
res["all_poi"]=res["all_poi"].map(set)

Output:

Before:
   location_id  ...                                                                all_poi
0  6265981      ...  [Mount St. Louis Moonstone, Horseshoe Valley, Lake Muskoka]
1  6265981      ...  [Mount St. Louis Moonstone, Little Lake Park, Bamboo Spa, Lake Huron]

After:
   location_id  ...                                                                                                all_poi
0  6265981      ...  {Horseshoe Valley, Lake Muskoka, Lake Huron, Bamboo Spa, Little Lake Park, Mount St. Louis Moonstone}
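Summing lists creates a new list at every addition, which can become slow for large groups. One possible alternative, which is not part of this answer, is to flatten with itertools.chain.from_iterable instead. The sketch below assumes the question's two rows and uses a hypothetical helper flatten_unique. It returns a sorted list rather than a set so the column stays a list.

```python
from itertools import chain

import pandas as pd

# Sample rows modelled on the question's data.
df = pd.DataFrame({
    "location_id": [6265981, 6265981],
    "city": ["Port Severn", "Port Severn"],
    "all_poi": [
        ["Mount St. Louis Moonstone", "Horseshoe Valley", "Lake Muskoka"],
        ["Mount St. Louis Moonstone", "Little Lake Park", "Bamboo Spa", "Lake Huron"],
    ],
})

def flatten_unique(lists):
    # chain.from_iterable walks every sublist once without building
    # intermediate lists; set() removes duplicates, sorted() fixes the order.
    return sorted(set(chain.from_iterable(lists)))

res = df.groupby("location_id", as_index=False).agg(
    {"city": "last", "all_poi": flatten_unique}
)
print(res)
```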

Answer 2 (score: 0)

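The other answers look more compact, but you can also use sum with groupby to merge the lists, then create a set to eliminate duplicates, and finally convert the set back to a list: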

import pandas as pd

df = pd.DataFrame([['1' ,'New York', ['a','b','c']], ['1', 'New York', ['a','d']]],
                   columns = ['location_id', 'city','all_poi'])

# Concatenate the lists per group, drop duplicates with a set, convert back to a list
df.groupby('location_id')['all_poi'].sum().apply(set).apply(list)
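For reference, the same sum, set, list pipeline can be written as a single aggregation that returns a regular DataFrame. This is only a sketch built on the frame defined above: it calls Python's built-in sum with an empty list as the start value, and sorts the result because a set has no guaranteed order.

```python
import pandas as pd

df = pd.DataFrame(
    [["1", "New York", ["a", "b", "c"]], ["1", "New York", ["a", "d"]]],
    columns=["location_id", "city", "all_poi"],
)

merged = (
    df.groupby("location_id")["all_poi"]
      # sum(lists, []) concatenates the group's lists; set() drops duplicates.
      .agg(lambda lists: sorted(set(sum(lists, []))))
      .reset_index()
)
print(merged)
```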