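An AWS S3 folder contains a large number of files. I want to read the files from each folder in Python, then compare and merge the DataFrames so that when a value in a particular column also appears in another DataFrame, the corresponding list values are appended to the existing list, and rows with no match are added as new rows.
For example, DataFrame df1: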
A B
books [book1, book2, book3]
animal [animal1, animal2, animal3]
place [place1, place2, place3]
DataFrame df2:
A B
name [name1, name2, name3]
animal [animal 5, animal 6]
Then the result should be df:
A B
books [book1, book2, book3]
animal [animal1, animal2, animal3, animal5, animal6]
place [place1, place2, place3]
name [name1, name2, name3]
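Answer 0 (score: 0)
Join all the DataFrames with concat, then group by A and aggregate the values of B by flattening each group's list of lists inside a lambda function: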
import pandas as pd

# Stack both frames, then flatten each group's list of lists into one list
df = pd.concat([df1, df2], ignore_index=True)
f = lambda x: [z for y in x for z in y]
df3 = df.groupby('A', sort=False)['B'].apply(f).reset_index()
print(df3)
A B
0 books [book1, book2, book3]
1 animal [animal1, animal2, animal3, animal 5, animal 6]
2 place [place1, place2, place3]
3 name [name1, name2, name3]
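For reference, here is a self-contained sketch of the same concat-and-flatten approach with the sample data built inline. The frame contents are copied from the question (with `animal5` and `animal6` written without spaces), and the helper name `merge_list_frames` is only illustrative.

```python
import pandas as pd

# Sample frames taken from the question; column B holds Python lists.
df1 = pd.DataFrame({
    "A": ["books", "animal", "place"],
    "B": [["book1", "book2", "book3"],
          ["animal1", "animal2", "animal3"],
          ["place1", "place2", "place3"]],
})
df2 = pd.DataFrame({
    "A": ["name", "animal"],
    "B": [["name1", "name2", "name3"],
          ["animal5", "animal6"]],
})


def merge_list_frames(frames):
    """Concatenate the frames and flatten the B lists for each value of A."""
    combined = pd.concat(frames, ignore_index=True)

    def flatten(lists):
        # lists is the group's Series of lists; join them into one flat list
        return [item for sub in lists for item in sub]

    # sort=False keeps the keys in order of first appearance
    return combined.groupby("A", sort=False)["B"].apply(flatten).reset_index()


merged = merge_list_frames([df1, df2])
print(merged)
```

Because the helper takes a list of frames, it should work the same way with more than two DataFrames.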
Alternatively:
# Slow on large data: summing lists builds a new list at every step
df = df.groupby('A', sort=False)['B'].sum().reset_index()
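Since the question mentions a large number of files, it may also be worth avoiding one big concat and accumulating the lists frame by frame instead. The following is only a minimal sketch, assuming each file has already been loaded into a DataFrame with columns A and B. How the files are read from S3 is not shown, and `merge_incrementally` and `sample_frames` are hypothetical names used for illustration.

```python
from collections import defaultdict

import pandas as pd


def merge_incrementally(frames):
    """Accumulate the lists in column B per key in column A, one frame at a time."""
    merged = defaultdict(list)  # keys stay in first-seen order
    for frame in frames:
        for key, values in zip(frame["A"], frame["B"]):
            merged[key].extend(values)  # extend in place, no repeated copying
    return pd.DataFrame({"A": list(merged.keys()), "B": list(merged.values())})


def sample_frames():
    # Stand-in for frames loaded from files; yields them lazily, one by one.
    yield pd.DataFrame({"A": ["books", "animal"],
                        "B": [["book1"], ["animal1", "animal2"]]})
    yield pd.DataFrame({"A": ["name", "animal"],
                        "B": [["name1"], ["animal5"]]})


result = merge_incrementally(sample_frames())
print(result)
```

Any iterable of DataFrames can be passed in, including a generator, so only one input frame needs to be held in memory at a time alongside the accumulated result.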