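I'm not sure whether a single method, or even merging DataFrames at all, can do what I intend, or whether I need to write my own function with for loops.
I want to incrementally build a master DataFrame that contains every possible column from several smaller DataFrames whose columns vary. All of the DataFrames come from records that share the same naming convention, and rows with the same name should not be duplicated.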
import pandas as pd
df_master = pd.DataFrame(columns=('Names','Age','Hair','Breakfast','Lunch','Dinner'))
df_lunch = pd.DataFrame([['Joe',16,'red','sandwich'],['Mary',22,'brown','carrot']],columns=('Names','Age','Hair','Lunch'))
df_ingredients = pd.DataFrame([['Joe','ham']],columns=('Names','Lunch',))
df_breakfast = pd.DataFrame([['Joe','fruit loops'],['Mary','toast']],columns=('Names','Breakfast',))
df_master = pd.merge(df_master, df_lunch, on=['Names','Age','Hair','Lunch'], how='outer')
So far, so good (apart from the odd column order).
df_master = pd.merge(df_master, df_ingredients, on=['Names','Lunch'], how='outer')
Joe has gained a new row, and his ham was not added to his sandwich.
df_master = pd.merge(df_master, df_breakfast, on=['Names','Breakfast'], how='outer')
Joe and Mary each get a new row just to hold their breakfast.
df_base = pd.DataFrame(columns=('Names','Age','Hair','Breakfast','Lunch','Dinner'))
df_sofar = pd.DataFrame([['Joe',16,'red','fruit loops', 'sandwich, ham'],['Mary',22,'brown','toast','carrot']],columns=('Names','Age','Hair','Breakfast','Lunch'))
df_ideal = pd.merge(df_base, df_sofar, on=['Names','Age','Hair','Breakfast','Lunch'], how='outer')
This shows what I would like the final DataFrame from 2. to look like:
  Dinner Names  Age   Hair    Breakfast          Lunch
0    NaN   Joe   16    red  fruit loops  sandwich, ham
1    NaN  Mary   22  brown        toast         carrot
Am I thinking about this all wrong? Or is there something obvious I'm missing? Thanks!
Answer 0 (score: 2)
Let's try concat + groupby + agg:
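df = pd.concat(
    [df_master, df_lunch, df_ingredients, df_breakfast]
)
# Age is numeric, so ','.join cannot handle it: older pandas silently dropped
# the column, while newer versions raise an error. Leave it out of the join.
g = df.drop(columns='Age').groupby('Names', sort=False, as_index=False).agg(lambda x: ','.join(x.dropna()))
# Add Age back from df_lunch (this relies on its rows being in the same order as g).
g['Age'] = df_lunch['Age']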
  Names    Breakfast Dinner   Hair         Lunch  Age
0   Joe  fruit loops           red  sandwich,ham   16
1  Mary        toast         brown        carrot   22
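The assignment g['Age'] = df_lunch['Age'] matches rows by index, so it only gives the right result when df_lunch lists the names in the same order as g. Here is a minimal sketch that looks Age up by name instead. It assumes each name has a single Age value somewhere in the inputs; the names text_cols and age_by_name are mine, not part of the question.

```python
import pandas as pd

df_lunch = pd.DataFrame(
    [['Joe', 16, 'red', 'sandwich'], ['Mary', 22, 'brown', 'carrot']],
    columns=['Names', 'Age', 'Hair', 'Lunch'],
)
df_ingredients = pd.DataFrame([['Joe', 'ham']], columns=['Names', 'Lunch'])
df_breakfast = pd.DataFrame(
    [['Joe', 'fruit loops'], ['Mary', 'toast']], columns=['Names', 'Breakfast']
)

df = pd.concat([df_lunch, df_ingredients, df_breakfast], ignore_index=True)

# Join only the text columns; Age is numeric and is handled separately.
text_cols = ['Hair', 'Breakfast', 'Lunch']  # illustrative name
g = df.groupby('Names', sort=False, as_index=False)[text_cols].agg(
    lambda x: ','.join(x.dropna())
)

# Build a Names -> Age lookup from the rows that actually carry an Age,
# then map it onto g by name rather than by row position.
age_by_name = (
    df.dropna(subset=['Age'])
    .drop_duplicates('Names')
    .set_index('Names')['Age']
)
g['Age'] = g['Names'].map(age_by_name).astype('Int64')

print(g)
```

The nullable Int64 dtype keeps Age as an integer even if some name turns out to have no Age at all.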
Alternative
If you convert everything to strings inside the groupby, Age can be joined like the other columns:
df = pd.concat(
    [df_master, df_lunch, df_ingredients, df_breakfast]
)
df.groupby('Names', sort=False, as_index=False).agg(
    lambda x: ','.join(x.dropna().astype(str))
)
  Names   Age    Breakfast Dinner   Hair         Lunch
0   Joe  16.0  fruit loops           red  sandwich,ham
1  Mary  22.0        toast         brown        carrot
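Age shows up as 16.0 because concat fills the missing Age entries with NaN, which turns the column into floats before it is cast to strings. If that matters, one possible variation is to aggregate each column in its own way: take the first non-missing value for Age and Hair, and join only the meal columns. This is a sketch under the assumption that Age and Hair have one value per name; join_text is a helper name I made up.

```python
import pandas as pd

df_lunch = pd.DataFrame(
    [['Joe', 16, 'red', 'sandwich'], ['Mary', 22, 'brown', 'carrot']],
    columns=['Names', 'Age', 'Hair', 'Lunch'],
)
df_ingredients = pd.DataFrame([['Joe', 'ham']], columns=['Names', 'Lunch'])
df_breakfast = pd.DataFrame(
    [['Joe', 'fruit loops'], ['Mary', 'toast']], columns=['Names', 'Breakfast']
)

df = pd.concat([df_lunch, df_ingredients, df_breakfast], ignore_index=True)


def join_text(values):
    """Join the non-missing values of one column within a group."""
    return ','.join(values.dropna().astype(str))


# Named aggregation: one rule per output column.
g = df.groupby('Names', sort=False, as_index=False).agg(
    Age=('Age', 'first'),        # first non-missing Age per name
    Hair=('Hair', 'first'),      # first non-missing Hair per name
    Breakfast=('Breakfast', join_text),
    Lunch=('Lunch', join_text),
)

# concat turned Age into floats; convert back to a nullable integer column.
g['Age'] = g['Age'].astype('Int64')

print(g)
```

A side effect is that the column order is whatever you list in agg, which also takes care of the odd column order mentioned in the question.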