Merge DFs with the same column name and different row counts, filling NaN in the repeated rows

Time: 2019-12-23 17:31:28

Tags: python pandas dataframe

For the purposes of this question, I generated the following two DataFrames:

import pandas as pd

df1 = pd.DataFrame({"model": [f"model{i//2}" for i in range(6)], "label": [f"label_{i}" for i in range(6)], "data": [f"data_{i}" for i in range(6)]})
df1 = df1.set_index("model")

df2 = pd.DataFrame({"model": [f"model{i}" for i in range(3)], "info": [f"info_{i}" for i in range(3)], "stuff": [f"stuff_{i}" for i in range(3)]})
df2 = df2.set_index("model")

df1 looks like this:

[model]  label   data   
model0  label_0 data_0
model0  label_1 data_1
model1  label_2 data_2
model1  label_3 data_3
model2  label_4 data_4
model2  label_5 data_5

df2 looks like this:

[model]  info    stuff  
model0  info_0  stuff_0
model1  info_1  stuff_1
model2  info_2  stuff_2

[...] denotes the DataFrame's index. I would like to join the two DataFrames so that the output looks like this:

[model]  info    stuff  label   data   
model0  info_0  stuff_0 label_0 data_0
model0    NAN     NAN   label_1 data_1
model1  info_1  stuff_1 label_2 data_2
model1    NAN     NAN   label_3 data_3
model2  info_2  stuff_2 label_4 data_4
model2    NAN     NAN   label_5 data_5

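I can't seem to find any documentation on how to do this. I have tried many combinations of join, concat, and merge, but none of them produced this result. I know I could write a function to do it, but I was hoping it could be done with pandas' native join, concat, or merge functions.

If someone with more pandas experience could point me in the right direction, I would really appreciate it!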

2 Answers:

Answer 0: (score: 2)

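First, we reset the index so that we can merge the two DataFrames on the model column. Then you can use the duplicated method of pd.Series to mask the duplicates and fill them with NaN: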

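import numpy as np
import pandas as pd

# Move "model" from the index back into a regular column so it can serve as the merge key.
df1 = df1.reset_index(drop=False)
df2 = df2.reset_index(drop=False)
df_new = pd.merge(df1, df2, how='outer')  # merges on the shared "model" column
df_new = df_new.set_index('model')
# Flag, column by column, every value that repeats an earlier one in that column.
is_duplicated = df_new.apply(pd.Series.duplicated, axis=0)
df_new = df_new.where(~is_duplicated, np.nan)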

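The new DataFrame df_new is the desired result. Note that the mask is computed per column, so it would also blank out any repeated value in label or data, and any info or stuff value shared by two different models; in this example all of those values are unique.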
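If the values themselves are not guaranteed to be unique, one possible variant is to mask on the index instead of on the values. The sketch below is not part of the answer above; it assumes that "the first row of each model keeps the info and stuff values" is the intended rule, and the names out and dup are only illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"model": [f"model{i//2}" for i in range(6)],
                    "label": [f"label_{i}" for i in range(6)],
                    "data": [f"data_{i}" for i in range(6)]}).set_index("model")
df2 = pd.DataFrame({"model": [f"model{i}" for i in range(3)],
                    "info": [f"info_{i}" for i in range(3)],
                    "stuff": [f"stuff_{i}" for i in range(3)]}).set_index("model")

# Left join on the index: every df1 row picks up the info/stuff of its model.
out = df1.join(df2)

# True for every row whose index label has already appeared earlier.
dup = out.index.duplicated()

# Blank the df2 columns on those repeated rows only (boolean mask is positional).
out.loc[dup, list(df2.columns)] = np.nan

print(out)
```

Because the mask depends only on the index, repeated values inside label, data, info, or stuff are left untouched.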

Answer 1: (score: 2)

Here is another approach:

import pandas as pd

df1 = pd.DataFrame({"model": [f"model{i//2}" for i in range(6)], "label": [f"label_{i}" for i in range(6)], "data": [f"data_{i}" for i in range(6)]})
df1 = df1.set_index("model")

df2 = pd.DataFrame({"model": [f"model{i}" for i in range(3)], "info": [f"info_{i}" for i in range(3)], "stuff": [f"stuff_{i}" for i in range(3)]})
df2 = df2.set_index("model")

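# First row of each model; only these rows receive the info/stuff columns.
df1_g = df1.groupby(by='model').first()
# DataFrame.append was removed in pandas 2.0, so the leftover rows are added with pd.concat.
rest = df1[~df1.isin(df1_g)].dropna()
print(pd.concat([pd.concat([df1_g, df2], axis=1), rest], sort=False).sort_index(kind='mergesort'))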

Output:

          label    data    info    stuff
model                                   
model0  label_0  data_0  info_0  stuff_0
model0  label_1  data_1     NaN      NaN
model1  label_2  data_2  info_1  stuff_1
model1  label_3  data_3     NaN      NaN
model2  label_4  data_4  info_2  stuff_2
model2  label_5  data_5     NaN      NaN
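Since the question asks for a solution built on the native merge function, here is a minimal sketch of one more option: number the rows within each model with groupby(...).cumcount() and merge on that counter as a second key. This is not taken from either answer; the helper column name n and the variable names left, right, and out are hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame({"model": [f"model{i//2}" for i in range(6)],
                    "label": [f"label_{i}" for i in range(6)],
                    "data": [f"data_{i}" for i in range(6)]}).set_index("model")
df2 = pd.DataFrame({"model": [f"model{i}" for i in range(3)],
                    "info": [f"info_{i}" for i in range(3)],
                    "stuff": [f"stuff_{i}" for i in range(3)]}).set_index("model")

# Number the rows inside each model: 0 for the first row, 1 for the second, ...
left = df1.reset_index()
left["n"] = left.groupby("model").cumcount()

# Every df2 row is tagged with n=0 so it can only match the first row of its model.
right = df2.reset_index().assign(n=0)

# Left merge on both keys; rows with n > 0 find no partner and get NaN.
out = (left.merge(right, on=["model", "n"], how="left")
           .drop(columns="n")
           .set_index("model"))

print(out)
```

The counter makes the match explicit, so the result does not depend on whether any of the values happen to repeat.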