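Combining datasets on the index of df1 in pandas

Time: 2017-07-12 15:40:38

Tags: python pandas merge

I have a dataset (df1) that I want to populate with data from a second dataset (df2). Only one column overlaps between the two DataFrames, and I set that column as the index of both df1 and df2 so that I can merge on the index.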

import pandas as pd

df1 = pd.read_excel('Data.xlsx', sheet_name='Dataset1')
df2 = pd.read_excel('Data.xlsx', sheet_name='Dataset2')
df1.set_index("ORG_ID", inplace=True)
df2.set_index("ORG_ID", inplace=True)
# Bring in only the df2 columns that df1 does not already have
new_cols = df2.loc[:, ~df2.columns.isin(df1.columns)]
df3 = df1.merge(new_cols, left_index=True, right_index=True, how="outer")

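The output I want is a new dataset (df3) that lists all the data from df1, including the index (the ORG_IDs), plus all the new columns from df2, filled in for the ORG_IDs that are listed in df1. What pandas does instead is give me a new DataFrame (df3) populated with df1's data and then append every ORG_ID from the second dataset (df2) below df1's ORG_IDs, which is not what I want.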
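The behaviour described here can be reproduced on a small scale. The following is a minimal sketch, assuming a trimmed-down copy of the sample tables (three rows each, two columns, typed inline rather than read from Data.xlsx). It contrasts how="outer", which keeps the union of both indexes, with how="left", which keeps only df1's index.

```python
import pandas as pd

# Trimmed copies of the sample tables (a few rows only, typed inline;
# the real data would come from Data.xlsx).
df1 = pd.DataFrame(
    {"ORG_ID": [1, 2, 3], "COUNTRY": ["Spain", "Greece", "U.K"]}
).set_index("ORG_ID")
df2 = pd.DataFrame(
    {"ORG_ID": [24, 89, 1], "CUSTOMER": ["C", "G", "T"]}
).set_index("ORG_ID")

# how="outer" keeps the union of both indexes, so ORG_IDs 24 and 89
# (present only in df2) show up as extra rows.
outer = df1.merge(df2, left_index=True, right_index=True, how="outer")

# how="left" keeps only df1's index; df1 rows without a match in df2
# get NaN in the df2 columns.
left = df1.merge(df2, left_index=True, right_index=True, how="left")

print(outer)
print(left)
```

In this sketch the outer result has five rows and the left result has three, which matches the difference between the output described and the output wanted.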

I also tried combine_first, but it seems to produce a similar result.

df3= df1.combine_first(df2)
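One likely reason for the similar result is that combine_first returns the union of both indexes (and of both sets of columns). The sketch below illustrates this on a trimmed-down copy of the sample tables typed inline. Restricting the result back to df1's index with .loc is one possible workaround, offered here as an assumption rather than as the accepted approach.

```python
import pandas as pd

# Trimmed copies of the sample tables (a few rows only, typed inline).
df1 = pd.DataFrame(
    {"ORG_ID": [1, 2, 3], "COUNTRY": ["Spain", "Greece", "U.K"]}
).set_index("ORG_ID")
df2 = pd.DataFrame(
    {"ORG_ID": [24, 89, 1], "CUSTOMER": ["C", "G", "T"]}
).set_index("ORG_ID")

# combine_first aligns on the union of row and column labels, so the
# ORG_IDs that exist only in df2 (24 and 89) are part of the result.
combined = df1.combine_first(df2)

# Selecting df1's index afterwards drops the rows that came only from df2.
restricted = combined.loc[df1.index]

print(combined)
print(restricted)
```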


Dataset1 (df1)
ORG_ID  COUNTRY  TOWN        STORE  PRODUCT    PRICE
1       Spain    Madrid      Pink   Garment    100
2       Greece   Chania      White  Toy        200
3       U.K      Manchester  Red    Garment    300
4       Italy    Rome        Red    Accessory  500
5       Spain    Marbella    Blue   Accessory  20
6       Greece   Chania      Green  Garment    25
7       U.K      Manchester  Pink   Toy        36
8       Italy    Siena       Red    Accessory  150
9       Spain    Barcelona   White  Toy        200
10      Greece   Corfu       Blue   Accessory  500

Dataset2 (df2)

ORG_ID  CUSTOMER  TYPE  PARENT  REGION
5       A         Pop   Rose    Europe
10      A         Cry   Tulip   Europe
24      C         Fig   Lily    Europe
89      G         Pop   Rose    Europe
6       R         Fig   Lily    Europe
4       Y         Pop   Rose    Europe
1       T         Fig   Tulip   Europe
7       H         Pop   Tulip   Europe
8       S         Fig   Rose    Europe

Dataset3 (df3) - what I want

ORG_ID  COUNTRY  TOWN        STORE  PRODUCT    PRICE  CUSTOMER  TYPE  PARENT  REGION
1       Spain    Madrid      Pink   Garment    100    T         Fig   Tulip   Europe
2       Greece   Chania      White  Toy        200    NaN       NaN   NaN     NaN
3       U.K      Manchester  Red    Garment    300    NaN       NaN   NaN     NaN
4       Italy    Rome        Red    Accessory  500    Y         Pop   Rose    Europe
5       Spain    Marbella    Blue   Accessory  20     A         Pop   Rose    Europe
6       Greece   Chania      Green  Garment    25     R         Fig   Lily    Europe
7       U.K      Manchester  Pink   Toy        36     H         Pop   Tulip   Europe
8       Italy    Siena       Red    Accessory  150    S         Fig   Rose    Europe
9       Spain    Barcelona   White  Toy        200    NaN       NaN   NaN     NaN
10      Greece   Corfu       Blue   Accessory  500    A         Cry   Tulip   Europe

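1 answer:

Answer 0: (score: 2)

Don't call set_index on your source DataFrames. You can use merge with the on parameter together with how='left'. A left join keeps every row of df1 and drops the ORG_IDs that appear only in df2; df1 rows without a match get NaN in the df2 columns.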

import pandas as pd

df1 = pd.read_excel('Data.xlsx', sheet_name='Dataset1')
df2 = pd.read_excel('Data.xlsx', sheet_name='Dataset2')

# Left join on ORG_ID: keep every df1 row, pull in matching df2 columns
df3 = df1.merge(df2, how='left', on='ORG_ID')

Output:

   ORG_ID COUNTRY        TOWN  STORE    PRODUCT  PRICE CUSTOMER TYPE PARENT  \
0       1   Spain      Madrid   Pink    Garment    100        T  Fig  Tulip   
1       2  Greece      Chania  White        Toy    200      NaN  NaN    NaN   
2       3     U.K  Manchester    Red    Garment    300      NaN  NaN    NaN   
3       4   Italy        Rome    Red  Accessory    500        Y  Pop   Rose   
4       5   Spain    Marbella   Blue  Accessory     20        A  Pop   Rose   
5       6  Greece      Chania  Green    Garment     25        R  Fig   Lily   
6       7     U.K  Manchester   Pink        Toy     36        H  Pop  Tulip   
7       8   Italy       Siena    Red  Accessory    150        S  Fig   Rose   
8       9   Spain   Barcelona  White        Toy    200      NaN  NaN    NaN   
9      10  Greece       Corfu   Blue  Accessory    500        A  Cry  Tulip   

   REGION  
0  Europe  
1     NaN  
2     NaN  
3  Europe  
4  Europe  
5  Europe  
6  Europe  
7  Europe  
8     NaN  
9  Europe
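
For readers without Data.xlsx, the same left merge can be tried on the sample tables typed inline. The sketch below is an illustration only: the inline DataFrames stand in for the Excel sheets, and the indicator=True flag and the index-based DataFrame.join variant are additions that are not part of the answer above. The join variant may be useful if ORG_ID should stay as the index, as in the question.

```python
import pandas as pd

# Sample tables from the question, typed inline instead of read from Excel.
df1 = pd.DataFrame({
    "ORG_ID": list(range(1, 11)),
    "COUNTRY": ["Spain", "Greece", "U.K", "Italy", "Spain",
                "Greece", "U.K", "Italy", "Spain", "Greece"],
    "TOWN": ["Madrid", "Chania", "Manchester", "Rome", "Marbella",
             "Chania", "Manchester", "Siena", "Barcelona", "Corfu"],
    "STORE": ["Pink", "White", "Red", "Red", "Blue",
              "Green", "Pink", "Red", "White", "Blue"],
    "PRODUCT": ["Garment", "Toy", "Garment", "Accessory", "Accessory",
                "Garment", "Toy", "Accessory", "Toy", "Accessory"],
    "PRICE": [100, 200, 300, 500, 20, 25, 36, 150, 200, 500],
})
df2 = pd.DataFrame({
    "ORG_ID": [5, 10, 24, 89, 6, 4, 1, 7, 8],
    "CUSTOMER": ["A", "A", "C", "G", "R", "Y", "T", "H", "S"],
    "TYPE": ["Pop", "Cry", "Fig", "Pop", "Fig", "Pop", "Fig", "Pop", "Fig"],
    "PARENT": ["Rose", "Tulip", "Lily", "Rose", "Lily",
               "Rose", "Tulip", "Tulip", "Rose"],
    "REGION": ["Europe"] * 9,
})

# Column-based left merge. indicator=True adds a "_merge" column that
# records whether each df1 row found a match in df2.
df3 = df1.merge(df2, how="left", on="ORG_ID", indicator=True)

# Index-based variant: DataFrame.join aligns on the index and defaults
# to how="left", so ORG_ID stays as the index of the result.
df3_indexed = df1.set_index("ORG_ID").join(df2.set_index("ORG_ID"))

# ORG_IDs from df1 that have no counterpart in df2
unmatched = df3.loc[df3["_merge"] == "left_only", "ORG_ID"].tolist()

print(df3)
print(df3_indexed)
print(unmatched)
```

In both variants, ORG_IDs 24 and 89 from df2 do not appear in the result, because a left join only keeps keys found in df1.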