Replacing missing values produced by a pandas merge

Time: 2016-10-20 18:22:43

Tags: python pandas merge

df1

|Invoice #  |Date        |Amount       
|12         |12/15/2015  |$10 
|13         |12/16/2015  |$11 
|14         |12/17/2015  |$12 

df2

|Invoice #  |Date        |Amount
|12         |1/16/2016   |$10 
|14         |1/17/2016   |$12 

Merged = df1.merge(df2, how='left', on='Invoice #')

|Invoice #  |Date         |Amount
|12         |12/15/2015   |$10
|NaN        |NaN          |NaN
|14         |1/17/2016    |$12

What I would like to do is take Invoice 13, which returns NaN values in the merge, and put it into a list. Any ideas?

2 answers:

Answer 0: (score: 1)

Your merged result does not show what actually happens with a left merge.

Here's what I get when I try to reproduce what I think you're trying to do (I'm using pandas version 0.19.0):

merged = df1.merge(df2, how='left', on='Invoice #')
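To make this reproducible, here is a minimal sketch that rebuilds the two frames from the question and runs the left merge. Keeping the dates and amounts as plain strings is an assumption made for brevity. Because `Date` and `Amount` exist in both frames, pandas appends the default `_x` and `_y` suffixes to them.

```python
import pandas as pd

# Sample frames copied from the question; values kept as strings (assumption).
df1 = pd.DataFrame({
    'Invoice #': [12, 13, 14],
    'Date': ['12/15/2015', '12/16/2015', '12/17/2015'],
    'Amount': ['$10', '$11', '$12'],
})
df2 = pd.DataFrame({
    'Invoice #': [12, 14],
    'Date': ['1/16/2016', '1/17/2016'],
    'Amount': ['$10', '$12'],
})

# A left merge keeps every row of df1. Invoice 13 has no match in df2,
# so only its df2-side columns (Date_y, Amount_y) come back as NaN.
merged = df1.merge(df2, how='left', on='Invoice #')
print(merged)
```

Note that the `Invoice #` key itself is never NaN in a left merge; only the columns coming from df2 are.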


Then you can mask by the missing values and get a dataframe containing those rows:

merged[merged['Amount_y'].isnull()]
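To get from that mask to the list the question asks for, one option is to pull the `Invoice #` column out of the masked frame and call `tolist()`. This is a sketch on a trimmed copy of the sample data (the `Date` column is left out for brevity); the name `missing_invoices` is only illustrative.

```python
import pandas as pd

# Trimmed sample data: only the key and the Amount column.
df1 = pd.DataFrame({'Invoice #': [12, 13, 14], 'Amount': ['$10', '$11', '$12']})
df2 = pd.DataFrame({'Invoice #': [12, 14], 'Amount': ['$10', '$12']})
merged = df1.merge(df2, how='left', on='Invoice #')

# Rows of df1 with no partner in df2 have NaN in the df2-side column.
masked = merged[merged['Amount_y'].isnull()]

# Extract the invoice numbers of those rows as a plain Python list.
missing_invoices = masked['Invoice #'].tolist()
print(missing_invoices)  # [13]
```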


Or just create a column with the boolean flag:

merged['missing_from_df2'] = merged['Amount_y'].isnull()
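As an alternative that is not part of this answer, `merge` also accepts `indicator=True`, which adds a `_merge` column saying whether each row came from the left frame only, the right frame only, or both. The sketch below builds the same boolean flag from it, which avoids relying on a particular data column being NaN.

```python
import pandas as pd

df1 = pd.DataFrame({'Invoice #': [12, 13, 14], 'Amount': ['$10', '$11', '$12']})
df2 = pd.DataFrame({'Invoice #': [12, 14], 'Amount': ['$10', '$12']})

# indicator=True adds a '_merge' column with the values
# 'left_only', 'right_only' or 'both'.
merged = df1.merge(df2, how='left', on='Invoice #', indicator=True)

# Flag the rows that exist in df1 but have no match in df2.
merged['missing_from_df2'] = merged['_merge'] == 'left_only'
print(merged[['Invoice #', '_merge', 'missing_from_df2']])
```

This can be more robust when df2 itself legitimately contains NaN in `Amount`, since a matched row would then also look "missing" under the `isnull()` test.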

To select things from the masked dataframe, treat it like any other dataframe, and index into one or more columns by listing them (note that if you want more than one, you have to do double brackets).


You can save it to a new variable to make the syntax simpler if you want to do other things with it.
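The column selection and the saved variable described above might look like the following sketch. The variable names `masked` and `masked_selection` echo the labels in this answer; the choice of columns is only an example.

```python
import pandas as pd

df1 = pd.DataFrame({'Invoice #': [12, 13, 14], 'Amount': ['$10', '$11', '$12']})
df2 = pd.DataFrame({'Invoice #': [12, 14], 'Amount': ['$10', '$12']})
merged = df1.merge(df2, how='left', on='Invoice #')

# Save the masked frame once so later expressions stay short.
masked = merged[merged['Amount_y'].isnull()]

# Single brackets with one label return a Series.
one_column = masked['Invoice #']

# Double brackets (a list of labels) return a DataFrame with those columns.
masked_selection = masked[['Invoice #', 'Amount_x']]
print(masked_selection)
```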


Answer 1: (score: 0)

method 1
pd.concat + drop_duplicates

pd.concat([df1, df2]).drop_duplicates(subset=['Invoice #'])
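A runnable sketch of method 1 on the sample data from the question follows. The point to keep in mind is that `drop_duplicates` keeps the first occurrence by default, so putting df1 first in the `concat` list means df1's row wins whenever an invoice appears in both frames.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Invoice #': [12, 13, 14],
    'Date': ['12/15/2015', '12/16/2015', '12/17/2015'],
    'Amount': ['$10', '$11', '$12'],
})
df2 = pd.DataFrame({
    'Invoice #': [12, 14],
    'Date': ['1/16/2016', '1/17/2016'],
    'Amount': ['$10', '$12'],
})

# Stack both frames, then keep only the first row seen for each invoice.
# df1 comes first, so its rows take priority over df2's.
result = pd.concat([df1, df2]).drop_duplicates(subset=['Invoice #'])
print(result)
```

Invoices present only in df2 would also survive this step, which is a difference from a left merge.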

method 2
combine_first

df1.set_index('Invoice #').combine_first(df2.set_index('Invoice #')).reset_index()
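To show what `combine_first` adds, the sketch below uses a hypothetical variation of the sample data in which df1 is missing the amount for invoice 14. That gap is an assumption made for illustration and is not in the question. `combine_first` aligns the two frames on the index and fills df1's missing cells from df2, leaving df1's existing values alone.

```python
import pandas as pd

# Hypothetical variation: df1 has a missing Amount for invoice 14.
df1 = pd.DataFrame({'Invoice #': [12, 13, 14], 'Amount': ['$10', '$11', None]})
df2 = pd.DataFrame({'Invoice #': [12, 14], 'Amount': ['$10', '$12']})

# Index both frames by invoice so values are aligned per invoice,
# then fill df1's missing cells with the matching cells from df2.
result = (
    df1.set_index('Invoice #')
       .combine_first(df2.set_index('Invoice #'))
       .reset_index()
)
print(result)
```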

method 3
merge

df1.merge(df2, on='Invoice #', suffixes=['', '_'], how='left')[df1.columns]

method 4
join

df1.join(df2.set_index('Invoice #'), on='Invoice #', rsuffix='_')[df1.columns]
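Methods 3 and 4 rely on the same trick, sketched here on the sample data: give df1's columns an empty suffix and df2's a `_` suffix, then select only df1's original column names so the df2-side columns are dropped. The suffixes are written as a tuple in this sketch rather than a list.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Invoice #': [12, 13, 14],
    'Date': ['12/15/2015', '12/16/2015', '12/17/2015'],
    'Amount': ['$10', '$11', '$12'],
})
df2 = pd.DataFrame({
    'Invoice #': [12, 14],
    'Date': ['1/16/2016', '1/17/2016'],
    'Amount': ['$10', '$12'],
})

# Method 3: df1's columns keep their names, df2's become 'Date_' and 'Amount_'.
via_merge = df1.merge(df2, on='Invoice #', suffixes=('', '_'), how='left')[df1.columns]

# Method 4: join df1's 'Invoice #' column against df2's index (left join by default).
via_join = df1.join(df2.set_index('Invoice #'), on='Invoice #', rsuffix='_')[df1.columns]

print(via_merge)
print(via_join)
```

Since only df1's columns are kept, the NaN values that the left merge puts in the df2-side columns never reach the result.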

For the sample data, all four produce the same result: the three rows of df1 (invoices 12, 13 and 14) with df1's Date and Amount values.


timing
pd.concat + drop_duplicates is the fastest
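Relative speed can depend on the pandas version and on the size of the frames, so it may be worth measuring on your own data. The following is a rough sketch using `timeit` from the standard library; the run count of 50 and the dictionary names are arbitrary choices, and no particular ranking is implied.

```python
import timeit
import pandas as pd

df1 = pd.DataFrame({
    'Invoice #': [12, 13, 14],
    'Date': ['12/15/2015', '12/16/2015', '12/17/2015'],
    'Amount': ['$10', '$11', '$12'],
})
df2 = pd.DataFrame({
    'Invoice #': [12, 14],
    'Date': ['1/16/2016', '1/17/2016'],
    'Amount': ['$10', '$12'],
})

# The four methods from this answer, wrapped as zero-argument callables.
candidates = {
    'concat': lambda: pd.concat([df1, df2]).drop_duplicates(subset=['Invoice #']),
    'combine_first': lambda: df1.set_index('Invoice #')
                                .combine_first(df2.set_index('Invoice #'))
                                .reset_index(),
    'merge': lambda: df1.merge(df2, on='Invoice #', suffixes=('', '_'),
                               how='left')[df1.columns],
    'join': lambda: df1.join(df2.set_index('Invoice #'), on='Invoice #',
                             rsuffix='_')[df1.columns],
}

# Time each method over the same number of runs and print fastest first.
timings = {name: timeit.timeit(fn, number=50) for name, fn in candidates.items()}
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f'{name}: {seconds:.4f} s for 50 runs')
```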
