Compare pandas DataFrames and return rows missing from the first one

Time: 2015-10-26 15:39:52

Tags: python pandas dataframe

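I have two DataFrames and want to compare them and return the rows of the first one (df1) that are not in the second one (df2). I found a way to compare them and return the differences, but I can't figure out how to return only the df1 rows that are missing from df2.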

import pandas as pd
from pandas import Series, DataFrame

df1 = pd.DataFrame( { 
"City" : ["Chicago", "San Franciso", "Boston"] , 
"State" : ["Illinois", "California", "Massachusett"] } )

df2 = pd.DataFrame( { 
"City" : ["Chicago",  "Mmmmiami", "Dallas" , "Omaha"] , 
"State" : ["Illinois", "Florida", "Texas", "Nebraska"] } )



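# Stack both frames and renumber the rows
df = pd.concat([df1, df2])
df = df.reset_index(drop=True)

# Group identical rows; a group of size 1 is a row that occurs only once overall
df_gpby = df.groupby(list(df.columns))
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]
blah = df.reindex(idx)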

4 answers:

Answer 0 (score: 6)

Building on @EdChum's suggestion:

df = pd.merge(df1, df2, how='outer', indicator=True)
rows_in_df1_not_in_df2 = df[df['_merge']=='left_only'][df1.columns]

rows_in_df1_not_in_df2

|Index |City        |State       |
|------|------------|------------|
|1     |San Franciso|California  |
|2     |Boston      |Massachusett|

Answer 1 (score: 5)

IIUC, if you are using pandas version 0.17.0, you can use merge and set

indicator=True

This adds a column that indicates whether a row is present only in the lhs or only in the rhs.
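A minimal sketch of what that indicator column looks like, assuming the df1 and df2 from the question (rebuilt here so the snippet runs on its own) and pandas 0.17.0 or later. The variable names are only illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({
    "City": ["Chicago", "San Franciso", "Boston"],
    "State": ["Illinois", "California", "Massachusett"]})
df2 = pd.DataFrame({
    "City": ["Chicago", "Mmmmiami", "Dallas", "Omaha"],
    "State": ["Illinois", "Florida", "Texas", "Nebraska"]})

# With no `on` argument, merge joins on every shared column (City and State).
merged = pd.merge(df1, df2, how="outer", indicator=True)

# The added _merge column holds 'left_only', 'right_only' or 'both'.
merge_flags = merged["_merge"].astype(str).tolist()

# Keep the rows that came from df1 alone, then drop the helper column.
only_in_df1 = merged[merged["_merge"] == "left_only"].drop("_merge", axis=1)
print(only_in_df1)
```

Filtering on 'right_only' instead would give the rows of df2 that are absent from df1.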

Answer 2 (score: 0)

If you are on pandas < 0.17.0, you could work along these lines:

In [182]: df = pd.merge(df1, df2, on='City', how='outer')

In [183]: df
Out[183]:
           City       State_x   State_y
0       Chicago      Illinois  Illinois
1  San Franciso    California       NaN
2        Boston  Massachusett       NaN
3      Mmmmiami           NaN   Florida
4        Dallas           NaN     Texas
5         Omaha           NaN  Nebraska

In [184]: df.loc[df['State_y'].isnull(),:]
Out[184]:
           City       State_x State_y
1  San Franciso    California     NaN
2        Boston  Massachusett     NaN
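Note that the merge above joins on City only, so a row whose city matches but whose state differs would not be reported. Below is a minimal sketch of a variant that compares whole rows without relying on indicator, assuming the df1 and df2 from the question. The marker column name _in_df2 is made up for this illustration:

```python
import pandas as pd

df1 = pd.DataFrame({
    "City": ["Chicago", "San Franciso", "Boston"],
    "State": ["Illinois", "California", "Massachusett"]})
df2 = pd.DataFrame({
    "City": ["Chicago", "Mmmmiami", "Dallas", "Omaha"],
    "State": ["Illinois", "Florida", "Texas", "Nebraska"]})

# Tag every df2 row with a marker column (hypothetical name).
df2_marked = df2.copy()
df2_marked["_in_df2"] = True

# A left merge keeps every df1 row; the marker is NaN where df2 had no match.
merged = pd.merge(df1, df2_marked, on=["City", "State"], how="left")

# Rows with a missing marker exist only in df1.
missing = merged[merged["_in_df2"].isnull()][list(df1.columns)]
print(missing)
```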

Answer 3 (score: 0)

You can also use a list comprehension and compare the rows to return the missing elements:

dif_list = [x for x in list(df1['City'].unique()) if x not in list(df2['City'].unique())]

This returns:

['San Franciso', 'Boston']

You can then get a dataframe that contains only the differing rows:

dfdif = df1[(df1['City'].isin(dif_list))]
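The approach above looks at the City column only. As a hedged sketch, the same idea can be written more directly with a negated isin, or extended to whole rows by comparing (City, State) tuples against a set. This assumes the df1 and df2 from the question, and the names by_city, df2_rows and by_row are illustrative only:

```python
import pandas as pd

df1 = pd.DataFrame({
    "City": ["Chicago", "San Franciso", "Boston"],
    "State": ["Illinois", "California", "Massachusett"]})
df2 = pd.DataFrame({
    "City": ["Chicago", "Mmmmiami", "Dallas", "Omaha"],
    "State": ["Illinois", "Florida", "Texas", "Nebraska"]})

# City-only comparison in one step: negate isin instead of building a list.
by_city = df1[~df1["City"].isin(df2["City"])]

# Whole-row comparison: collect df2 rows as a set of (City, State) tuples.
df2_rows = set(map(tuple, df2[["City", "State"]].values))

# Boolean mask aligned to df1's index: True where the row is absent from df2.
mask = pd.Series(
    [tuple(row) not in df2_rows for row in df1[["City", "State"]].values],
    index=df1.index)
by_row = df1[mask]
print(by_row)
```

On this data both selections return the same two rows. They would differ if a city appeared in both frames with different states.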