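I have two DataFrames and want to compare them, returning the rows of the first (df1) that are not in the second (df2). I found a way to compare them and return the differences, but I can't figure out how to return only the rows from df1 that are missing from df2.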
import pandas as pd
from pandas import Series, DataFrame
df1 = pd.DataFrame( {
"City" : ["Chicago", "San Franciso", "Boston"] ,
"State" : ["Illinois", "California", "Massachusett"] } )
df2 = pd.DataFrame( {
"City" : ["Chicago", "Mmmmiami", "Dallas" , "Omaha"] ,
"State" : ["Illinois", "Florida", "Texas", "Nebraska"] } )
df = pd.concat([df1, df2])
df = df.reset_index(drop=True)
df_gpby = df.groupby(list(df.columns))
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]
blah = df.reindex(idx)
Answer 0 (score: 6)
Building on @EdChum's suggestion:
df = pd.merge(df1, df2, how='outer', indicator=True)
rows_in_df1_not_in_df2 = df[df['_merge']=='left_only'][df1.columns]
rows_in_df1_not_in_df2
|Index |City |State |
|------|------------|------------|
|1 |San Franciso|California |
|2 |Boston |Massachusett|
Answer 1 (score: 5)
IIUC, if you are using pandas version 0.17.0, then you can use merge and set indicator=True.
This adds a column indicating whether a row is present only in the lhs or only in the rhs.
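The call described in this answer might look like the following minimal sketch, using df1 and df2 from the question. The variable names merged and left_only are illustrative and not taken from the original answer, and drop(columns=...) assumes a reasonably recent pandas.

```python
import pandas as pd

df1 = pd.DataFrame({
    "City": ["Chicago", "San Franciso", "Boston"],
    "State": ["Illinois", "California", "Massachusett"]})
df2 = pd.DataFrame({
    "City": ["Chicago", "Mmmmiami", "Dallas", "Omaha"],
    "State": ["Illinois", "Florida", "Texas", "Nebraska"]})

# Outer merge on all shared columns; indicator=True adds a "_merge" column
# whose values are "left_only", "right_only" or "both".
merged = pd.merge(df1, df2, how="outer", indicator=True)

# Keep only the rows that came from df1 and have no match in df2,
# then drop the helper column again.
left_only = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(left_only)
```

Because no on= argument is given, the merge compares every column the two frames share, so a row counts as present in df2 only if both City and State match.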
Answer 2 (score: 0)
If you are on pandas < 0.17.0 (or newer), you can work along these lines:
In [182]: df = pd.merge(df1, df2, on='City', how='outer')
In [183]: df
Out[183]:
           City       State_x   State_y
0       Chicago      Illinois  Illinois
1  San Franciso    California       NaN
2        Boston  Massachusett       NaN
3      Mmmmiami           NaN   Florida
4        Dallas           NaN     Texas
5         Omaha           NaN  Nebraska
In [184]: df.loc[df['State_y'].isnull(), :]
Out[184]:
           City       State_x  State_y
1  San Franciso    California      NaN
2        Boston  Massachusett      NaN
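The merge in this answer matches on City only, so a df1 row whose City also appears in df2 with a different State would be treated as present. As a rough sketch of one way to compare on every column without the indicator argument, df2 can be tagged with a helper column before a left merge. The column name _in_df2 and the variable names tagged, merged and missing are hypothetical, chosen for illustration only.

```python
import pandas as pd

df1 = pd.DataFrame({
    "City": ["Chicago", "San Franciso", "Boston"],
    "State": ["Illinois", "California", "Massachusett"]})
df2 = pd.DataFrame({
    "City": ["Chicago", "Mmmmiami", "Dallas", "Omaha"],
    "State": ["Illinois", "Florida", "Texas", "Nebraska"]})

# "_in_df2" is a hypothetical helper column, not part of pandas.
# Duplicates are dropped first so the left merge cannot multiply df1 rows.
tagged = df2.drop_duplicates().assign(_in_df2=True)

# Left merge on every column of df1; rows with no match get NaN in "_in_df2".
merged = pd.merge(df1, tagged, on=list(df1.columns), how="left")

# Keep the unmatched rows and only the original df1 columns.
missing = merged.loc[merged["_in_df2"].isnull(), list(df1.columns)]
print(missing)
```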
Answer 3 (score: 0)
You can also use a list comprehension and compare the rows to return the missing elements:
dif_list = [x for x in list(df1['City'].unique()) if x not in list(df2['City'].unique())]
This returns:
['San Franciso', 'Boston']
You can then get a dataframe containing only the differing rows:
dfdif = df1[(df1['City'].isin(dif_list))]
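The same City-based filter can probably be written without the intermediate list by negating isin directly. This is a minimal sketch using the question's data; the name dfdif_direct is illustrative. Like the list comprehension, it compares only the City column and ignores State.

```python
import pandas as pd

df1 = pd.DataFrame({
    "City": ["Chicago", "San Franciso", "Boston"],
    "State": ["Illinois", "California", "Massachusett"]})
df2 = pd.DataFrame({
    "City": ["Chicago", "Mmmmiami", "Dallas", "Omaha"],
    "State": ["Illinois", "Florida", "Texas", "Nebraska"]})

# Boolean mask: True where the City of a df1 row occurs anywhere in df2["City"].
in_df2 = df1["City"].isin(df2["City"])

# Negate the mask to keep the df1 rows whose City is absent from df2.
dfdif_direct = df1[~in_df2]
print(dfdif_direct)
```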