I have a dataframe named all_combinations:
Name       Store_Code  Town    PostC  Revenue  Street
Starbucks  6890        Derby   post1  1        Street_1
Starbucks  6891        Derby          0.5      NaN
Starbucks  NaN         Derby   post6  NaN      Street_2
Starbucks  6892        Derby   post2  0.9      Street_3
Starbucks  6893        Derby   post3  2        Street_4
McDonalds  6890        Derby   post1  1        Street_1
McDonalds  8890        Derby   post4  2.8      Street_5
McDonalds  8890        London  post5  1.7      Street_6
McDonalds  NaN         London  post7  NaN      Street_7
McDonalds  8888        London         2        Street_7
and another dataframe called valid:
Name       Store_Code  Town    PostC  Revenue  Street
Starbucks  6890        Derby   post1  1        Street_1
Starbucks  6891        Derby          0.5      NaN
Starbucks  6892        Derby   post2  0.9      Street_3
Starbucks  6893        Derby   post3  2        Street_4
McDonalds  6890        Derby   post1  1        Street_1
McDonalds  8890        Derby   post4  2.8      Street_5
McDonalds  8890        London  post5  1.7      Street_6
Is there an elegant way to find the rows that differ between these two dataframes (in this case, the invalid rows), i.e.
Name       Store_Code  Town    PostC  Revenue  Street
Starbucks  NaN         Derby   post6  NaN      Street_2
McDonalds  NaN         London  post7  NaN      Street_7
McDonalds  8888        London         2        Street_7
Answer 0 (score: 1)
Not that elegant, but I think this should work: concat all_combinations and valid, then drop all duplicates:
In [11]: all_valid = pd.concat([all_combinations, valid])
In [12]: all_valid[~(all_valid.duplicated() | all_valid.duplicated(keep='last'))]
Out[12]:
        Name  Store    Town  PostC  Revenue    Street
2  Starbucks    NaN   Derby  post6      NaN  Street_2
8  McDonalds    NaN  London  post7      NaN  Street_7
9  McDonalds   8888  London    NaN        2  Street_7
The two .duplicated() calls are needed to drop both the first and the second occurrence of each duplicated row.
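A self-contained sketch of the concat-and-drop-duplicates idea, for anyone who wants to run it. It assumes a pandas version where duplicated() accepts keep=False (which marks every occurrence in a single call), that the blank PostC cells in the question are NaN, and that valid is an exact subset of the rows in all_combinations. The variable name invalid is only illustrative.

```python
import numpy as np
import pandas as pd

columns = ["Name", "Store_Code", "Town", "PostC", "Revenue", "Street"]

# Data from the question; blank PostC cells are assumed to be NaN.
all_combinations = pd.DataFrame(
    [
        ("Starbucks", 6890, "Derby", "post1", 1.0, "Street_1"),
        ("Starbucks", 6891, "Derby", np.nan, 0.5, np.nan),
        ("Starbucks", np.nan, "Derby", "post6", np.nan, "Street_2"),
        ("Starbucks", 6892, "Derby", "post2", 0.9, "Street_3"),
        ("Starbucks", 6893, "Derby", "post3", 2.0, "Street_4"),
        ("McDonalds", 6890, "Derby", "post1", 1.0, "Street_1"),
        ("McDonalds", 8890, "Derby", "post4", 2.8, "Street_5"),
        ("McDonalds", 8890, "London", "post5", 1.7, "Street_6"),
        ("McDonalds", np.nan, "London", "post7", np.nan, "Street_7"),
        ("McDonalds", 8888, "London", np.nan, 2.0, "Street_7"),
    ],
    columns=columns,
)

# The valid rows from the question are rows 0, 1, 3, 4, 5, 6, 7 above.
valid = all_combinations.iloc[[0, 1, 3, 4, 5, 6, 7]].reset_index(drop=True)

# Stack both frames, then keep only rows that occur exactly once.
all_valid = pd.concat([all_combinations, valid])
invalid = all_valid[~all_valid.duplicated(keep=False)]
print(invalid)
```

One caveat: if all_combinations itself contains the same row twice, that row is dropped as well, even when it is absent from valid.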
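The problem with using the (more elegant) all_combinations[~all_combinations.isin(valid).all(axis=1)] is that it also checks the index labels for equality (which I think is not what you want).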
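To make the index-label issue concrete, here is a small sketch on toy data (the frames left and right and their values are made up for illustration). When isin() is given a DataFrame, a cell only counts as a match if the same row label and column label hold the same value in the other frame. The second half shows one label-independent alternative, a left merge with indicator=True; this is a different technique from the one in the answer above.

```python
import pandas as pd

# Hypothetical toy frames: every row of `right` also exists in `left`,
# but at different index labels.
left = pd.DataFrame({"Name": ["a", "b", "c"], "Store_Code": [1, 2, 3]})
right = pd.DataFrame({"Name": ["a", "c"], "Store_Code": [1, 3]})  # labels 0, 1

# isin() with a DataFrame aligns on index and column labels first,
# so the ("c", 3) row at label 2 is not recognised as present in `right`.
mask = left.isin(right).all(axis=1)
print(mask.tolist())

# Alternative (not from the answer): merge on all shared columns and
# keep the rows that only occur in the left frame.
merged = left.merge(right, how="left", indicator=True)
missing = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(missing)
```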
Answer 1 (score: 0)
Yes. Something like this should work:
invalid = set(all_combinations.Store_Code) - set(valid.Store_Code)
all_combinations[all_combinations.Store_Code.isin(invalid) | all_combinations.Store_Code.isnull()]
This assumes that Store_Code is unique and that a np.nan Store_Code is invalid.
Using a numpy function:
invalid = np.setdiff1d(all_combinations.Store_Code, valid.Store_Code)
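A runnable sketch of the Store_Code approach on a reduced version of the data (only two columns and four rows, chosen for brevity). It drops NaN before calling np.setdiff1d so that missing codes are handled explicitly by isnull() rather than depending on how NaN compares; that step, and the names codes, invalid_codes and invalid_rows, are additions for illustration.

```python
import numpy as np
import pandas as pd

# Reduced, illustrative subset of the question's data.
all_combinations = pd.DataFrame({
    "Name": ["Starbucks", "Starbucks", "McDonalds", "McDonalds"],
    "Store_Code": [6890, np.nan, 8890, 8888],
})
valid = pd.DataFrame({
    "Name": ["Starbucks", "McDonalds"],
    "Store_Code": [6890, 8890],
})

codes = all_combinations["Store_Code"]

# Codes that appear in all_combinations but not in valid (NaN excluded).
invalid_codes = np.setdiff1d(codes.dropna(), valid["Store_Code"])

# A row is invalid if its code is unknown to `valid` or missing entirely.
invalid_rows = all_combinations[codes.isin(invalid_codes) | codes.isnull()]
print(invalid_rows)
```

As the answer notes, this only compares Store_Code, so it is appropriate only when the code alone identifies a row.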