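I am trying to get the difference between two dataframes. Specifically, I want to remove the sampled records from each dataframe and keep them in a separate dataframe. I followed the instructions in "Comparing two dataframes and getting the differences":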
import pandas as pd

train_abusive = pd.read_csv('train_abusive.csv', low_memory=False)
train_non_abusive = pd.read_csv('train_non_abusive.csv', low_memory=False)
print(len(train_abusive), len(train_non_abusive))
# Sample the validation rows.
val_abusive = train_abusive.sample(frac=0.1)
val_non_abusive = train_non_abusive.sample(frac=0.2)
# Concatenate the sample back and drop every row that occurs more than once.
train_abusive = pd.concat([val_abusive, train_abusive], ignore_index=True)
train_abusive = train_abusive.drop_duplicates(keep=False)
train_non_abusive = pd.concat([val_non_abusive, train_non_abusive], ignore_index=True)
train_non_abusive = train_non_abusive.drop_duplicates(keep=False)
print(len(train_abusive), len(train_non_abusive))
It gives the following output:
50000 200000
44596 155010
But the numbers don't add up: after removing 10% and 20% of the rows I expected 45000 and 160000. I'm not sure why.
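One possible explanation, assuming the CSV files already contain repeated rows (the question does not confirm this): `drop_duplicates(keep=False)` discards every copy of any row that occurs more than once, so rows duplicated in the original file disappear too, whether or not they were sampled. That would make the counts fall short of 45000 and 160000. The toy frame below is hypothetical and only illustrates the effect, together with an index-based split that avoids it.

```python
import pandas as pd

# Hypothetical toy data: the row "b" appears twice in the original frame.
train = pd.DataFrame({"text": ["a", "b", "b", "c", "d"]})

# Hold out one row that is unique in the original frame.
val = train.iloc[[0]]

# Approach from the question: concatenate, then drop every duplicated row.
combined = pd.concat([val, train], ignore_index=True)
remaining = combined.drop_duplicates(keep=False)

# "a" is removed as intended, but both "b" rows are removed as well.
print(len(train), len(val), len(remaining))  # 5 1 2

# Index-based alternative: drop exactly the sampled index labels.
remaining_by_index = train.drop(val.index)
print(len(remaining_by_index))  # 4
```

With the frames from the question, the same idea would be `train_abusive.drop(val_abusive.index)` right after sampling, which should leave exactly the unsampled rows as long as the index labels are unique.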
Answer 0 (score: 0)
Edited: if you only want to check whether the two dataframes are equal, you can use an assertion.
import pandas as pd
from pandas.testing import assert_frame_equal
train_abusive = pd.read_csv('train_abusive.csv', low_memory=False)
train_non_abusive = pd.read_csv('train_non_abusive.csv', low_memory=False)
# Raises AssertionError if the two frames differ; returns None otherwise.
assert_frame_equal(train_abusive, train_non_abusive)
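As a rough sketch of what this check does, using two small hypothetical frames: `assert_frame_equal` passes silently when the frames match and raises `AssertionError` otherwise, so it reports whether the frames differ rather than which rows differ.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Hypothetical frames for illustration only.
left = pd.DataFrame({"id": [1, 2], "text": ["a", "b"]})
same = left.copy()
different = pd.DataFrame({"id": [1, 3], "text": ["a", "c"]})

# Identical frames pass silently.
assert_frame_equal(left, same)

# Differing frames raise AssertionError with a description of the mismatch.
try:
    assert_frame_equal(left, different)
    frames_match = True
except AssertionError:
    frames_match = False

print(frames_match)  # False
```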
I also came across an answer by Tom Chapin in another post that you may find useful.
def get_different_rows(train_abusive, train_non_abusive):
    """Returns just the rows from the new dataframe that differ from the source dataframe"""
    merged_df = train_abusive.merge(train_non_abusive, indicator=True, how='outer')
    changed_rows_df = merged_df[merged_df['_merge'] == 'right_only']
    return changed_rows_df.drop('_merge', axis=1)
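To illustrate how the merge indicator separates rows, here is a minimal sketch on two hypothetical frames (`source_df` and `new_df` are made-up names and data, not from the question). An outer merge with `indicator=True` adds a `_merge` column that records which side each row came from.

```python
import pandas as pd

# Hypothetical source and new frames for illustration only.
source_df = pd.DataFrame({"id": [1, 2, 3], "text": ["a", "b", "c"]})
new_df = pd.DataFrame({"id": [2, 3, 4], "text": ["b", "c", "d"]})

# With no `on` argument the merge uses all shared columns as keys.
# indicator=True adds "_merge" with values "left_only", "right_only", "both".
merged_df = source_df.merge(new_df, indicator=True, how="outer")

# Rows present only in new_df.
only_in_new = merged_df[merged_df["_merge"] == "right_only"].drop("_merge", axis=1)

# Rows present only in source_df.
only_in_source = merged_df[merged_df["_merge"] == "left_only"].drop("_merge", axis=1)

print(only_in_new["id"].tolist())     # [4]
print(only_in_source["id"].tolist())  # [1]
```

Because the comparison is by row values, a sampled row that has identical copies in the other frame marks all of those copies as `both`, so this approach can still undercount when the data contains repeated rows.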