I have two very large dataframes, df1 and df2. Their sizes are as follows:
print(df1.shape) #444500 x 3062
print(df2.shape) #254232 x 3062
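I know that every value in df2 also appears in df1. What I want to do is build a third dataframe that is the difference between the two, meaning all the rows that appear in df1 and do not appear in df2.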
I tried the following approach from this question:
df3 = (pd.merge(df2, df1, indicator=True, how='outer')
         .query('_merge=="left_only"').drop('_merge', axis=1))
However, with this I keep running into a MemoryError.
So I am now trying the following. What matters to me is that rows are equal row-wise, i.e. all elements match pairwise:
[1,2,3]
[1,2,3]
is a match, whereas:
[1,2,3]
[1,3,2]
is not a match.
I am trying:
for i in notebook.tqdm(range(svm_data.shape[0])):
    real_row = np.asarray(real_data.iloc[[i]].to_numpy())
    synthetic_row = np.asarray(svm_data.iloc[[i]].to_numpy())
    if (np.array_equal(real_row, synthetic_row)):
        continue
    else:
        list_of_rows.append(list(synthetic_row))
    gc.collect()
But for some reason this does not find the values in the rows themselves, so I am clearly still doing something wrong.
Note that I also tried:
df3 = df1[~df1.isin(df2)].dropna(how='all')
But the result is incorrect.
How can I find, in a memory-efficient way, all such rows in my dataframes?
Data
df1:
1,0,0.0,0,0,0,0,0,0.0,2
1,0,0.0,0,0,0,0,0,0.0,2
1,0,0.0,0,0,0,0,0,0.0,4
1,0,0.0,0,0,0,0,0,0.0,2
1,0,0.0,0,0,0,0,0,0.0,8
1,0,0.0,0,0,0,0,0,0.0,8
1,0,0.0,0,0,0,0,0,0.0,8
1,0,0.0,0,0,0,0,0,0.0,4
1,0,0.0,0,0,0,0,0,0.0,4
1,0,0.0,0,0,0,0,0,0.0,2
df2:
1,0,0.0,0,0,0,0,0,0.0,2
1,0,0.0,0,0,0,0,0,0.0,3
1,0,0.0,0,0,0,0,0,0.0,4
1,0,0.0,0,0,0,0,0,2.0,2
1,0,0.0,0,0,0,0,0,0.0,8
1,0,0.0,0,0,1,0,0,0.0,8
1,0,0.0,0,0,0,0,0,0.0,8
1,0,0.0,0,0,0,0,0,0.0,4
1,0,0.0,0,0,0,0,0,0.0,4
1,0,0.0,5,0,0,0,0,0.0,4
Answer 0 (score: 1)
Let's try concat and groupby to identify the duplicated rows:
import pandas as pd

# sample data
df1 = pd.DataFrame([[1,2,3],[1,2,3],[4,5,6],[7,8,9]])
df2 = pd.DataFrame([[4,5,6],[7,8,9]])
# give every distinct row a group number across both frames
s = (pd.concat((df1,df2), keys=(1,2))
       .groupby(list(df1.columns))
       .ngroup()
    )
# `s.loc[1]` corresponds to rows in df1
# `s.loc[2]` corresponds to rows in df2
df1_in_df2 = s.loc[1].isin(s.loc[2])
df1[df1_in_df2]
Output:
   0  1  2
2  4  5  6
3  7  8  9
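The mask above selects the rows of df1 that also occur in df2. Since the question asks for the opposite (rows that occur only in df1), one possible completion is to invert the mask. This is a minimal sketch on the same sample data; the name `only_in_df1` is hypothetical and not part of the original answer.

```python
import pandas as pd

# Same sample data as in the answer.
df1 = pd.DataFrame([[1, 2, 3], [1, 2, 3], [4, 5, 6], [7, 8, 9]])
df2 = pd.DataFrame([[4, 5, 6], [7, 8, 9]])

# Give every distinct row a group number across both frames.
s = (pd.concat((df1, df2), keys=(1, 2))
       .groupby(list(df1.columns))
       .ngroup())

# True where a df1 row also occurs somewhere in df2.
df1_in_df2 = s.loc[1].isin(s.loc[2])

# Invert the mask to keep the rows that occur only in df1
# (`only_in_df1` is a hypothetical name used for illustration).
only_in_df1 = df1[~df1_in_df2]
print(only_in_df1)
```

On this sample data the result should contain the two [1, 2, 3] rows at index 0 and 1, with the original df1 index preserved.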
Update: another option is to merge on a de-duplicated df2:
df1.merge(df2.drop_duplicates(), on=list(df1.columns), indicator=True, how='left')
Output (you should be able to work out the rows you need from there):
   0  1  2     _merge
0  1  2  3  left_only
1  1  2  3  left_only
2  4  5  6       both
3  7  8  9       both
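To spell out that last step, the rows marked left_only are the ones that exist in df1 but not in df2. The following is a sketch on the sample data, not a tested solution for the full 444500 x 3062 frames; `merged` and `only_in_df1` are hypothetical names introduced here.

```python
import pandas as pd

# Same sample data as in the answer.
df1 = pd.DataFrame([[1, 2, 3], [1, 2, 3], [4, 5, 6], [7, 8, 9]])
df2 = pd.DataFrame([[4, 5, 6], [7, 8, 9]])

# Left join against de-duplicated df2: each df1 row matches at most
# one df2 row, so the result has exactly one row per row of df1.
merged = df1.merge(df2.drop_duplicates(), on=list(df1.columns),
                   indicator=True, how='left')

# Keep the rows found only in df1 and drop the helper column.
only_in_df1 = (merged[merged['_merge'] == 'left_only']
               .drop(columns='_merge'))
print(only_in_df1)
```

Compared with the outer merge in the question, a left join never adds rows that exist only in df2, and de-duplicating df2 first keeps duplicated keys from multiplying rows, which should help with memory. Whether that is enough at the full size is not something this sketch verifies.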