Converting two dataframes to numpy arrays for pairwise comparison

Time: 2020-06-30 19:12:26

Tags: python pandas dataframe memory

I have two very large dataframes, df1 and df2. Their sizes are as follows:

print(df1.shape) #444500 x 3062
print(df2.shape) #254232 x 3062

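I know that every value of df2 appears in df1. What I want to do is build a third dataframe that is the difference between the two, that is, all the rows that appear in df1 and do not appear in df2.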

I tried the following approach, taken from this question:

df3 = (pd.merge(df2,df1, indicator=True, how='outer')
            .query('_merge=="left_only"').drop('_merge', axis=1))

However, this keeps failing with a MemoryError.

So I am now trying to do the following:

  1. Iterate over each row of df1
  2. Check whether that row appears in df2
  3. If it does, skip it
  4. If it does not, add it to a list
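
The four steps above can be sketched roughly as follows. This is only an illustration on tiny made-up frames: `rows_not_in` is a hypothetical helper name, and it assumes both frames share the same columns in the same order and contain no NaN values (NaN never compares equal to itself). Holding every row of a very wide df2 as a Python tuple may itself be memory-hungry, so treat it as a statement of the logic rather than a tuned solution.

```python
import pandas as pd

def rows_not_in(df1: pd.DataFrame, df2: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: rows of df1 that do not appear anywhere in df2."""
    # Build a set of df2's rows once, so each lookup is O(1) on average.
    seen = set(df2.itertuples(index=False, name=None))
    # Steps 1-4: walk over df1 and flag a row only if it is absent from df2.
    keep = [row not in seen for row in df1.itertuples(index=False, name=None)]
    return df1[keep]

# Made-up sample data, not the real frames from the question.
df1 = pd.DataFrame([[1, 2, 3], [1, 3, 2], [4, 5, 6]])
df2 = pd.DataFrame([[1, 2, 3]])

result = rows_not_in(df1, df2)
print(result)
```

Because whole rows are compared as ordered tuples, [1,2,3] and [1,3,2] are treated as different rows.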

What I care about is row-wise equality, meaning that all elements match pairwise, position by position. For example:

[1,2,3]
[1,2,3]

is a match, whereas:

[1,2,3]
[1,3,2]

is not a match.
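
This position-by-position notion of equality is what `np.array_equal` tests. A minimal illustration with made-up arrays (the variable names are placeholders):

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([1, 2, 3])
c = np.array([1, 3, 2])

# Same values in the same positions: counts as a match.
same_order = np.array_equal(a, b)
# Same values but in a different order: not a match.
swapped = np.array_equal(a, c)

print(same_order, swapped)
```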

This is what I am trying:

for i in notebook.tqdm(range(svm_data.shape[0])):
    real_row = np.asarray(real_data.iloc[[i]].to_numpy())
    synthetic_row = np.asarray(svm_data.iloc[[i]].to_numpy())
    if (np.array_equal(real_row, synthetic_row)):
        continue
    else:
        list_of_rows.append(list(synthetic_row))
    gc.collect()

But for some reason this does not find the matching rows, so I am clearly still doing something wrong.

Note that I have also tried: df3 = df1[~df1.isin(df2)].dropna(how='all')

but the result was incorrect.
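
A likely reason, as far as I understand the documented behaviour of `DataFrame.isin`: when it is given another DataFrame, it compares cell by cell at matching index and column labels, rather than asking whether a whole row occurs anywhere in the other frame. A small sketch on made-up data:

```python
import pandas as pd

# Made-up frames: the row [4, 5, 6] exists in both, but at different index labels.
df1 = pd.DataFrame([[1, 2, 3], [4, 5, 6]])
df2 = pd.DataFrame([[4, 5, 6]])

# Row 0 of df1 is compared with row 0 of df2; row 1 of df1 has no partner in df2.
mask = df1.isin(df2)
print(mask)
```

Every cell of `mask` is False here, even though [4,5,6] is present in both frames, so this mask cannot be used for a row-membership test.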

How can I find, in a memory-efficient way, all the rows of df1 that do not appear in df2?

Data

df1:

1,0,0.0,0,0,0,0,0,0.0,2
1,0,0.0,0,0,0,0,0,0.0,2
1,0,0.0,0,0,0,0,0,0.0,4
1,0,0.0,0,0,0,0,0,0.0,2
1,0,0.0,0,0,0,0,0,0.0,8
1,0,0.0,0,0,0,0,0,0.0,8
1,0,0.0,0,0,0,0,0,0.0,8
1,0,0.0,0,0,0,0,0,0.0,4
1,0,0.0,0,0,0,0,0,0.0,4
1,0,0.0,0,0,0,0,0,0.0,2

df2:

1,0,0.0,0,0,0,0,0,0.0,2
1,0,0.0,0,0,0,0,0,0.0,3
1,0,0.0,0,0,0,0,0,0.0,4
1,0,0.0,0,0,0,0,0,2.0,2
1,0,0.0,0,0,0,0,0,0.0,8
1,0,0.0,0,0,1,0,0,0.0,8
1,0,0.0,0,0,0,0,0,0.0,8
1,0,0.0,0,0,0,0,0,0.0,4
1,0,0.0,0,0,0,0,0,0.0,4
1,0,0.0,5,0,0,0,0,0.0,4

1 Answer:

Answer 0 (score: 1)

Let's try concat and groupby to identify the duplicate rows:

import pandas as pd

# sample data
df1 = pd.DataFrame([[1,2,3],[1,2,3],[4,5,6],[7,8,9]])
df2 = pd.DataFrame([[4,5,6],[7,8,9]])

s = (pd.concat((df1,df2), keys=(1,2))
       .groupby(list(df1.columns))
       .ngroup()
    )

# `s.loc[1]` corresponds to rows in df1
# `s.loc[2]` corresponds to rows in df2
df1_in_df2 = s.loc[1].isin(s.loc[2])

df1[df1_in_df2]

Output:

   0  1  2
2  4  5  6
3  7  8  9
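
The output above lists the rows of df1 that are also in df2. Since the question asks for the opposite set, one possible follow-up is to negate the mask. The sketch below repeats the same sample data, with made-up column names a, b, c so it runs on its own:

```python
import pandas as pd

# Same sample values as above; the column names are made up for this sketch.
df1 = pd.DataFrame([[1, 2, 3], [1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=list("abc"))
df2 = pd.DataFrame([[4, 5, 6], [7, 8, 9]], columns=list("abc"))

# Give every distinct row a group number shared across both frames.
s = (pd.concat((df1, df2), keys=(1, 2))
       .groupby(list(df1.columns))
       .ngroup())

# True where a df1 row's group number also occurs among df2's rows.
df1_in_df2 = s.loc[1].isin(s.loc[2])

# Negate the mask to keep only the rows of df1 that are absent from df2.
df1_not_in_df2 = df1[~df1_in_df2]
print(df1_not_in_df2)
```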

Update: another option is to merge on the de-duplicated df2:

df1.merge(df2.drop_duplicates(), on=list(df1.columns), indicator=True, how='left')

Output (you should be able to work out the rows you need from there):

   0  1  2     _merge
0  1  2  3  left_only
1  1  2  3  left_only
2  4  5  6       both
3  7  8  9       both
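
To make that last step explicit, one way to pull out the rows found only in df1 is to filter on the `_merge` column and then drop it. A sketch on the same sample values, with made-up column names:

```python
import pandas as pd

# Same sample values as above; the column names are made up for this sketch.
df1 = pd.DataFrame([[1, 2, 3], [1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=list("abc"))
df2 = pd.DataFrame([[4, 5, 6], [7, 8, 9]], columns=list("abc"))

# Left merge on all columns; indicator=True adds the `_merge` column.
merged = df1.merge(df2.drop_duplicates(), on=list(df1.columns),
                   indicator=True, how="left")

# Keep rows found only in df1, then drop the helper column.
only_in_df1 = merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")
print(only_in_df1)
```

Note that merge returns a fresh 0..n-1 index, so the original index labels of df1 are not preserved in `only_in_df1`.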