A faster way to compare the differences between 2 similar data frames

Date: 2019-07-25 15:52:06

Tags: python pandas dataframe

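This is a follow-up to my previous question: How to fetch the modified rows after comparing 2 versions of same data frame

I have now handled the modifications, but I am using the approach below to find the inserts and deletes. It works correctly, but it takes a long time, typically on CSV files with 10 columns and 10M rows.

For my problem, an INSERT is a record that is in the new file but not in the old file. A DELETE is a record that is in the old file but not in the new file.

Here is the code: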

import pandas as pd

def getInsDel(df_old,df_new,key):
    # Concatenate old and new data to generate comparisons
    df = pd.concat([df_new,df_old])
    df = df.reset_index(drop=True)


    #doing a group by for getting the frequency of each key
    print('Grouping data for frequency of key...')
    df_gpby = df.groupby(list(df.columns))
    idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]
    df_delta = df.reindex(idx)
    df_delta_freq = df_delta.groupby(key).size().reset_index(name='Freq')

    # Keep only keys with frequency = 1, since these are the target records for DELETE and INSERT
    print('Creating data frame to get records with Frequency = 1  ...')
    # Boolean indexing replaces where() + dropna() and avoids shadowing the built-in filter()
    is_single = df_delta_freq['Freq'] == 1
    df_delta_freq_ins_del = df_delta_freq[is_single]


    print('Creating data frames of Insert and Deletes  ...')
    #Creating INSERT dataFrame 
    df_ins = pd.merge(df_new, 
                     df_delta_freq_ins_del[key],
                     on = key,
                     how = 'inner'
                    )

    #Creating DELETE dataFrame
    df_del = pd.merge(df_old, 
                     df_delta_freq_ins_del[key],
                     on = key,
                     how = 'inner'
                    )

    print('size of INSERT file: ' + str(df_ins.shape))
    print('size of DELETE file: ' + str(df_del.shape))


    return df_ins,df_del

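The part where I group by to get the frequency of each key takes about 80% of the total time, so for such a CSV it takes about 12-15 minutes.

Surely there must be a faster way?

For reference, here is the result I expect:

For example, the old data is: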

ID  Name  X  Y
1   ABC   1  2
2   DEF   2  3
3   HIJ   3  4

and the new dataset is:

ID  Name   X   Y
2   DEF    2   3
3   HIJ    55  42
4   KLM    4   5

where ID is the key.

Insert_DataFrame should be:

ID   Name   X   Y
4    KLM    4   5

Deleted_DataFrame should be:

ID   Name   X   Y
1    ABC    1   2
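
For a single key column, the expected result above can be reproduced without any grouping. The following is a minimal sketch, assuming the key is the one column ID and that it uniquely identifies a record; the names old, new, inserted and deleted are illustrative only and do not come from the code above.

```python
import pandas as pd

# Sample data copied from the example tables above
old = pd.DataFrame({
    "ID": [1, 2, 3],
    "Name": ["ABC", "DEF", "HIJ"],
    "X": [1, 2, 3],
    "Y": [2, 3, 4],
})
new = pd.DataFrame({
    "ID": [2, 3, 4],
    "Name": ["DEF", "HIJ", "KLM"],
    "X": [2, 55, 4],
    "Y": [3, 42, 5],
})

# Inserts: rows whose key appears in new but not in old
inserted = new[~new["ID"].isin(old["ID"])]

# Deletes: rows whose key appears in old but not in new
deleted = old[~old["ID"].isin(new["ID"])]

print(inserted)
print(deleted)
```

Series.isin only looks at the key values, so the modified row with ID 3 is ignored here, which matches the definition of INSERT and DELETE given earlier. Whether this is faster on a 10M-row file would need to be measured on the real data.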

1 answer:

Answer 0 (score: 0)

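To get the deleted rows:

import pandas as pd

# Left-join old onto new; rows marked 'left_only' exist only in old
delete = pd.merge(old, new, how='left', on='ID', indicator=True)
delete = delete.loc[delete['_merge'] == 'left_only']
# Drop the columns from new, which are entirely NaN for these rows
delete = delete.dropna(axis=1)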

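To get the inserted rows:

# Left-join new onto old; rows marked 'left_only' exist only in new
insert = pd.merge(new, old, how='left', on='ID', indicator=True)
insert = insert.loc[insert['_merge'] == 'left_only']
# Drop the columns from old, which are entirely NaN for these rows
insert = insert.dropna(axis=1)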
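
Merging the full frames on ID leaves the surviving columns with an _x suffix plus the _merge indicator column. One possible variation, sketched below, merges against only the key columns of the other frame, so the original column names are preserved and a composite key also works. The function name get_ins_del and its signature are hypothetical, chosen to mirror getInsDel from the question; this is a sketch under those assumptions, not a benchmarked solution.

```python
import pandas as pd

def get_ins_del(df_old, df_new, key):
    """Return (inserts, deletes) via anti-joins on the key column(s)."""
    # Accept either a single column name or a list of column names
    keys = [key] if isinstance(key, str) else list(key)

    # Only the de-duplicated key columns of the other frame are merged in,
    # so no suffixed duplicate columns are created
    old_keys = df_old[keys].drop_duplicates()
    new_keys = df_new[keys].drop_duplicates()

    # Rows of new whose key has no match in old
    ins = df_new.merge(old_keys, on=keys, how="left", indicator=True)
    ins = ins[ins["_merge"] == "left_only"].drop(columns="_merge")

    # Rows of old whose key has no match in new
    dele = df_old.merge(new_keys, on=keys, how="left", indicator=True)
    dele = dele[dele["_merge"] == "left_only"].drop(columns="_merge")

    return ins.reset_index(drop=True), dele.reset_index(drop=True)

# Usage with the sample data from the question
old = pd.DataFrame({"ID": [1, 2, 3], "Name": ["ABC", "DEF", "HIJ"],
                    "X": [1, 2, 3], "Y": [2, 3, 4]})
new = pd.DataFrame({"ID": [2, 3, 4], "Name": ["DEF", "HIJ", "KLM"],
                    "X": [2, 55, 4], "Y": [3, 42, 5]})

df_ins, df_del = get_ins_del(old, new, "ID")
print(df_ins)
print(df_del)
```

The sketch assumes neither input already has a column named _merge, since the indicator option adds one during the merge.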