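This is a follow-up to my previous question: How to fetch the modified rows after comparing 2 versions of same data frame
I now have the modifications done. However, I am using the method below to find the inserts and deletes. It works correctly but takes a long time, typically for CSV files with 10 columns and 10M rows.
For my problem, an INSERT is a record that is not in the old file but is in the new file. A DELETE is a record that is in the old file but not in the new file.
Here is the code: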
import pandas as pd

def getInsDel(df_old, df_new, key):
    # Concatenating old and new data to generate comparisons
    df = pd.concat([df_new, df_old])
    df = df.reset_index(drop=True)
    # Grouping on all columns to keep only rows that occur exactly once
    print('Grouping data for frequency of key...')
    df_gpby = df.groupby(list(df.columns))
    idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]
    df_delta = df.reindex(idx)
    df_delta_freq = df_delta.groupby(key).size().reset_index(name='Freq')
    # Filtering data for frequency = 1, since these will be the target records for DELETE and INSERT
    print('Creating data frame to get records with Frequency = 1 ...')
    filter = df_delta_freq['Freq'] == 1
    df_delta_freq_ins_del = df_delta_freq.where(filter)
    # Dropping rows with NULL
    df_delta_freq_ins_del = df_delta_freq_ins_del.dropna()
    print('Creating data frames of Insert and Deletes ...')
    # Creating INSERT dataFrame
    df_ins = pd.merge(df_new,
                      df_delta_freq_ins_del[key],
                      on=key,
                      how='inner'
                      )
    # Creating DELETE dataFrame
    df_del = pd.merge(df_old,
                      df_delta_freq_ins_del[key],
                      on=key,
                      how='inner'
                      )
    print('size of INSERT file: ' + str(df_ins.shape))
    print('size of DELETE file: ' + str(df_del.shape))
    return df_ins, df_del
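The part where I group to get the frequency of each key takes around 80% of the total time, so for my CSV it takes around 12-15 minutes.
There must be a faster way to do this?
For reference, here is what I expect from the result:
For example, the old data is: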
ID Name X Y
1 ABC 1 2
2 DEF 2 3
3 HIJ 3 4
and the new data set is:
ID Name X Y
2 DEF 2 3
3 HIJ 55 42
4 KLM 4 5
where ID is the key.
The Insert_DataFrame should be:
ID Name X Y
4 KLM 4 5
The Deleted_DataFrame should be:
ID Name X Y
1 ABC 1 2
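The definitions above can be written down directly as set differences on the key. The following is only a minimal sketch using the sample data from this question; the variable names (`inserted_ids`, `deleted_ids`, `insert_df`, `delete_df`) are made up for illustration, and it assumes a single key column.

```python
import pandas as pd

# Sample data copied from the tables above.
old = pd.DataFrame({"ID": [1, 2, 3], "Name": ["ABC", "DEF", "HIJ"],
                    "X": [1, 2, 3], "Y": [2, 3, 4]})
new = pd.DataFrame({"ID": [2, 3, 4], "Name": ["DEF", "HIJ", "KLM"],
                    "X": [2, 55, 4], "Y": [3, 42, 5]})

# Keys present only in the new data are inserts;
# keys present only in the old data are deletes.
inserted_ids = set(new["ID"]) - set(old["ID"])
deleted_ids = set(old["ID"]) - set(new["ID"])

# Select the full rows for those keys.
insert_df = new[new["ID"].isin(inserted_ids)]
delete_df = old[old["ID"].isin(deleted_ids)]

print(insert_df)
print(delete_df)
```

Note that ID 3 appears in neither result: its values changed, so it is a modification, not an insert or a delete.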
Answer 0 (score: 0)
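import pandas as pd

# Rows whose ID appears only in old are deletes.
delete = pd.merge(old, new, how='left', on='ID', indicator=True)
delete = delete.loc[delete['_merge'] == 'left_only'].copy()
# Drop the columns from the other frame, which are all NaN for these rows.
delete.dropna(axis=1, inplace=True)
# Rows whose ID appears only in new are inserts.
insert = pd.merge(new, old, how='left', on='ID', indicator=True)
insert = insert.loc[insert['_merge'] == 'left_only'].copy()
insert.dropna(axis=1, inplace=True)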
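Merging the full frames on ID leaves suffixed columns such as Name_x and a _merge column in the result. One possible variation, sketched here rather than taken from the answer, is to merge against the key columns only, so the output keeps the original column names and the key may also be a list of columns. The helper name `anti_join` is hypothetical, and the sketch assumes the frames have no column already named `_merge`.

```python
import pandas as pd

def anti_join(left, right, key):
    """Return rows of `left` whose key does not occur in `right` (hypothetical helper)."""
    keys = [key] if isinstance(key, str) else list(key)
    # Merge against the de-duplicated key columns only, so no *_x / *_y columns
    # are created and rows of `left` are not multiplied.
    right_keys = right[keys].drop_duplicates()
    merged = left.merge(right_keys, on=keys, how="left", indicator=True)
    # Keep rows found only in `left`, restricted to the original columns.
    only_left = merged.loc[merged["_merge"] == "left_only", left.columns]
    return only_left.reset_index(drop=True)

# Sample data from the question.
old = pd.DataFrame({"ID": [1, 2, 3], "Name": ["ABC", "DEF", "HIJ"],
                    "X": [1, 2, 3], "Y": [2, 3, 4]})
new = pd.DataFrame({"ID": [2, 3, 4], "Name": ["DEF", "HIJ", "KLM"],
                    "X": [2, 55, 4], "Y": [3, 42, 5]})

df_ins = anti_join(new, old, "ID")   # in new but not in old
df_del = anti_join(old, new, "ID")   # in old but not in new
print(df_ins)
print(df_del)
```

Whether this is faster than the groupby approach on a 10M-row file would need to be measured on the real data.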