Question

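I have a database table that an external process automatically inserts rows into every so often. Because of the nature of the data, I need to stay vigilant to detect any possible "duplicates" (really just similar rows, since only certain columns matter) and delete them. I plan to use a database query to pull all the similar rows into a dataframe, then sort it and create a subset dataframe of the rows to "keep". The idea is to use a left join from the original dataframe to the 'keep' dataframe, plus a boolean condition, to identify every row that needs to be deleted, following "pandas get rows which are NOT in other dataframe". Can you tell me whether I am on the right track? I want to be very careful with any logic that deletes records from the database.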

The original dataframe looks like this:

   ID Account Type  Date       RowID
0  12  GOB     H    11/12/18   Az123
1  12  GOB     H    11/12/18   Az125
2  12  JPG     H    11/15/18   Az175
3  12  JPG     H    11/17/18   Az189
4  15  BLU     H    11/1/18    Ax127
5  15  BLU     D    11/18/18   Ax135
6  15  BLU     H    11/8/18    Ax175

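It is important to keep exactly one record per ID/Account combination, preferring Type D and then the row with the earliest Date. The desired keep subset is shown below.

Desired keep subset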

   ID Account Type  Date       RowID
0  15  BLU     D    11/18/18   Ax135
1  12  GOB     H    11/12/18   Az123
2  12  JPG     H    11/15/18   Az175

Code: edited with help from W-B

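import pandas as pd

# read_sql needs the open connection as well as the query
df = pd.read_sql(similar_rows_sql, connection)
# parse Date so it sorts chronologically rather than as text
df['Date'] = pd.to_datetime(df['Date'])
df['helpkey'] = df['Type'].eq('D')
# one sort call: Type D first, then earliest Date (two chained sorts are not guaranteed stable)
keep_df = df.sort_values(['helpkey', 'Date'], ascending=[False, True]).drop_duplicates(
          ['ID', 'Account'], keep='first')
# rows present in df but absent from keep_df are marked 'left_only'
df_all = df.merge(keep_df, how='left', indicator=True)
df_remove = df_all.loc[df_all['_merge'] == 'left_only']
for x in df_remove['RowID']:
    # pass the parameter as a one-element tuple
    cursor.execute(remove_duplicate_sql, (x,))
connection.commit()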
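The left-join step above can be cross-checked on the sample data before anything is deleted. The sketch below is only an illustration: it selects the desired keep subset by RowID, compares the merge-with-indicator result against a simpler membership test on RowID, and asserts a few invariants. It assumes RowID is unique per row, which the sample suggests but the question does not state.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    'ID':      [12, 12, 12, 12, 15, 15, 15],
    'Account': ['GOB', 'GOB', 'JPG', 'JPG', 'BLU', 'BLU', 'BLU'],
    'Type':    ['H', 'H', 'H', 'H', 'H', 'D', 'H'],
    'Date':    ['11/12/18', '11/12/18', '11/15/18', '11/17/18',
                '11/1/18', '11/18/18', '11/8/18'],
    'RowID':   ['Az123', 'Az125', 'Az175', 'Az189', 'Ax127', 'Ax135', 'Ax175'],
})

# The desired keep subset from the question, selected here by RowID.
keep_df = df[df['RowID'].isin(['Ax135', 'Az123', 'Az175'])]

# Anti-join 1: left merge with indicator, as in the question's code.
merged = df.merge(keep_df, how='left', indicator=True)
remove_by_merge = merged.loc[merged['_merge'] == 'left_only', 'RowID']

# Anti-join 2: membership test on the (assumed unique) RowID only.
remove_by_isin = df.loc[~df['RowID'].isin(keep_df['RowID']), 'RowID']

# Sanity checks to run before issuing any DELETE.
assert set(remove_by_merge) == set(remove_by_isin)
assert len(keep_df) + len(remove_by_isin) == len(df)
assert not keep_df.duplicated(['ID', 'Account']).any()
# Every ID/Account combination must still be represented in keep_df.
assert (set(zip(keep_df['ID'], keep_df['Account']))
        == set(zip(df['ID'], df['Account'])))

print(sorted(remove_by_isin))  # ['Ax127', 'Ax175', 'Az125', 'Az189']
```

Comparing on RowID alone avoids relying on every column matching exactly between the two dataframes, which is what a merge without an explicit `on` argument does.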

The Type problem was resolved thanks to W-B.

The only thing left is whether this logic is Pythonic and does what I intend. Can anyone reassure me that it is correct?
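For the deletion step itself, one cautious pattern is to run all the deletes inside a single transaction and commit only if the number of affected rows matches what was expected. The sketch below uses an in-memory sqlite3 database as a stand-in, because the question does not say which database or driver is in use. The table name `records`, its columns, and the text of `remove_duplicate_sql` are all hypothetical.

```python
import sqlite3

# In-memory stand-in for the real database (an assumption for illustration).
connection = sqlite3.connect(':memory:')
cursor = connection.cursor()
cursor.execute(
    'CREATE TABLE records (RowID TEXT PRIMARY KEY, ID INTEGER, Account TEXT)')
cursor.executemany(
    'INSERT INTO records VALUES (?, ?, ?)',
    [('Az123', 12, 'GOB'), ('Az125', 12, 'GOB'), ('Az175', 12, 'JPG'),
     ('Az189', 12, 'JPG'), ('Ax127', 15, 'BLU'), ('Ax135', 15, 'BLU'),
     ('Ax175', 15, 'BLU')],
)
connection.commit()

# Hypothetical delete statement; '?' is sqlite3's placeholder style.
remove_duplicate_sql = 'DELETE FROM records WHERE RowID = ?'
rows_to_remove = ['Az125', 'Az189', 'Ax127', 'Ax175']

try:
    cursor.executemany(remove_duplicate_sql, [(r,) for r in rows_to_remove])
    # Refuse to commit if the number of deleted rows is unexpected.
    if cursor.rowcount != len(rows_to_remove):
        raise RuntimeError('unexpected number of rows deleted')
    connection.commit()
except Exception:
    # Undo every delete from this batch, then surface the error.
    connection.rollback()
    raise

remaining = [row[0] for row in
             cursor.execute('SELECT RowID FROM records ORDER BY RowID')]
print(remaining)  # ['Ax135', 'Az123', 'Az175']
```

Other drivers use different placeholder styles (for example `%s`), so the SQL text would need adjusting for the real database.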

Answer 1

Use a helper key

# helpkey is True where Type is 'D'; an ascending sort puts the D rows last within each ID
df['helpkey'] = df['Type'].eq('D')

df['Date'] = pd.to_datetime(df['Date'])
# keep='last' retains the final row of each (ID, Account) pair after sorting
df.sort_values(['ID', 'helpkey', 'Date']).drop_duplicates(['ID', 'Account'], keep='last')
Out[163]: 
   ID Account Type       Date  RowID  helpkey
1  12     GOB    H 2018-11-12  Az125    False
3  12     JPG    H 2018-11-17  Az189    False
5  15     BLU    D 2018-11-18  Ax135     True
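Because Date is sorted ascending, keep='last' retains the latest-dated row in each group. That is why the output above keeps Az189 (11/17) for JPG, whereas the question's desired subset keeps Az175 (11/15). A minimal sketch of a variant that prefers Type D and then the earliest Date follows. Using RowID as a final tie-breaker is an assumption added here so that rows with identical dates (Az123 and Az125) resolve deterministically; the question does not specify a tie-break rule.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    'ID':      [12, 12, 12, 12, 15, 15, 15],
    'Account': ['GOB', 'GOB', 'JPG', 'JPG', 'BLU', 'BLU', 'BLU'],
    'Type':    ['H', 'H', 'H', 'H', 'H', 'D', 'H'],
    'Date':    ['11/12/18', '11/12/18', '11/15/18', '11/17/18',
                '11/1/18', '11/18/18', '11/8/18'],
    'RowID':   ['Az123', 'Az125', 'Az175', 'Az189', 'Ax127', 'Ax135', 'Ax175'],
})
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%y')
df['helpkey'] = df['Type'].eq('D')

keep_df = (
    # helpkey descending puts Type D first; Date ascending puts the earliest
    # first; RowID (an assumed tie-breaker) orders rows with equal dates.
    df.sort_values(['helpkey', 'Date', 'RowID'], ascending=[False, True, True])
      .drop_duplicates(['ID', 'Account'], keep='first')
)

print(keep_df['RowID'].tolist())  # ['Ax135', 'Az123', 'Az175']
```

On the sample data this reproduces the desired keep subset from the question, in the same order.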

Pandas: identify and remove similar/duplicate rows by condition
