Below is the sample data I am working with.
sender receiver date id
salman akhtar 20161201 1111
akhtar salman 20161201 1112
nabeel ahmed 20161201 1113
salman akhtar 20161201 1114
salman akhtar 20161202 1115
nabeel ahmed 20161202 1116
ahmed nabeel 20161202 1117
nabeel ahmed 20161202 1118
nabeel ahmed 20161202 1119
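What I want to achieve is to find duplicate entries, where a duplicate means the same sender and the same receiver on the same date.
For that, I wrote the following code.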
import pandas as pd
import xlsxwriter
print 'Script for Finding duplicate entries\n'
path = raw_input('Enter file name: ')
print 'Loading file. Please wait...'
xlsx = pd.ExcelFile(path+'.xlsx')
print 'File loaded successfully.\n'
sheet = raw_input('Enter Sheet Name: ')
df = pd.read_excel(xlsx, sheet)
df['is_duplicated'] = df.duplicated(['sender', 'receiver','date'],keep=False)
df_dup = df.loc[df['is_duplicated'] == True]
print 'Found Below Duplicates'
print df_dup
writer = pd.ExcelWriter("pandas_column_formats.xlsx", engine='xlsxwriter')
df_dup.to_excel(writer, sheet_name='Sheet1')
writer.save()
print 'File created successfully.'
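Now I want to incorporate fuzzywuzzy, because the current code only returns EXACT duplicates, and I want all possibly duplicated rows based on the condition above.
Can anyone help?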
Answer 0 (score: 0)
Something like this?
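>>> from fuzzywuzzy import fuzz
>>> fuzz_ratio = 50
>>> # Rows that are not already exact duplicates
>>> df_rem = df[~df.duplicated(['sender', 'receiver','date'],keep=False)]
>>> # Pair each remaining row with every row that shares its date
>>> df_possible_dup = pd.merge(df_rem, df, on='date', suffixes=('', '_j'))
>>> df_possible_dup.apply(lambda x: fuzz.ratio(x['sender'], x['sender_j']) >= fuzz_ratio and x['id'] != x['id_j'], axis=1)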
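To get a feel for what a threshold such as 50 means, it can help to score a few name pairs by hand. The sketch below uses `difflib.SequenceMatcher` from the standard library as a stand-in for `fuzz.ratio`, so it runs without installing anything. The helper name `score` and the misspelling "salmaan" are made up for illustration, and the numbers may differ from what fuzzywuzzy reports.

```python
from difflib import SequenceMatcher


def score(a, b):
    """Return a 0-100 similarity score (stand-in for fuzz.ratio, not fuzzywuzzy itself)."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


# "salmaan" is a hypothetical typo; the other names come from the sample data.
for left, right in [("salman", "salman"), ("salman", "salmaan"), ("salman", "ahmed")]:
    print(left, right, score(left, right))
```

Identical strings score 100, a one-letter typo stays high, and unrelated names score low, so a cut-off of 50 is fairly permissive for short names like these.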
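I don't know your exact requirements, but you may want to check that either the sender or the receiver matches exactly while the other field matches approximately. In that case you can use a custom function: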
def worker(x, fuzz_ratio):
    # A row paired with itself is not a duplicate
    if x['id'] == x['id_j']:
        return False
    # Exact sender, approximately matching receiver
    if x['sender'] == x['sender_j'] and fuzz.ratio(x['receiver'], x['receiver_j']) > fuzz_ratio:
        return True
    # Exact receiver, approximately matching sender
    if x['receiver'] == x['receiver_j'] and fuzz.ratio(x['sender'], x['sender_j']) > fuzz_ratio:
        return True
    return False
>>> df_possible_dup.apply(lambda x: worker(x, fuzz_ratio), axis=1)
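For anyone who wants to try the idea end to end without pandas or fuzzywuzzy, here is a minimal standard-library sketch of the same approach: compare every pair of rows that share a date, and flag the pair when both sender and receiver are similar enough. `difflib.SequenceMatcher` again stands in for `fuzz.ratio`. The function names, the threshold of 80, and the misspelled row 1114 are assumptions for illustration and are not from the question.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Return a 0-100 similarity score (stand-in for fuzz.ratio)."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def fuzzy_duplicates(rows, threshold=80):
    """Return (id, id) pairs with the same date and similar sender and receiver."""
    pairs = []
    for r1, r2 in combinations(rows, 2):
        if r1["date"] != r2["date"]:
            continue  # only rows on the same date can be duplicates
        if (similarity(r1["sender"], r2["sender"]) >= threshold
                and similarity(r1["receiver"], r2["receiver"]) >= threshold):
            pairs.append((r1["id"], r2["id"]))
    return pairs


rows = [
    {"sender": "salman", "receiver": "akhtar", "date": 20161201, "id": 1111},
    {"sender": "akhtar", "receiver": "salman", "date": 20161201, "id": 1112},
    {"sender": "nabeel", "receiver": "ahmed", "date": 20161201, "id": 1113},
    # Hypothetical misspelling of "salman", added to exercise the fuzzy match
    {"sender": "salmaan", "receiver": "akhtar", "date": 20161201, "id": 1114},
    {"sender": "salman", "receiver": "akhtar", "date": 20161202, "id": 1115},
]

print(fuzzy_duplicates(rows))
```

With this data the only flagged pair is rows 1111 and 1114. Row 1112 has sender and receiver swapped and row 1115 falls on a different date, so neither is reported. Comparing all pairs is quadratic in the number of rows, so for a large sheet it would probably be worth grouping by date first, as the merge on `date` does in the pandas version.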