Finding duplicate values based on a condition

Time: 2017-03-24 19:17:21

Tags: python-2.7 fuzzywuzzy

Here is the sample data:

1 ,ASIF JAVED IQBAL JAVED,JAVED IQBAL SO INAYATHULLAH,20170103
2 ,SYED MUSTZAR ALI MUHAMMAD ILYAS SHAH,MUHAMMAD SAFEER SO SAGHEER KHAN,20170127
3 ,AHSUN SABIR SABIR ALI,MISBAH NAVEED DO NAVEED ANJUM,20170116
4 ,RASHAD IQBAL PARVAIZ IQBAL,PERVAIZ IQBAL SO GUL HUSSAIN KHAN,20170104
5 ,RASHID ALI MUGHERI ABDUL RASOOL MUGHERI,MUMTAZ ALI BOHIO,20170105
6 ,FAKHAR IMAM AHMAD ALI,MOHAMMAD AKHLAQ ASHIQ HUSSAIN,20170105
7 ,AQEEL SARWAR MUHAMMAD SARWAR BUTT,BUSHRA WAHID,20170106
8 ,SHAFAQAT ALI REHMAT ALI,SAJIDA BIBI WO MUHAMMAD ASHRAF,20170106
9 ,MUHAMMAD ISMAIL SHAFQAT HUSSAIN,USAMA IQBAL,20170103
10 ,SULEMAN ALI KACHI GHULAM ALI,MUHAMMAD SHARIF ALLAH WARAYO,20170109
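
The first column is the serial number, the second is the sender, the third is the receiver, and the fourth is the date. The data continues like this for millions of rows.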
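
To make the column layout concrete, here is a minimal sketch that reads such lines into named records. The `Record` type, its field names, and the `parse_rows` helper are hypothetical and chosen only for illustration; the sketch assumes every valid line has exactly four comma-separated fields.

```python
import csv
import io
from collections import namedtuple

# Hypothetical record type; the field names are illustrative only.
Record = namedtuple("Record", ["serial", "sender", "receiver", "date"])


def parse_rows(lines):
    """Yield one Record per CSV line, stripping stray whitespace from each field."""
    for row in csv.reader(lines):
        if len(row) != 4:
            continue  # skip blank or malformed lines
        yield Record(*(field.strip() for field in row))


# Two lines taken from the sample data above.
sample = io.StringIO(
    u"1 ,ASIF JAVED IQBAL JAVED,JAVED IQBAL SO INAYATHULLAH,20170103\n"
    u"2 ,SYED MUSTZAR ALI MUHAMMAD ILYAS SHAH,MUHAMMAD SAFEER SO SAGHEER KHAN,20170127\n"
)
records = list(parse_rows(sample))
```

Stripping each field matters here because the serial number in the sample is followed by a space before the comma.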

Now I want to find which senders sent a package to the same receiver on the same day.
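
If the names were spelled identically every time, this could be answered in a single pass by grouping on the (sender, receiver, date) triple. The sketch below assumes exact matches only, so it would miss spelling variants; `same_day_repeats` is a hypothetical helper name, and the second input row is made up to show a repeat.

```python
from collections import defaultdict


def same_day_repeats(rows):
    """Map (sender, receiver, date) to its serial numbers, keeping only repeated triples."""
    groups = defaultdict(list)
    for serial, sender, receiver, date in rows:
        key = (sender.strip(), receiver.strip(), date.strip())
        groups[key].append(serial.strip())
    return {key: serials for key, serials in groups.items() if len(serials) > 1}


rows = [
    ("1", "ASIF JAVED IQBAL JAVED", "JAVED IQBAL SO INAYATHULLAH", "20170103"),
    # Made-up repeat of row 1, for illustration only.
    ("2", "ASIF JAVED IQBAL JAVED", "JAVED IQBAL SO INAYATHULLAH", "20170103"),
    ("9", "MUHAMMAD ISMAIL SHAFQAT HUSSAIN", "USAMA IQBAL", "20170103"),
]
repeats = same_day_repeats(rows)
```

Each row is touched once, so the cost grows linearly with the number of rows.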

I wrote the following basic code for this, but it is very slow.

from __future__ import print_function  # keeps print() working on Python 2.7

import csv
from fuzzywuzzy import fuzz

serial = []
rem_name = []
date = []

with open('janCSV.csv') as f:
    reader = csv.reader(f)

    for row in reader:
        serial.append(row[0])
        rem_name.append(row[2])
        date.append(row[3])  # the date is the fourth column (index 3)

with open('output.csv', 'w') as out:
    # Every row is compared with every other row (O(n^2)), which is why this is slow.
    # enumerate() gives the true row index; list.index() would return the first
    # occurrence of a repeated name and pick up the wrong date and serial.
    for i, rem1 in enumerate(rem_name):
        date1 = date[i]
        serial1 = serial[i]

        for j, rem2 in enumerate(rem_name):
            date2 = date[j]

            if date1 == date2:
                ratio = fuzz.ratio(rem1, rem2)

                # ratio < 100 skips a row matched against itself (and exact duplicates)
                if 90 <= ratio < 100:
                    print(serial1, rem1, rem2, date1, date2, ratio)
                    out.write(','.join([serial1, date1, date2, rem1, rem2, str(ratio)]) + '\n')
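
One possible way to cut the work, sketched here under stated assumptions rather than offered as a definitive fix, is to bucket rows by date first and run the fuzzy comparison only inside each bucket. To stay runnable without third-party packages, the sketch uses `difflib.SequenceMatcher` from the standard library as a stand-in scorer; `similarity` and `near_duplicates` are hypothetical helper names, and `fuzz.ratio` could be passed as `scorer` instead. The second and third input rows are made up for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Stand-in scorer on a 0-100 scale, built on the standard library."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def near_duplicates(rows, scorer=similarity, low=90, high=100):
    """Yield (serial1, serial2, name1, name2, date, score) for similar names on the same date."""
    by_date = defaultdict(list)
    for serial, name, date in rows:
        by_date[date].append((serial, name))

    for date, entries in by_date.items():
        # combinations() visits each unordered pair once and never pairs a row with itself.
        for (s1, n1), (s2, n2) in combinations(entries, 2):
            score = scorer(n1, n2)
            if low <= score < high:
                yield s1, s2, n1, n2, date, score


rows = [
    ("4", "PERVAIZ IQBAL SO GUL HUSSAIN KHAN", "20170104"),
    ("11", "PARVAIZ IQBAL SO GUL HUSSAIN KHAN", "20170104"),  # made-up spelling variant
    ("12", "PARVAIZ IQBAL SO GUL HUSSAIN KHAN", "20170105"),  # made-up, different date
    ("7", "BUSHRA WAHID", "20170104"),
]
matches = list(near_duplicates(rows))
```

This never compares rows from different dates and halves the remaining comparisons by skipping mirrored pairs. It is still quadratic within each date, so a very busy day would remain expensive.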

0 answers:

No answers yet