Fuzzy-logic based de-duplication of rows (single/multiple sheets)

Date: 2019-06-04 12:21:28

Tags: python-3.x duplicates spreadsheet fuzzy-logic

I am a rookie coder, but eager to learn and understand these techniques. Would you kindly point me in the right direction? I have a dataset that contains:

1. Name - spelled differently
2. Address - spelled differently
3. City - constant
4. State - constant
5. Pincode - constant in 85% of the data

I have 12 files that were merged; the result now contains 500,000 rows with duplicated values. I am looking for the duplicates and creating a unique list.
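
Before any fuzzy matching, the exact duplicates among the merged rows can be dropped cheaply. A minimal sketch, assuming the 12 pipe-delimited files sit under a txt_files/ folder (the folder, glob pattern, and output file name are assumptions; the column names and the '|' delimiter come from the snippets below):

    import glob
    import pandas as pd

    # Assumption: the 12 merged files are pipe-delimited CSVs under txt_files/.
    frames = [pd.read_csv(path, delimiter='|') for path in glob.glob('txt_files/*.csv')]
    merged = pd.concat(frames, ignore_index=True)

    # Drop rows that are exact duplicates across all five columns first,
    # so the fuzzy comparison later has fewer rows to deal with.
    merged = merged.drop_duplicates(subset=['name', 'address', 'city', 'state', 'pincode'])
    merged.to_csv('merged_exact_deduped.csv', sep='|', index=False)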

I tried Fuzzywuzzy and attempted to solve it in two ways:

1. Keep the CSVs separate and compare them against each other
2. Merge all the CSVs and apply fuzzy logic on the address column
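
For reference, this is roughly what a single fuzzywuzzy comparison looks like (a tiny illustration with made-up address strings, not the real data):

    from fuzzywuzzy import fuzz

    a = '12 MG Road, Near City Mall'
    b = 'MG Road 12, near city mall'

    print(fuzz.ratio(a, b))             # character-level ratio, sensitive to word order
    print(fuzz.token_sort_ratio(a, b))  # tokens are sorted first, so word order is ignored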

First approach (so slow it kills the Jupyter notebook):

    import pandas as pd
    from fuzzywuzzy import fuzz

    # df1 is loaded earlier (not shown in the question); df2 is the second file.
    df2 = pd.read_csv('apollo-munich-network-hospital.csv', delimiter='|')

    # Build every (df1 address, df2 address) pair: len(df1) * len(df2) rows.
    compare = pd.MultiIndex.from_product([df1['address'],
                                          df2['address']]).to_series()

    def metrics(tup):
        # Score one pair with two fuzzywuzzy metrics.
        return pd.Series([fuzz.ratio(*tup),
                          fuzz.token_sort_ratio(*tup)],
                         ['ratio', 'token'])
    test = compare.apply(metrics)
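
The cross product above has len(df1) × len(df2) pairs, which is what makes it so slow at this scale. One common workaround is blocking: only compare addresses that already agree on a cheap key. The sketch below assumes both frames carry 'city' and 'pincode' columns and reuses the df1/df2 from the snippet above:

    # Blocking sketch: pair up only the rows that share city + pincode,
    # instead of building the full cross product.
    pairs = df1.merge(df2, on=['city', 'pincode'], suffixes=('_1', '_2'))

    pairs['ratio'] = pairs.apply(
        lambda r: fuzz.ratio(r['address_1'], r['address_2']), axis=1)
    pairs['token'] = pairs.apply(
        lambda r: fuzz.token_sort_ratio(r['address_1'], r['address_2']), axis=1)

    # Keep only the likely duplicates for review.
    candidates = pairs[pairs['token'] >= 85]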

Second approach (so slow that the kernel was interrupted):

    import csv
    import pandas as pd
    from fuzzywuzzy import fuzz

    data = pd.read_csv('txt_files/hospital_to_clean_duplicates.csv', delimiter='|')
    with open('data_out.csv', 'w') as f1:
        # Pipe-delimited writer for the cleaned output.
        writer = csv.writer(f1, delimiter='|', lineterminator='\n')

        def remove_duplicates_inplace(data, groupby=[], similarity_field='', similar_level=85):
            def check_simi(d):
                # Compare every pair of values in the group and collect the
                # indexes of the later rows that score at or above the threshold.
                dupl_indexes = []
                for i in range(len(d.values) - 1):
                    for j in range(i + 1, len(d.values)):
                        if fuzz.token_sort_ratio(d.values[i], d.values[j]) >= similar_level:
                            dupl_indexes.append(d.index[j])
                return dupl_indexes

            indexes = data.groupby(groupby)[similarity_field].apply(check_simi)
            for index_list in indexes:
                data.drop(index_list, inplace=True)

        # Note: with 'address' in the groupby, each group only holds rows whose
        # addresses are already identical, so two differently spelled addresses
        # are never compared against each other.
        remove_duplicates_inplace(data, groupby=['name', 'address', 'pincode', 'city', 'state'],
                                  similarity_field='address')

        # Write the de-duplicated rows (with a header) to the output file.
        writer.writerow(data.columns)
        writer.writerows(data.values.tolist())

Expected output:
1. name
2. address (can I choose the longest string as the parent and compare the others with it? See the sketch after these lists.)
3. city
4. state
5. pincode

Separate sheet with (if possible to evaluate):
1. name
2. city
3. state
4. pincode
5. address1 - original
6. matched address
7. match percentage
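
One possible shape for both outputs (a rough sketch under assumptions, not a tested solution): block on the columns that are reliable (city, state, pincode), treat the longest address inside each fuzzy cluster as the parent, and record every match with its score for the second sheet. The output file names 'unique_hospitals.csv' and 'matched_addresses.csv' are made up, and blocking on pincode will miss matches for the rows where the pincode is not consistent.

    from fuzzywuzzy import fuzz
    import pandas as pd

    def dedupe_group(group, threshold=85):
        # Sort longest address first so the first unmatched row of each
        # cluster (the longest spelling) becomes the parent.
        order = group['address'].str.len().sort_values(ascending=False).index
        group = group.loc[order]

        parents, matches = [], []
        for _, row in group.iterrows():
            match = None
            for parent in parents:
                score = fuzz.token_sort_ratio(row['address'], parent['address'])
                if score >= threshold:
                    match = (parent, score)
                    break
            if match is None:
                parents.append(row)  # a new unique address becomes a parent
            else:
                parent, score = match
                matches.append({'name': row['name'], 'city': row['city'],
                                'state': row['state'], 'pincode': row['pincode'],
                                'address1': row['address'],
                                'matched_address': parent['address'],
                                'match_percentage': score})
        return pd.DataFrame(parents), pd.DataFrame(matches)

    data = pd.read_csv('txt_files/hospital_to_clean_duplicates.csv', delimiter='|')
    kept, reports = [], []
    for _, block in data.groupby(['city', 'state', 'pincode']):
        unique_rows, report_rows = dedupe_group(block)
        kept.append(unique_rows)
        reports.append(report_rows)

    pd.concat(kept).to_csv('unique_hospitals.csv', sep='|', index=False)
    pd.concat(reports).to_csv('matched_addresses.csv', sep='|', index=False)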

0 Answers:

No answers