Merging similar string values in a CSV file

Time: 2019-07-05 22:24:11

Tags: python fuzzywuzzy

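I have a CSV file of transactions, with the vendor name in one column and the transaction amount in another. The goal is to find the vendor with the largest total transactions. That part is simple enough, and I have the following code: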

import csv

with open('Transactions.csv', newline='') as Vendor_Data:
    file_reader = csv.reader(Vendor_Data, delimiter=',')
    vendor_dict = {}  # vendor name -> [transaction count, total amount]
    next(file_reader)  # skip the header row
    for row in file_reader:
        if row[3] not in vendor_dict:
            vendor_dict[row[3]] = [0, 0]
        # count every row, including the first one seen for a vendor
        vendor_dict[row[3]][0] += 1
        vendor_dict[row[3]][1] += round(float(row[1]), 2)

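The problem is that in many entries the same vendor is spelled slightly differently (for example, two slightly different spellings of Delta Air Lines). What is the best way to detect these similar names (for example, with fuzzywuzzy) while looping over the CSV file, so that their transaction counts and amounts are merged?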

2 answers:

Answer 0: (score: 0)

import csv

from fuzzywuzzy import fuzz

with open('Transactions.csv', newline='') as Vendor_Data:
    file_reader = csv.reader(Vendor_Data, delimiter=',')
    vendor_dict = {}  # vendor name -> [transaction count, total amount]
    next(file_reader)  # skip the header row
    for row in file_reader:

        # we can't use the dictionary directly (e.g. "key in vendor_dict")
        # because we want to do a similarity search.
        csv_name = row[3]
        amount = round(float(row[1]), 2)
        # use .iteritems() instead of .items() on Python 2
        for vendor_name, vendor_values in vendor_dict.items():

            # this is *a* way to do it. You may want to use different scores
            # or even a different comparison
            if fuzz.token_set_ratio(csv_name, vendor_name) > 80:
                vendor_values[0] += 1
                vendor_values[1] += amount
                break
        else:
            # we didn't find anything similar enough, so create an entry
            # under this spelling; later variants are merged into it
            vendor_dict[csv_name] = [1, amount]
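
Once the dictionary is built, picking the top vendor is a one-liner. The following is a minimal sketch, assuming vendor_dict maps each name to a [count, total] pair as in the code above; the sample names and figures are made up for illustration.

```python
# Hypothetical sample in the same shape as vendor_dict: name -> [count, total]
vendor_dict = {
    "Delta Air Lines": [3, 1250.40],
    "Office Depot": [5, 310.15],
    "Uber": [12, 298.60],
}

# Vendor with the largest summed amount (index 1 of each value)
top_by_amount = max(vendor_dict, key=lambda name: vendor_dict[name][1])

# Vendor with the most transactions (index 0 of each value)
top_by_count = max(vendor_dict, key=lambda name: vendor_dict[name][0])

print(top_by_amount, top_by_count)
```

Which of the two you want depends on whether "largest total transactions" means the summed amount or the number of transactions.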

Answer 1: (score: 0)

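Read the CSV file with pandas. Then add a new column that holds the fuzzywuzzy match percentage.

Choose a threshold that decides which percentage should be treated as the same string, then filter with the isin() method and sum the values of the transaction amount column.

Loop this over the whole DataFrame and you will get the desired result.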
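
A rough sketch of this idea follows. It is a variation, not necessarily what the answer had in mind: instead of filtering with isin(), it relabels each row with the first similar spelling it has seen and then groups by that label. The column names vendor and amount, the sample rows, and the helper names are all assumptions. The scorer uses the standard library's difflib.SequenceMatcher as a stand-in so that the example runs without extra packages; fuzzywuzzy's fuzz.ratio or fuzz.token_set_ratio could be dropped in at that point.

```python
from difflib import SequenceMatcher

import pandas as pd


def similarity(a, b):
    """Return a 0-100 similarity score (stand-in for a fuzzywuzzy scorer)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def merge_similar_vendors(df, threshold=80):
    """Relabel similar vendor names with one spelling, then total per vendor."""
    canonical = []  # first-seen spelling of each distinct vendor
    labels = []     # canonical spelling chosen for each row
    for name in df["vendor"]:
        match = next((c for c in canonical if similarity(name, c) > threshold), None)
        if match is None:
            canonical.append(name)
            match = name
        labels.append(match)
    grouped = df.assign(vendor=labels).groupby("vendor")["amount"]
    return grouped.agg(count="count", total="sum")


# Hypothetical sample data; a real file would be loaded with pd.read_csv(...)
df = pd.DataFrame({
    "vendor": ["Delta Airlines", "Delta Air lines", "Office Depot", "Delta Airlines"],
    "amount": [100.0, 50.0, 20.0, 25.0],
})
summary = merge_similar_vendors(df)
print(summary)
print(summary["total"].idxmax())  # vendor with the largest summed amount
```

The threshold of 80 is arbitrary here. Too low a value merges unrelated vendors, too high a value misses variants, so it is worth checking the merged names by eye on real data.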