How can I populate a column by finding the best match between a list and another column in pandas?

Time: 2020-09-11 15:08:11

Tags: python python-3.x regex pandas dataframe

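I'm trying to use Pandas / Python to process a bank account spreadsheet that has posting dates, transaction descriptions, and amounts. I want to create a new column called "VENDOR Name" that reads the transaction description and fills the new column with the best match from the list of vendor names stored in vendors. Below is an example of what I have tried (using a function I found on Stack Overflow). The descriptions have been altered to remove sensitive information, but the format is the same. I have a vendor spreadsheet called vendor_type.csv that contains a much larger list of vendors than the one shown here. I will convert it to a list with vendors = vendors_df['vendor_name'].tolist(), and it will have the same format as below.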

In [1]: import pandas as pd
   ...: import numpy as np
   ...: import re

In [2]: df = pd.DataFrame({'Posting Date': ['2020-02-20', '2020-02-20', '2020-02-20', '2020-02-21', '2020-02-21'],
   ...:                   'Description': ['CHECK 12345', 'CHECK 1234', 'FPL DIRECT DEBIT ELEC PYMT', 'CHECK 9874', 'ADP PAYROLL FEES ADP - FEES'],
   ...:                   'Amount': [-500, -700, -400, -600, -90]})

In [3]: print(df)
  Posting Date                  Description  Amount
0   2020-02-20                  CHECK 12345    -500
1   2020-02-20                   CHECK 1234    -700
2   2020-02-20   FPL DIRECT DEBIT ELEC PYMT    -400
3   2020-02-21                   CHECK 9874    -600
4   2020-02-21  ADP PAYROLL FEES ADP - FEES     -90

In [4]: vendors = ['PAYROLL CHECK', 'FPL', 'ADP Payroll fees']
   ...: pattern = '|'.join(vendors)

In [5]: def pattern_searcher(search_str:str, search_list:str):
   ...:     search_obj = re.search(search_list, search_str)
   ...:     if search_obj:
   ...:         return_str = search_str[search_obj.start(): search_obj.end()]
   ...:     else:
   ...:         return_str = 'NA'
   ...:     return return_str
   ...:     

In [6]: df['VENDOR Name'] = df['Description'].apply(lambda x: pattern_searcher(search_str=x, search_list=pattern))

In [7]: print(df)
  Posting Date                  Description  Amount VENDOR Name
0   2020-02-20                  CHECK 12345    -500          NA
1   2020-02-20                   CHECK 1234    -700          NA
2   2020-02-20   FPL DIRECT DEBIT ELEC PYMT    -400         FPL
3   2020-02-21                   CHECK 9874    -600          NA
4   2020-02-21  ADP PAYROLL FEES ADP - FEES     -90          NA

The final result should look like this:

  Posting Date                  Description  Amount       VENDOR Name
0   2020-02-20      CHECK 12345 VENDOR_NAME    -500      CHECK-VENDOR
1   2020-02-20                   CHECK 1234    -700     PAYROLL CHECK
2   2020-02-20   FPL DIRECT DEBIT ELEC PYMT    -400               FPL
3   2020-02-21                   CHECK 9874    -600     PAYROLL CHECK
4   2020-02-21  ADP PAYROLL FEES ADP - FEES     -90  ADP Payroll fees

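I would still like to use the function above, which I used to categorize one of the transactions (because it works), but that is not required. I would also like to use RegEx rules in case the vendor list expands. I'm a bit stuck on this and would greatly appreciate any insight on how to do it.

Thanks.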

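1 Answer:

Answer 0 (score: 0)

You don't want to match a pattern (regex). You want to find the similarity between the vendor name and the description. This can be done in many ways, but I really like fuzzywuzzy: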

import pandas

from typing import Optional
from fuzzywuzzy import fuzz, process


# Your input data
df = pandas.DataFrame(
    {
        "Posting Date": [
            "2020-02-20",
            "2020-02-20",
            "2020-02-20",
            "2020-02-21",
            "2020-02-21",
        ],
        "Description": [
            "CHECK 12345",
            "CHECK 1234",
            "FPL DIRECT DEBIT ELEC PYMT",
            "CHECK 9874",
            "ADP PAYROLL FEES ADP - FEES",
        ],
        "Amount": [-500, -700, -400, -600, -90],
    }
)

# List of vendors (can be loaded from file...)
vendors = ["PAYROLL CHECK", "FPL", "ADP Payroll fees"]


def matcher(description: str) -> Optional[str]:
    """Function that matches a description of a payment to a
    vendor in a list of vendors (fuzzy match).

    Args:
        description (str): The description to read

    Returns:
        str|None: The matching vendor (if we're certain enough about the match)
    """
    match, certainty = process.extractOne(
        description, vendors, scorer=fuzz.partial_ratio
    )
    if certainty >= 50:
        return match
    else:
        return None


df["VENDOR Name"] = df["Description"].apply(matcher)
df

Output:

  Posting Date                  Description  Amount       VENDOR Name
0   2020-02-20                  CHECK 12345    -500     PAYROLL CHECK
1   2020-02-20                   CHECK 1234    -700     PAYROLL CHECK
2   2020-02-20   FPL DIRECT DEBIT ELEC PYMT    -400               FPL
3   2020-02-21                   CHECK 9874    -600     PAYROLL CHECK
4   2020-02-21  ADP PAYROLL FEES ADP - FEES     -90  ADP Payroll fees

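Note: the certainty value is a score for how good the best match is. Checking it is optional; you could simply return the first/best match regardless of its score.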
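
If you would rather stay with a regex, as the question asks, one likely reason the original attempt returned NA for the ADP row is that re.search is case-sensitive, so 'ADP Payroll fees' does not match 'ADP PAYROLL FEES'. The following is only a minimal sketch of a case-insensitive variant that also returns the vendor name as spelled in the list. The names find_vendor and canonical are made up for illustration, and the sketch assumes plain ASCII vendor names.

```python
import re

vendors = ["PAYROLL CHECK", "FPL", "ADP Payroll fees"]

# Escape each name so characters such as "." or "+" are matched literally.
# Longer names go first so that, at a given position, a longer name is
# preferred over a shorter one that is its prefix.
pattern = re.compile(
    "|".join(re.escape(v) for v in sorted(vendors, key=len, reverse=True)),
    flags=re.IGNORECASE,
)

# Map the lower-cased text back to the spelling used in the vendor list
# (hypothetical helper, assumes ASCII names).
canonical = {v.lower(): v for v in vendors}


def find_vendor(description):
    """Return the listed vendor name found in description, or None."""
    match = pattern.search(description)
    if match is None:
        return None
    found = match.group(0)
    # Fall back to the matched text if the lookup misses for any reason.
    return canonical.get(found.lower(), found)
```

It can be applied the same way as matcher, for example df["VENDOR Name"] = df["Description"].apply(find_vendor). Rows such as 'CHECK 1234' still come back as None with this approach, because 'PAYROLL CHECK' does not literally appear in them; those are the rows where a fuzzy score with a cutoff is needed.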