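I am trying to use Pandas/Python to process a bank account spreadsheet that has a posting date, a transaction description, and an amount. I want to create a new column called "VENDOR Name" that reads the transaction description and fills the new column with the best match for the vendor name from a list of vendors stored in vendors. Below is an example of what I have tried (using a function I found on Stack Overflow). The descriptions have been changed to remove sensitive information, but the format is the same. I have a vendor spreadsheet named vendor_type.csv that contains a much larger list of vendors than the one shown here. I will convert it to a list with vendors = vendors_df['vendor_name'].tolist(), and it will have the same format as the list below.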
In [1]: import pandas as pd
...: import numpy as np
...: import re
In [2]: df = pd.DataFrame({'Posting Date': ['2020-02-20', '2020-02-20', '2020-02-20', '2020-02-21', '2020-02-21'],
...: 'Description': ['CHECK 12345', 'CHECK 1234', 'FPL DIRECT DEBIT ELEC PYMT', 'CHECK 9874', 'ADP PAYROLL FEES ADP - FEES'],
...: 'Amount': [-500, -700, -400, -600, -90]})
In [3]: print(df)
   Posting Date                  Description  Amount
0    2020-02-20                  CHECK 12345    -500
1    2020-02-20                   CHECK 1234    -700
2    2020-02-20   FPL DIRECT DEBIT ELEC PYMT    -400
3    2020-02-21                   CHECK 9874    -600
4    2020-02-21  ADP PAYROLL FEES ADP - FEES     -90
In [4]: vendors = ['PAYROLL CHECK', 'FPL', 'ADP Payroll fees']
...: pattern = '|'.join(vendors)
In [5]: def pattern_searcher(search_str:str, search_list:str):
...:     search_obj = re.search(search_list, search_str)
...:     if search_obj:
...:         return_str = search_str[search_obj.start(): search_obj.end()]
...:     else:
...:         return_str = 'NA'
...:     return return_str
...:
In [6]: df['VENDOR Name'] = df['Description'].apply(lambda x: pattern_searcher(search_str=x, search_list=pattern))
In [7]: print(df)
   Posting Date                  Description  Amount  VENDOR Name
0    2020-02-20                  CHECK 12345    -500           NA
1    2020-02-20                   CHECK 1234    -700           NA
2    2020-02-20   FPL DIRECT DEBIT ELEC PYMT    -400          FPL
3    2020-02-21                   CHECK 9874    -600           NA
4    2020-02-21  ADP PAYROLL FEES ADP - FEES     -90           NA
The final result should look like this:
   Posting Date                  Description  Amount       VENDOR Name
0    2020-02-20      CHECK 12345 VENDOR_NAME    -500      CHECK-VENDOR
1    2020-02-20                   CHECK 1234    -700     PAYROLL CHECK
2    2020-02-20   FPL DIRECT DEBIT ELEC PYMT    -400               FPL
3    2020-02-21                   CHECK 9874    -600     PAYROLL CHECK
4    2020-02-21  ADP PAYROLL FEES ADP - FEES     -90  ADP Payroll fees
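I would still like to use the function above, which I used to categorize one transaction (since it works), but that is not required. I would also like to use RegEx rules in case the vendor list does expand. I am a bit stuck on this and would really appreciate any insight into how to do it.
Thanks.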
Answer 0 (score: 0)
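You don't want to match a pattern (regular expression). You want to find the similarity between the vendor name and the description. This can be done in several ways, but I really like fuzzywuzzy: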
import pandas
from typing import Optional
from fuzzywuzzy import fuzz, process
# Your input data
df = pandas.DataFrame(
    {
        "Posting Date": [
            "2020-02-20",
            "2020-02-20",
            "2020-02-20",
            "2020-02-21",
            "2020-02-21",
        ],
        "Description": [
            "CHECK 12345",
            "CHECK 1234",
            "FPL DIRECT DEBIT ELEC PYMT",
            "CHECK 9874",
            "ADP PAYROLL FEES ADP - FEES",
        ],
        "Amount": [-500, -700, -400, -600, -90],
    }
)
# List of vendors (can be loaded from file...)
vendors = ["PAYROLL CHECK", "FPL", "ADP Payroll fees"]
def matcher(description: str) -> Optional[str]:
    """Function that matches a description of a payment to a
    vendor in a list of vendors (fuzzy match).

    Args:
        description (str): The description to read

    Returns:
        str|None: The matching vendor (if we're certain enough about the match)
    """
    match, certainty = process.extractOne(
        description, vendors, scorer=fuzz.partial_ratio
    )
    if certainty >= 50:
        return match
    else:
        return None
df["VENDOR Name"] = df["Description"].apply(matcher)
df
Output:
   Posting Date                  Description  Amount       VENDOR Name
0    2020-02-20                  CHECK 12345    -500     PAYROLL CHECK
1    2020-02-20                   CHECK 1234    -700     PAYROLL CHECK
2    2020-02-20   FPL DIRECT DEBIT ELEC PYMT    -400               FPL
3    2020-02-21                   CHECK 9874    -600     PAYROLL CHECK
4    2020-02-21  ADP PAYROLL FEES ADP - FEES     -90  ADP Payroll fees
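Note: certainty is the match score (0 to 100), which indicates how good the match is. The threshold check is optional, since you could simply return the first/best match.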
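To illustrate what the threshold does, here is a minimal sketch that uses only the standard library's difflib. It is not how fuzzywuzzy computes its scores: the helper names best_match and match_with_cutoff are made up for this example, and the score is a plain case-insensitive similarity ratio over the whole string, so the numbers will differ from those of fuzz.partial_ratio. The point is only to show the difference between "always return the best candidate" and "return it only when the score clears a cutoff".

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

# Same vendor list as in the example above
vendors = ["PAYROLL CHECK", "FPL", "ADP Payroll fees"]


def best_match(description: str, choices: List[str]) -> Tuple[str, float]:
    """Hypothetical helper: return the most similar choice and its score (0-100).

    The score is difflib's similarity ratio, compared case-insensitively.
    This is an assumption for illustration, not fuzzywuzzy's scorer.
    """
    scored = [
        (choice, 100 * SequenceMatcher(None, description.lower(), choice.lower()).ratio())
        for choice in choices
    ]
    # max() keeps the first candidate when several share the top score
    return max(scored, key=lambda pair: pair[1])


def match_with_cutoff(
    description: str, choices: List[str], cutoff: float = 50.0
) -> Optional[str]:
    """Hypothetical helper: return the best choice only if its score reaches the cutoff."""
    choice, score = best_match(description, choices)
    return choice if score >= cutoff else None
```

With a cutoff of 0 the function always returns some vendor, even for a description that shares nothing with any name. A higher cutoff trades missed matches for fewer wrong ones, so the right value is best chosen by checking it against a sample of real descriptions.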