GOAL
Match the expense categories in a dataframe against a list of accepted categories.
APPROACH
Load the data into a dataframe and compare its values against the accepted list.
CODE
import pandas as pd
from pandasql import sqldf
from fuzzywuzzy import fuzz, process

# Load Excel file into a dataframe
xl = pd.read_excel(open("/../data/expenses.xlsx", 'rb'))

# Let's clarify how many similar categories exist...
q = """
SELECT DISTINCT Expense
FROM xl
ORDER BY Expense ASC
"""
expenses = sqldf(q)
print(expenses)

# Let's add some acceptable categories and use fuzzywuzzy to match
accepted = ['Severance', 'Legal Fees', 'Import & Export Fees', 'I.T. Fees', 'Board Fees', 'Acquisition Fees']

# Select from the list of accepted values and return the closest match
process.extractOne("Company Acquired", accepted, scorer=fuzz.token_set_ratio)
('Acquisition Fees', 38) is not a high score, but it is high enough that it returns the expected output.
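As an aside, the DISTINCT query above can also be written without pandasql; a minimal pandas sketch, with made-up sample values standing in for the real `xl` frame loaded from Excel:

```python
import pandas as pd

# Hypothetical sample data standing in for the real Excel file
xl = pd.DataFrame({'Expense': ['Legal Fee', 'Legal Fees', 'Legal Fee', 'Severance pay']})

# Equivalent of SELECT DISTINCT Expense FROM xl ORDER BY Expense ASC
expenses = xl['Expense'].drop_duplicates().sort_values().tolist()
print(expenses)  # ['Legal Fee', 'Legal Fees', 'Severance pay']
```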
!!!!! ISSUE !!!!!
# Time to loop through all the expenses and use FuzzyWuzzy to generate and return the closest matches.
def correct_expense(expense):
    for expense in expenses:
        return expense, process.extractOne(expense, accepted, scorer=fuzz.token_set_ratio)

correct_expense(expenses)
('Expense', ('Legal Fees', 47))
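The loop above hits `return` on its very first iteration (and iterating a DataFrame directly yields column names, not values), which is why only one pair ever comes back. A corrected shape that collects one match per value, sketched with a simple stdlib scorer from difflib standing in for fuzz.token_set_ratio, since the exact scoring library is not needed to show the control flow:

```python
from difflib import SequenceMatcher

accepted = ['Severance', 'Legal Fees', 'Import & Export Fees', 'I.T. Fees', 'Board Fees', 'Acquisition Fees']

def best_match(expense, choices):
    # Stand-in scorer: 0-100 similarity via difflib (NOT fuzzywuzzy's token_set_ratio)
    score = lambda a, b: int(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())
    # Return the (choice, score) pair with the highest score
    return max(((c, score(expense, c)) for c in choices), key=lambda pair: pair[1])

def correct_expenses(expense_values):
    # Collect every result instead of returning on the first iteration
    return [(e, best_match(e, accepted)) for e in expense_values]

# Hypothetical messy values; the real ones would come from expenses['Expense']
results = correct_expenses(['Legal Fee', 'Severance pay'])
```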
QUESTION
Answer 0 (score: 1)
The way I have done this in the past is to use the get_close_matches function from Python's difflib module. You can then create a function to get the closest match and apply it to the Expense column.
from difflib import get_close_matches

def correct_expense(row):
    accepted = ['Severance', 'Legal Fees', 'Import & Export Fees', 'I.T. Fees', 'Board Fees', 'Acquisition Fees']
    match = get_close_matches(row, accepted, n=1, cutoff=0.3)
    return match[0] if match else ''

df['Expense_match'] = df['Expense'].apply(correct_expense)
Here is the original Expense column with its values matched against the accepted list:
You may need to fine-tune the accepted list and the cutoff value for get_close_matches (I found 0.3 worked well for your sample data).
Once you are happy with the results, you can change the function to overwrite the Expense column and save to Excel with the pandas DataFrame method to_excel.
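Putting the pieces of this answer together, a self-contained run on a few made-up expense strings (the column values here are invented for illustration):

```python
import pandas as pd
from difflib import get_close_matches

accepted = ['Severance', 'Legal Fees', 'Import & Export Fees', 'I.T. Fees', 'Board Fees', 'Acquisition Fees']

def correct_expense(row):
    # Best match above the similarity cutoff, or '' if nothing qualifies
    match = get_close_matches(row, accepted, n=1, cutoff=0.3)
    return match[0] if match else ''

# Hypothetical messy column values
df = pd.DataFrame({'Expense': ['Legal Fee', 'Severance pay', 'zzz']})
df['Expense_match'] = df['Expense'].apply(correct_expense)
```

Swapping in the real Expense column and calling df.to_excel(...) afterwards completes the workflow.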
Answer 1 (score: 1)
This is called gazetteer deduplication.
You can perform deduplication by matching messy data against canonical data (i.e. the gazetteer).
pandas-dedupe does exactly this.
Example:
import pandas as pd
import pandas_dedupe

clean_data = pd.DataFrame({'street': ['Onslow square', 'Sydney Mews', 'Summer Place', 'Bury Walk', 'sydney mews']})
messy_data = pd.DataFrame({'street_name': ['Onslow sq', 'Sidney Mews', 'Summer pl', 'Onslow square', 'Bury walk', 'onslow sq', 'Bury Wall'],
                           'city': ['London', 'London', 'London', 'London', 'London', 'London', 'London']})

dd = pandas_dedupe.gazetteer_dataframe(
    clean_data,
    messy_data,
    field_properties='street_name',
    canonicalize=True,
)
During this process, pandas-dedupe will ask you to label a handful of examples as duplicate or distinct records. The library then uses that knowledge to find potentially duplicated entries, match them against the clean data, and return all relevant information, including a confidence score for each result.