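I'm trying to put together a general matching process for financial data. The goal is to take one data set with larger transactions and match it against a data set with smaller transactions. Some matches are one-to-many and others are one-to-one. Occasionally the direction may be reversed, so part of the approach is to feed the unmatched records back through in the opposite order to catch those possible matches.
I created three different modules that iterate over each other to do the job, but I'm not getting consistent results. I see possible matches in my data that should be picked up but aren't.
There is also no explicit matching criterion, so my assumption is this: if I sort the data sets by date and look for matching values, I want to take the first match, since it should be closest to the same time frame.
I'm using Pandas and Itertools, but possibly not in the ideal way. Any help getting consistent matches would be appreciated.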
Data examples:
Large Transaction Size:
AID AIssue Date AAmount
1508 3/14/2018 -560
1506 3/27/2018 -35
1500 4/25/2018 5000
Small Transaction Size:
BID BIssue Date BAmount
1063 3/6/2018 -300
1062 3/6/2018 -260
839 3/22/2018 -35
423 4/24/2018 5000
Expected Results:
AID AIssue Date AAmount BID BIssue Date BAmount
1508 3/14/2018 -560 1063 3/6/2018 -300
1508 3/14/2018 -560 1062 3/6/2018 -260
1506 3/27/2018 -35 839 3/22/2018 -35
1500 4/25/2018 5000 423 4/24/2018 5000
But I usually get:
AID AIssue Date AAmount BID BIssue Date BAmount
1508 3/14/2018 -560 1063 3/6/2018 -300
1508 3/14/2018 -560 1062 3/6/2018 -260
1506 3/27/2018 -35 839 3/22/2018 -35
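The 5000 is not matched. This is one example, but looking at larger data sets, the sign (positive or negative) doesn't seem to be a factor.
When reviewing the unmatched results of each run, I find at least one $5000 transaction that I expected to be a 1-1 match but that isn't in the results.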
import itertools
import pandas as pd

def matches(iterable):
    s = list(iterable)
    # Cap the combination size to avoid memory overrun on large datasets
    # (range(5) yields combinations of 0 to 4 items)
    s = list(itertools.chain.from_iterable(itertools.combinations(s, r) for r in range(5)))
    return [list(elem) for elem in s]
def one_to_many(dfL, dfS, dID = 0, dDT = 1, dVal = 2):
    # dfL = dataset with larger values
    # dfS = dataset with smaller values
    # dID = column index of ID record
    # dDT = column index of date record
    # dVal = column index of dollar value record
    S = dfS[dfS.columns[dID]].values.tolist()
    S_amount = dfS[dfS.columns[dVal]].values.tolist()
    S = matches(S)
    S_amount = matches(S_amount)
    # get ID of first large record, the ID to be matched in this module
    L = dfL[dfL.columns[dID]].iloc[0]
    # get value of first large record, this value will be the matching criterion
    L_amount = dfL[dfL.columns[dVal]].iloc[0]
    count_of_sets = len(S)
    for a in range(0,count_of_sets):
        list_of_items = S[a]
        list_of_values = S_amount[a]
        if round(sum(list_of_values),2) == round(L_amount,2):
            break
    if round(sum(list_of_values),2) == round(L_amount,2):
        retVal = list_of_items
    else:
        retVal = [-1]
    return retVal
def iterate_one_to_many(dfLarge, dfSmall, dID = 0, dDT = 1, dVal = 2):
    # dfLarge = dataset with larger values
    # dfSmall = dataset with smaller values
    # dID = column index of ID record
    # dDT = column index of date record
    # dVal = column index of dollar value record
    # returns a list of dataframes [paired matches, unmatched from dfLarge, unmatched from dfSmall]
    dfLarge = dfLarge.set_index(dfLarge.columns[dID]).sort_values([dfLarge.columns[dDT], dfLarge.columns[dVal]]).reset_index()
    dfSmall = dfSmall.set_index(dfSmall.columns[dID]).sort_values([dfSmall.columns[dDT], dfSmall.columns[dVal]]).reset_index()
    end_row = len(dfLarge.columns[dID]) - 1
    matches_master = pd.DataFrame(data = None, columns = dfLarge.columns.append(dfSmall.columns))
    for lg in range(0,end_row):
        sm_match_id = one_to_many(dfLarge, dfSmall)
        lg_match_id = dfLarge[dfLarge.columns[dID]][lg]
        if sm_match_id != [-1]:
            end_of_matches = len(sm_match_id)
            for sm in range(0, end_of_matches):
                if sm == 0:
                    sm_match = dfSmall.loc[dfSmall[dfSmall.columns[dID]] == sm_match_id[sm]].copy()
                    dfSmall = dfSmall.loc[dfSmall[dfSmall.columns[dID]] != sm_match_id[sm]].copy()
                else:
                    sm_match = sm_match.append(dfSmall.loc[dfSmall[dfSmall.columns[dID]] == sm_match_id[sm]].copy())
                    dfSmall = dfSmall.loc[dfSmall[dfSmall.columns[dID]] != sm_match_id[sm]].copy()
            lg_match = dfLarge.loc[dfLarge[dfLarge.columns[dID]] == lg_match_id].copy()
            sm_match['Match'] = lg
            lg_match['Match'] = lg
            sm_match.set_index('Match', inplace=True)
            lg_match.set_index('Match', inplace=True)
            matches = lg_match.join(sm_match, how='left')
            matches_master = matches_master.append(matches)
            dfLarge = dfLarge.loc[dfLarge[dfLarge.columns[dID]] != lg_match_id].copy()
    return [matches_master, dfLarge, dfSmall]
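Answer 0 (score: 0)
IIUC, the matching just means finding, for each transaction in the small DataFrame, the transaction in the large DataFrame that falls on the same date or on the nearest later date. You can use pandas.merge_asof() to perform the match based on the nearest date in the future.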
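import pandas as pd
# Ensure your dates are datetime
df_large['AIssue Date'] = pd.to_datetime(df_large['AIssue Date'])
df_small['BIssue Date'] = pd.to_datetime(df_small['BIssue Date'])
# merge_asof requires both frames to be sorted on their key columns
df_large = df_large.sort_values('AIssue Date')
df_small = df_small.sort_values('BIssue Date')
merged = pd.merge_asof(df_small, df_large, left_on='BIssue Date',
                       right_on='AIssue Date', direction='forward')
merged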
merged is now:
BID BAmount BIssue Date AID AAmount AIssue Date
0 1063 -300 2018-03-06 1508 -560 2018-03-14
1 1062 -260 2018-03-06 1508 -560 2018-03-14
2 839 -35 2018-03-22 1506 -35 2018-03-27
3 423 5000 2018-04-24 1500 5000 2018-04-25
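If you want some rows to be left unmatched, you can also use tolerance to restrict matching to a smaller window. That way, a missing value in one DataFrame won't throw everything else off.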
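As a rough illustration of the tolerance idea, here is a minimal sketch using the sample data from the question. The 7-day window and the name merged_tol are assumptions chosen only for the example; pick whatever window suits your data. Note that this still matches on date alone, not on amount.

```python
import pandas as pd

# Sample data copied from the question (dates written in ISO form).
df_large = pd.DataFrame({
    'AID': [1508, 1506, 1500],
    'AIssue Date': pd.to_datetime(['2018-03-14', '2018-03-27', '2018-04-25']),
    'AAmount': [-560, -35, 5000],
})
df_small = pd.DataFrame({
    'BID': [1063, 1062, 839, 423],
    'BIssue Date': pd.to_datetime(['2018-03-06', '2018-03-06', '2018-03-22', '2018-04-24']),
    'BAmount': [-300, -260, -35, 5000],
})

# merge_asof requires both frames to be sorted on their key columns.
df_large = df_large.sort_values('AIssue Date')
df_small = df_small.sort_values('BIssue Date')

# Assumption: a 7-day window, used here only for illustration.
merged_tol = pd.merge_asof(
    df_small, df_large,
    left_on='BIssue Date', right_on='AIssue Date',
    direction='forward',
    tolerance=pd.Timedelta(days=7),
)
print(merged_tol)
```

With this window, the two 3/6 rows are more than 7 days before the 3/14 transaction, so their A columns come back as NaN, while the 3/22 and 4/24 rows still pick up 1506 and 1500.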
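Answer 1 (score: 0)
In my function iterate_one_to_many, I was calculating the row count incorrectly. dfLarge.columns[dID] is a column name (a string), so len() returned the number of characters in that name rather than the number of rows, and the loop stopped before it reached every large transaction. I needed to replace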
end_row = len(dfLarge.columns[dID]) - 1
with
end_row = len(dfLarge.index)
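The difference between the two expressions can be checked on the sample data from the question. This is only a small sketch; old_end_row and new_end_row are names introduced here for comparison and are not part of the original code.

```python
import pandas as pd

# Sample "large" data from the question.
dfLarge = pd.DataFrame({
    'AID': [1508, 1506, 1500],
    'AIssue Date': ['3/14/2018', '3/27/2018', '4/25/2018'],
    'AAmount': [-560, -35, 5000],
})
dID = 0

# dfLarge.columns[dID] is the column name 'AID', so len() counts its characters.
old_end_row = len(dfLarge.columns[dID]) - 1   # len('AID') - 1
# len(dfLarge.index) counts the rows of the DataFrame.
new_end_row = len(dfLarge.index)

print(old_end_row, new_end_row)  # prints: 2 3
```

With the old expression the loop ran only twice on this data, so the third large transaction (the 5000) was never processed, which matches the missing row in the question.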