Finding one-to-many matches between two Pandas DataFrames

Time: 2018-06-05 13:08:26

Tags: python pandas itertools

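I am trying to put together a general matching process for financial data. The goal is to take one set of data with larger transactions and match it against a set of data with smaller transactions. Some matches are one-to-many, others are one-to-one. Occasionally the relationship is reversed, so part of the approach is to feed the unmatched records back through in the opposite order to catch those possible matches.

I created three different modules that iterate over each other to do the job, but I am not getting consistent results. I can see possible matches in my data that should be picked up but are not.

There is also no clear-cut matching criterion, so I assume that if I put the datasets in date order and look for matching values, I want to take the first match, since it should be closer to the same time frame.

I am using Pandas and Itertools, but possibly not in the ideal way. Any help getting consistent matches would be appreciated.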
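As a minimal sketch of the rule described above (sort by date, then take the first combination of small transactions whose amounts sum to the large one), something like the following could work. The helper name `first_subset_match` and its `max_size` parameter are hypothetical and not part of the original modules, and the column names are taken from the sample data.

```python
import itertools
import pandas as pd

def first_subset_match(target, small, max_size=4):
    """Return index labels of the first combination of rows in `small`
    whose BAmount values sum to `target`, or None if nothing matches.
    `small` is assumed to be sorted by date already."""
    labels = list(small.index)
    amounts = small["BAmount"].to_dict()
    # Try single rows first, then pairs, triples, ... up to max_size rows
    for r in range(1, max_size + 1):
        for combo in itertools.combinations(labels, r):
            if round(sum(amounts[i] for i in combo), 2) == round(target, 2):
                return list(combo)
    return None

large = pd.DataFrame({
    "AID": [1508, 1506, 1500],
    "AIssue Date": pd.to_datetime(["3/14/2018", "3/27/2018", "4/25/2018"]),
    "AAmount": [-560, -35, 5000],
})
small = pd.DataFrame({
    "BID": [1063, 1062, 839, 423],
    "BIssue Date": pd.to_datetime(["3/6/2018", "3/6/2018", "3/22/2018", "4/24/2018"]),
    "BAmount": [-300, -260, -35, 5000],
})

# A stable sort keeps same-day rows in their original order
large = large.sort_values("AIssue Date", kind="mergesort")
remaining = small.sort_values("BIssue Date", kind="mergesort")

pairs = []  # collected (AID, BID) matches
for aid, amount in zip(large["AID"], large["AAmount"]):
    hit = first_subset_match(amount, remaining)
    if hit is None:
        continue  # leave this large record unmatched
    pairs.extend((int(aid), int(remaining.loc[i, "BID"])) for i in hit)
    # Matched small rows are removed so they cannot be reused
    remaining = remaining.drop(index=hit)

print(pairs)
```

The number of combinations grows quickly with the size of the small dataset, so the `max_size` cap (or a date window applied before the search) matters on real data.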

Data examples:

Large Transaction Size:

AID    AIssue Date    AAmount
1508     3/14/2018   -560
1506     3/27/2018    -35
1500     4/25/2018   5000

Small Transaction Size:
BID     BIssue Date   BAmount
1063     3/6/2018     -300
1062     3/6/2018     -260
839      3/22/2018     -35
423      4/24/2018    5000

Expected Results
AID     AIssue Date   AAmount    BID     BIssue Date   BAmount
1508     3/14/2018     -560      1063      3/6/2018     -300
1508     3/14/2018     -560      1062      3/6/2018     -260
1506     3/27/2018      -35       839      3/22/2018     -35
1500     4/25/2018     5000       423      4/24/2018    5000

But I usually get:
AID     AIssue Date   AAmount    BID     BIssue Date   BAmount
1508     3/14/2018     -560      1063      3/6/2018     -300
1508     3/14/2018     -560      1062      3/6/2018     -260
1506     3/27/2018      -35       839      3/22/2018     -35

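with no match for the 5000. This is only one example, but when looking at larger datasets, positive versus negative amounts do not seem to be a factor.

When I review the unmatched records from each run, I find at least one $5,000 transaction that I expect to be a 1-to-1 match and that is not in the results.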

import itertools
import pandas as pd

def matches(iterable):
    s = list(iterable)
    #Only combinations of up to 4 items (range(5) yields sizes 0-4) to avoid memory overrun on large datasets
    s = list(itertools.chain.from_iterable(itertools.combinations(s, r) for r in range(5)))

    return [list(elem) for elem in s]

def one_to_many(dfL, dfS, dID = 0, dDT = 1, dVal = 2):   
    #dfL = dataset with larger values
    #dfS = dataset with smaller values
    #dID = column index of ID record
    #dDT = column index of date record
    #dVal = column index of dollar value record

    S = dfS[dfS.columns[dID]].values.tolist()
    S_amount = dfS[dfS.columns[dVal]].values.tolist()

    S = matches(S)
    S_amount = matches(S_amount)

    #get ID of first large record, the ID to be matched in this module
    L = dfL[dfL.columns[dID]].iloc[0]

    #get Value of first large record, this value will be matching criteria
    L_amount = dfL[dfL.columns[dVal]].iloc[0]

    count_of_sets = len(S)

    for a in range(0,count_of_sets):

        list_of_items = S[a]
        list_of_values = S_amount[a]

        if round(sum(list_of_values),2) == round(L_amount,2):
            break

    if round(sum(list_of_values),2) == round(L_amount,2):
        retVal = list_of_items
    else:
        retVal = [-1]

    return retVal

def iterate_one_to_many(dfLarge, dfSmall, dID = 0, dDT = 1, dVal = 2):
    #dfLarge = dataset with larger values
    #dfSmall = dataset with smaller values
    #dID = column index of ID record
    #dDT = column index of date record
    #dVal = column index of dollar value record

    #returns a list of dataframes [paired matches, unmatched from dfL, unmatched from dfS]

    dfLarge = dfLarge.set_index(dfLarge.columns[dID]).sort_values([dfLarge.columns[dDT], dfLarge.columns[dVal]]).reset_index()
    dfSmall = dfSmall.set_index(dfSmall.columns[dID]).sort_values([dfSmall.columns[dDT], dfSmall.columns[dVal]]).reset_index()

    end_row = len(dfLarge.columns[dID]) - 1

    matches_master = pd.DataFrame(data = None, columns = dfLarge.columns.append(dfSmall.columns))

    for lg in range(0,end_row):

        sm_match_id = one_to_many(dfLarge, dfSmall)
        lg_match_id = dfLarge[dfLarge.columns[dID]][lg]

        if sm_match_id != [-1]:

            end_of_matches = len(sm_match_id)

            for sm in range(0, end_of_matches):
                if sm == 0:
                    sm_match = dfSmall.loc[dfSmall[dfSmall.columns[dID]] == sm_match_id[sm]].copy()
                    dfSmall = dfSmall.loc[dfSmall[dfSmall.columns[dID]] != sm_match_id[sm]].copy()
                else:
                    sm_match = sm_match.append(dfSmall.loc[dfSmall[dfSmall.columns[dID]] == sm_match_id[sm]].copy())
                    dfSmall = dfSmall.loc[dfSmall[dfSmall.columns[dID]] != sm_match_id[sm]].copy()

            lg_match = dfLarge.loc[dfLarge[dfLarge.columns[dID]] == lg_match_id].copy()

            sm_match['Match'] = lg
            lg_match['Match'] = lg

            sm_match.set_index('Match', inplace=True)
            lg_match.set_index('Match', inplace=True)

            matches = lg_match.join(sm_match, how='left')
            matches_master = matches_master.append(matches)

            dfLarge = dfLarge.loc[dfLarge[dfLarge.columns[dID]] != lg_match_id].copy()

    return [matches_master, dfLarge, dfSmall]

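2 Answers:

Answer 0 (score: 0)

IIUC, the matching simply finds, for each transaction in the small DataFrame, the transaction in the large DataFrame that falls on the same date or the nearest later date. You can use pandas.merge_asof() to perform the match based on the nearest date in the future.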

import pandas as pd
# Ensure your dates are datetime
df_large['AIssue Date'] = pd.to_datetime(df_large['AIssue Date'])
df_small['BIssue Date'] = pd.to_datetime(df_small['BIssue Date'])

merged = pd.merge_asof(df_small, df_large, left_on='BIssue Date', 
                       right_on='AIssue Date', direction='forward')

merged is now:

    BID  BAmount BIssue Date   AID  AAmount AIssue Date
0  1063     -300  2018-03-06  1508     -560  2018-03-14
1  1062     -260  2018-03-06  1508     -560  2018-03-14
2   839      -35  2018-03-22  1506      -35  2018-03-27
3   423     5000  2018-04-24  1500     5000  2018-04-25

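If you expect that some records should never match, you can also pass tolerance to restrict matching to a smaller window. That way a missing value in one DataFrame will not throw everything else off.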
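A minimal sketch of the tolerance option, assuming the sample data from the question and an arbitrary 7-day window chosen only for illustration:

```python
import pandas as pd

df_large = pd.DataFrame({
    "AID": [1508, 1506, 1500],
    "AIssue Date": pd.to_datetime(["3/14/2018", "3/27/2018", "4/25/2018"]),
    "AAmount": [-560, -35, 5000],
})
df_small = pd.DataFrame({
    "BID": [1063, 1062, 839, 423],
    "BIssue Date": pd.to_datetime(["3/6/2018", "3/6/2018", "3/22/2018", "4/24/2018"]),
    "BAmount": [-300, -260, -35, 5000],
})

# merge_asof requires both frames to be sorted on their key columns
merged = pd.merge_asof(
    df_small.sort_values("BIssue Date"),
    df_large.sort_values("AIssue Date"),
    left_on="BIssue Date",
    right_on="AIssue Date",
    direction="forward",
    tolerance=pd.Timedelta(days=7),  # assumed window, adjust to your data
)

# Rows with no large transaction inside the window get NaN in the A columns
print(merged[["BID", "AID"]])
```

With this window the two 3/6 rows stay unmatched, because 3/14 is eight days later. Keep in mind that merge_asof compares dates only here, so the amounts would still need to be checked separately.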

Answer 1 (score: 0)

In my module iterate_one_to_many, I was calculating the row count incorrectly. I needed to replace

end_row = len(dfLarge.columns[dID]) - 1

with

end_row = len(dfLarge.index)
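To illustrate why the original line cut the loop short, here is a small sketch assuming the sample column names from the question. `len()` on a column label measures the label string, not the number of rows.

```python
import pandas as pd

dfLarge = pd.DataFrame({
    "AID": [1508, 1506, 1500],
    "AIssue Date": ["3/14/2018", "3/27/2018", "4/25/2018"],
    "AAmount": [-560, -35, 5000],
})
dID = 0

# dfLarge.columns[dID] is the string "AID", so this is len("AID") - 1
wrong_end_row = len(dfLarge.columns[dID]) - 1
# The number of rows in the DataFrame
right_end_row = len(dfLarge.index)

print(wrong_end_row, right_end_row)  # prints: 2 3
```

With this sample, range(0, 2) visits only two of the three large records, which is consistent with the 5000 transaction never being matched.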