Question

I have 2 data frames. For simplicity, let's call them DF1 and DF2.

DF1 structure:

id Name address email
1   abc  add1    email1

DF2 structure:

id2 Name address2 email2
1    abc  add1    sample

In case it helps, here is my code:

import pandas as pd
from fuzzywuzzy import fuzz  # assumed source of fuzz.WRatio; thefuzz exposes the same name

def helper(row, Threshold):
    # Score the current row's name against every name in dfExactDropped
    # (a DataFrame defined at module level, outside this function).
    dfExactDropped['score'] = dfExactDropped['joined'].apply(lambda x: fuzz.WRatio(row['joined'], x))

    # Count perfect matches (score == 100) and near matches (Threshold <= score < 100).
    numPerfectMatch = int((dfExactDropped['score'] == 100).sum())
    numThresholdMatch = int(((dfExactDropped['score'] >= Threshold) & (dfExactDropped['score'] < 100)).sum())

    # P1 and P2 cap the counts at 2: 0 = no match, 1 = exactly one, 2 = more than one.
    P1 = min(numPerfectMatch, 2)
    P2 = min(numThresholdMatch, 2)
    return pd.Series([numPerfectMatch, numThresholdMatch, P1, P2])

def fwNameMatcher2(baseData, Matchset, Threshold, rows=100):
    # Matchset and rows are not used in this excerpt.
    baseData[['numPerfectMatch', 'numThresholdMatch', 'P1', 'P2']] = baseData.apply(lambda row: helper(row, Threshold), axis=1)

The joined column holds the name.

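I want to do fuzzy matching so that each name in DF1 is matched against every row in DF2, with the score stored in a column created on DF2. I then want to run some operations on those scores and add the resulting values back to the current row of DF1.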
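To make the intended per-row operation concrete, here is a minimal sketch on toy data. It uses only the standard library: the score function is a stand-in built on difflib.SequenceMatcher, not fuzz.WRatio, and the name lists and the match_counts helper are hypothetical, chosen only for illustration.

```python
from difflib import SequenceMatcher

def score(a, b):
    # Stand-in scorer on a 0-100 scale; NOT fuzz.WRatio, only a stdlib substitute.
    return int(100 * SequenceMatcher(None, a, b).ratio())

def match_counts(name, candidates, threshold):
    """Count perfect (score == 100) and near (threshold <= score < 100) matches."""
    scores = [score(name, c) for c in candidates]
    perfect = sum(s == 100 for s in scores)
    near = sum(threshold <= s < 100 for s in scores)
    # Cap each count at 2, mirroring P1/P2: 0 = none, 1 = one, 2 = several.
    return perfect, near, min(perfect, 2), min(near, 2)

df1_names = ["abc", "xyz"]                  # hypothetical DF1 'joined' values
df2_names = ["abc", "abd", "abc", "qqq"]    # hypothetical DF2 'joined' values

# One tuple (numPerfectMatch, numThresholdMatch, P1, P2) per DF1 name.
results = {name: match_counts(name, df2_names, threshold=60) for name in df1_names}
print(results)
```

Each DF1 name triggers one scorer call per DF2 row, which is the same nested pattern as the two apply calls in the code above.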

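I was able to do this using 2 applies, but even 10 rows take 2 minutes. I have to process more than 500,000 rows in each dataset. Is there a faster way to do this using a vectorized approach?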
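A rough back-of-the-envelope estimate shows the scale of the problem. The figures come from the numbers above; that the 10-row test ran against a full-size DF2, and that the cost grows linearly with the number of DF1 rows, are assumptions.

```python
# Figures taken from the question; the full-size DF2 during the 10-row test is an assumption.
rows_df1 = 500_000
rows_df2 = 500_000
seconds_per_row = (2 * 60) / 10        # 10 rows took about 2 minutes

scorer_calls = rows_df1 * rows_df2     # one fuzz.WRatio call per (DF1 row, DF2 row) pair
total_seconds = rows_df1 * seconds_per_row
total_days = total_seconds / 86_400    # seconds in a day

print(f"{scorer_calls:.2e} scorer calls, about {total_days:.0f} days at the observed rate")
```

Under those assumptions the full run needs 2.5e11 scorer calls and roughly 69 days, so the current approach is not workable at this size.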

Thanks for your help.

Iterate over a pandas DF and use its values to run a function against all values of another DF

0 answers: