How to find the match difference between two DataFrames

Time: 2021-05-30 20:42:04

Tags: python pandas

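I am looking for the percentage difference between two DataFrames. I tried fuzzywuzzy but did not get the expected output.

Suppose I have 2 DataFrames, each with the columns shown below, and I want the match percentage between them, row by row.

Before running the code I found that some columns had dtype float64, so I changed them to dtype object. Running the code then fails with TypeError: object of type 'float' has no len()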

df1

score   id_number       company_name      company_code   Amount
200      IN2231D           AXN pvt Ltd        IN225      2566.7           
450      UK654IN        Aviva Intl Ltd        IN115      3677           
650.8    SL1432H   Ship Incorporations        CZ555      NaN            
350      LK0678G  Oppo Mobiles pvt ltd        PQ795      367.9           
590      NG5678J             Nokia Inc        RS885      867           
250      IN2231D           AXN pvt Ltd        IN215      785.65

df2

QR_score     Identity_No       comp_name      comp_code     amt           match_acc   
    200.00      IN2231D           AXN pvt Inc        IN225    2566.70             
    420.0       UK655IN        Aviva Intl Ltd        IN315    3677.00             
    350.35      SL2252H              Ship Inc        CK555    NaN              
    450.00      LK9978G  Oppo Mobiles pvt ltd        PRS95    367.9             
    590.15      NG5678J             Nokia Inc        RS885    867             
    250.0       IN5531D           AXN pvt Ltd        IN215    785.65

When I checked, df2['QR_score'] and df2['amt'] had dtype float64, so I changed them to object.

The code I am trying:

import numpy as np
import pandas as pd
from fuzzywuzzy import fuzz

df2 = df2[['QR_score','amt']].astype(str)
# Make Column Names Match
df1.columns = df2.columns
# Select string (object) columns
t1 = df1.select_dtypes(include='object')
t2 = df2.select_dtypes(include='object')
# Apply fuzz.ratio to every cell of both frames
obj_similarity = pd.DataFrame(np.vectorize(fuzz.ratio)(t1, t2), 
                              columns=t1.columns,
                              index=t1.index)
# Use non-object similarity with eq
other_similarity = df1.select_dtypes(exclude='object').eq(
    df2.select_dtypes(exclude='object')) * 100
# Merge Similarities together and take the average per row
total_similarity = pd.concat((
    obj_similarity, other_similarity
), axis=1).mean(axis=1)

df2['match_acc'] = total_similarity

The error occurs when executing the following lines:

obj_similarity = pd.DataFrame(np.vectorize(fuzz.ratio)(t1, t2), 
                              columns=t1.columns,
                              index=t1.index)

Error: TypeError: object of type 'float' has no len()

Please advise.
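A likely cause, sketched under assumptions: changing a column's dtype to object does not turn its values into strings, so a missing cell is still a float NaN, and a string scorer that calls len() on its arguments fails on it. The `ratio` helper below is a hypothetical stand-in built on the standard library's difflib, not fuzzywuzzy itself; it only mimics the empty-string check that appears to trigger the error.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Hypothetical stand-in for fuzz.ratio; like it, it expects two strings."""
    if len(a) == 0 or len(b) == 0:  # len() raises TypeError on a float
        return 0
    return round(100 * SequenceMatcher(None, a, b).ratio())


# A missing cell (NaN) is a float, even inside an object-dtype column.
missing = float("nan")

try:
    ratio(missing, missing)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)

# Converting each value to text before scoring avoids the error.
safe_score = ratio(str(missing), str(missing))
print(safe_score)
```

Note that after conversion two missing values compare as the text 'nan' against 'nan', so they count as a full match; whether that is desirable depends on the use case.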

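1 Answer:

Answer 0: (score: 1)

Stack both DataFrames, concat them side by side, and apply fuzz.ratio to each row (axis=1), converting every value to str first. Then reshape the result with unstack and finally take mean(axis=1). This relies on df1 and df2 sharing the same column names (as set in the question with df1.columns = df2.columns), because concat aligns the stacked values on their (row, column) labels.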

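# Pass axis by keyword: recent pandas versions no longer accept it positionally in concat
df2['match_acc'] = pd.concat([df1.stack(), df2.stack()], axis=1).apply(
    lambda x: fuzz.ratio(str(x[0]), str(x[1])), axis=1).unstack().mean(axis=1)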
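The one-liner above can be unpacked step by step. This is a minimal sketch on made-up two-row frames (`left` and `right` are illustrative names, not from the question), and `ratio` is a hypothetical difflib-based stand-in for fuzz.ratio so the example runs without fuzzywuzzy installed.

```python
from difflib import SequenceMatcher

import pandas as pd


def ratio(a, b):
    """Hypothetical stand-in for fuzz.ratio: 0-100 similarity of two values as text."""
    return round(100 * SequenceMatcher(None, str(a), str(b)).ratio())


# Two tiny frames with identical column names and index (made-up data).
left = pd.DataFrame({"name": ["AXN pvt Ltd", "Nokia Inc"],
                     "code": ["IN225", "RS885"]})
right = pd.DataFrame({"name": ["AXN pvt Inc", "Nokia Inc"],
                      "code": ["IN225", "RS885"]})

# 1. stack() turns each frame into a Series indexed by (row, column).
# 2. concat(axis=1) lines the two Series up side by side as columns 0 and 1.
pairs = pd.concat([left.stack(), right.stack()], axis=1)

# 3. Score each (row, column) pair of values.
scores = pairs.apply(lambda x: ratio(x[0], x[1]), axis=1)

# 4. unstack() restores the original row-by-column layout.
per_cell = scores.unstack()

# 5. The mean across columns is the match percentage of each row.
match_acc = per_cell.mean(axis=1)

print(per_cell)
print(match_acc)
```

Keeping `per_cell` around is useful for debugging, since it shows which column pulled a row's score down.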

Output:

   QR_score Identity_No             comp_name comp_code      amt  match_acc
0    200.00     IN2231D           AXN pvt Inc     IN225  2566.70      94.60
1    420.00     UK655IN        Aviva Intl Ltd     IN315  3677.00      89.20
2    350.35     SL2252H              Ship Inc     CK555      NaN      62.75
3    450.00     LK9978G  Oppo Mobiles pvt ltd     PRS95   367.90      82.20
4    590.15     NG5678J             Nokia Inc     RS885   867.00      94.60
5    250.00     IN5531D           AXN pvt Ltd     IN215   785.65      94.20
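As a sanity check on the first row, its 94.60 can be reproduced by hand. The sketch below assumes the numeric columns are floats in both frames, so 200 renders as "200.0" on both sides, and again uses a hypothetical difflib-based `ratio` in place of fuzz.ratio. Only the company name differs: "AXN pvt Ltd" and "AXN pvt Inc" share the 8-character prefix "AXN pvt ", giving 2 * 8 / 22, or about 73.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Hypothetical stand-in for fuzz.ratio (0-100 similarity of two strings)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


# Row 0 of df1 and df2 as text, column by column.
row_df1 = ["200.0", "IN2231D", "AXN pvt Ltd", "IN225", "2566.7"]
row_df2 = ["200.0", "IN2231D", "AXN pvt Inc", "IN225", "2566.7"]

cell_scores = [ratio(a, b) for a, b in zip(row_df1, row_df2)]
row_mean = sum(cell_scores) / len(cell_scores)

print(cell_scores)  # [100, 100, 73, 100, 100]
print(row_mean)     # 94.6
```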