Question

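I have 2 dataframes. One holds values pulled through an API: the current values of certain fields on a system. The other holds the actual current values of those fields. Example: the name on the system versus the name on paper. I have already merged the two on a common column, but now I am trying to compare the names in Python to see whether they approximately match and/or need to be updated. Is there a way to do this? I believe it can be done in Excel using isnumber(search(...)).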

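The comparison of the text strings should be case-insensitive and, if possible, take abbreviations into account (could I build a dictionary for that?).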

Example of how the dataframe looks and the desired result:

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg .tg-0lax{text-align:left;vertical-align:top}
</style>
<table class="tg">
  <tr>
    <th class="tg-0lax"></th>
    <th class="tg-0lax">Name on System</th>
    <th class="tg-0lax">Current Name</th>
    <th class="tg-0lax">Match</th>
  </tr>
  <tr>
    <td class="tg-0lax">1</td>
    <td class="tg-0lax">APPLE INFORMATION TECHNOLOGY</td>
    <td class="tg-0lax">Apple International Information Technology </td>
    <td class="tg-0lax">No</td>
  </tr>
  <tr>
    <td class="tg-0lax">2</td>
    <td class="tg-0lax">IBM Intl group</td>
    <td class="tg-0lax">IBM International Group</td>
    <td class="tg-0lax">YES</td>
  </tr>
</table>

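P.S. Apologies in advance if I have breached any rules or etiquette of the Stack community. I am new to this and open to learning and constructive criticism.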

Answer 1

Perhaps a good approach would be to compute a similarity score and return the match with the highest probability?

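First you will need to do some data cleaning, such as removing special characters and converting all strings to lowercase. Then search using a similarity measure: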

from difflib import SequenceMatcher

st1 = 'apple information technology'
st2 = 'apple international information technology'

# ratio() returns a float in [0, 1]; 1.0 means the strings are identical
print(SequenceMatcher(None, st1, st2).ratio())
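
The cleaning step described above could be sketched roughly as follows. This is only an illustration: the `ABBREVIATIONS` dictionary and the helper names `normalize` and `similarity` are hypothetical, and the regular expression simply treats anything other than letters, digits, and whitespace as a "special character".

```python
import re
from difflib import SequenceMatcher

# Hypothetical abbreviation map; extend it with whatever your data needs.
ABBREVIATIONS = {"intl": "international", "corp": "corporation"}

def normalize(name):
    """Lowercase, replace special characters with spaces, expand abbreviations."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", name.lower())
    # split() also collapses repeated whitespace and trims the ends
    words = [ABBREVIATIONS.get(word, word) for word in cleaned.split()]
    return " ".join(words)

def similarity(a, b):
    """Similarity in [0, 1] between the two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

print(similarity("IBM Intl group", "IBM International Group"))
print(similarity("APPLE INFORMATION TECHNOLOGY",
                 "Apple International Information Technology "))
```

With "intl" expanded, the two IBM names normalize to the same string, so their ratio is 1.0, while the Apple pair still differs by the word "international".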

Answer 2

Well, you can read more about the different string similarity metrics here: Find the similarity metric between two strings

I just want to show you how to apply it, in case you want to try a different approach using pandas and that metric.

import pandas as pd
from difflib import SequenceMatcher
df = pd.DataFrame({
    'Name on System': ['APPLE INFORMATION TECHNOLOGY', 'IBM Intl group'],
    'Current Name': ['Apple International Information Technology', 'IBM International Group'],
})

Define the metric in a function:

def similarity_ratio(row):
    # Case-insensitive similarity between the two name columns, in [0, 1]
    return SequenceMatcher(None, row['Name on System'].lower(), row['Current Name'].lower()).ratio()

df['Match'] = df.apply(similarity_ratio, axis=1)
print(df)

Output:

Current Name                                    Name on System                 Match
0   Apple International Information Technology  APPLE INFORMATION TECHNOLOGY  0.800000
1   IBM International Group                     IBM Intl group                0.756757
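
Note that the raw ratios above rank the Apple pair (0.80) higher than the IBM pair (0.76), so no single cutoff on them reproduces the No/YES result asked for in the question. One possible way around this, sketched below, is to expand abbreviations before scoring and then apply a threshold. The `THRESHOLD` value, the `clean` helper, and the single "intl" rule are all assumptions for illustration and would need tuning on real data.

```python
import pandas as pd
from difflib import SequenceMatcher

df = pd.DataFrame({
    "Name on System": ["APPLE INFORMATION TECHNOLOGY", "IBM Intl group"],
    "Current Name": ["Apple International Information Technology",
                     "IBM International Group"],
})

THRESHOLD = 0.9  # hypothetical cutoff; tune it on your own data

def clean(series):
    # Lowercase, trim, and expand one example abbreviation (an assumed rule)
    return (series.str.lower()
                  .str.strip()
                  .str.replace(r"\bintl\b", "international", regex=True))

# Score each row on the cleaned names rather than the raw ones
ratios = [SequenceMatcher(None, a, b).ratio()
          for a, b in zip(clean(df["Name on System"]), clean(df["Current Name"]))]

df["Match"] = ["Yes" if r >= THRESHOLD else "No" for r in ratios]
print(df[["Name on System", "Current Name", "Match"]])
```

Under these assumptions the Apple row is flagged "No" and the IBM row "Yes", which is the result the question describes.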

Match and compare strings in a Python DataFrame
