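I have data in a CSV file that contains IDs, their corresponding addresses, and the similarity ratio of each address against the others. I want to identify the IDs whose addresses are similar, along with their match percentage.
I have already done the text matching and computed the similarity percentage between address strings, comparing each address against every other address.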
import pandas as pd
from fuzzywuzzy import process, fuzz

pd.set_option('display.width', 1000)
pd.set_option('display.max_columns', 10)

data = pd.read_csv(r"address_details.csv", skiprows=0)
id = data['COD_CUST_ID'].values.tolist()
address = data['ADDRESS'].values.tolist()

dict_list = []
for i in range(0, len(id)):
    for add in range(0, len(address)):
        score = process.extractBests(address[add], address, limit=len(address), score_cutoff=40)
        #print(type(score))
        for sc in score:
            #print(sc)
            for scr in sc:
                print(scr)
            dict_ = {}
            dict_.update({"Cust_Id": id[i]})
            dict_.update({"Match Ratio": sc})
            dict_.update({"Search String": address[add]})
            #dict_.update({"Address List": address})
            dict_list.append(dict_)

df = pd.DataFrame(dict_list)
matches = df['Match Ratio'].tolist()
matches = [x[0][0] for x in matches]

found = []
for s in df['Search String']:
    data_list = []
    if s in matches:
        index = [i for i, x in enumerate(matches) if x == s]
        Cust_Id = list([df['Cust_Id'][i]] for i in index)
        data_list.append(s)
        data_list.append(Cust_Id)
        found.append(data_list)
print(found)

sd = df.to_csv("match_score.csv", sep=',', index=None)
Suppose the code produces this DataFrame as output:
Cust_Id  Match Ratio     Search String
1        [('ABC', 100)]  ABC
2        [('DEF', 100)]  DEF
3        [('DEF', 100)]  XYZ
4        [('ABC', 100)]  PQR
5        [('PQR', 100)]  TUV
6        [('DEF', 100)]  LMN
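I want to get the list of IDs that share the same value in the "Match Ratio" column.
Answer 0 (score: 1)
I wrote some code that builds a list pairing each "Search" string with the "Cust_Id" values whose matched string equals it.
The code is: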
import pandas as pd

def duplicates(lst, item):
    return [i for i, x in enumerate(lst) if x == item]

# Creating Data frame
data = {'Cust_Id' : ['1 ','2' , '3','4','5','6'],
        'Match Ratio' : [[('ABC', 100)],[('DEF', 100)],[('DEF', 100)], [('ABC', 100)],[('PQR', 100)],[('DEF', 100)]],
        'Search' : ['ABC','DEF','XYZ','PQR','TUV','LMN']
        }
df = pd.DataFrame(data)
print(df)

# Creating a list of 1'st value of tuple Match Ratio
matches = df['Match Ratio'].tolist()
matches = [x[0][0] for x in matches]

found = []
for s in df['Search']:
    data_list = []
    if s in matches:
        index = duplicates(matches,s)
        Cust_Id = list([df['Cust_Id'][i]] for i in index)
        data_list.append(s)
        data_list.append(Cust_Id)
        found.append(data_list)
print(found)
DataFrame output:
  Cust_Id   Match Ratio Search
0      1   [(ABC, 100)]    ABC
1       2  [(DEF, 100)]    DEF
2       3  [(DEF, 100)]    XYZ
3       4  [(ABC, 100)]    PQR
4       5  [(PQR, 100)]    TUV
5       6  [(DEF, 100)]    LMN
Output of the found list:
[['ABC', [['1 '], ['4']]], ['DEF', [['2'], ['3'], ['6']]], ['PQR', [['5']]]]
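Since the question also asks for the match percentage, here is a minimal sketch of the same grouping that keeps the score next to each ID. It uses plain Python rather than pandas, the rows are copied from the sample DataFrame, and group_ids_by_match is a hypothetical helper name, not part of any library. Note that it keys on every matched string, whereas the loop above only reports strings that also appear in the "Search" column; for the sample data the two give the same groups.

```python
from collections import defaultdict

# Rows mirror the sample DataFrame: (Cust_Id, Match Ratio, Search).
rows = [
    ("1", [("ABC", 100)], "ABC"),
    ("2", [("DEF", 100)], "DEF"),
    ("3", [("DEF", 100)], "XYZ"),
    ("4", [("ABC", 100)], "PQR"),
    ("5", [("PQR", 100)], "TUV"),
    ("6", [("DEF", 100)], "LMN"),
]

def group_ids_by_match(rows):
    """Map each matched string to the (Cust_Id, score) pairs that point at it."""
    groups = defaultdict(list)
    for cust_id, match_ratio, _search in rows:
        # "Match Ratio" is a list of (matched_string, score) tuples.
        for matched, score in match_ratio:
            groups[matched].append((cust_id, score))
    return dict(groups)

grouped = group_ids_by_match(rows)
for matched, members in grouped.items():
    print(matched, members)
```

If the data is already in a DataFrame, the same rows can be produced with df.itertuples(index=False).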
Hope you find what you are looking for :)
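For the earlier step of finding which IDs have similar addresses, a rough sketch is to compare each pair of addresses once and keep the ID pair together with its score. This is only an illustration under some assumptions: difflib.SequenceMatcher from the standard library stands in for fuzzywuzzy, so the scores need not match what process.extractBests returns; similar_pairs is a hypothetical helper; and the addresses are made up.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similar_pairs(ids, addresses, cutoff=40):
    """Return (id_a, id_b, score) for every pair of addresses scoring >= cutoff.

    score is a 0-100 integer derived from difflib's ratio (a stand-in for
    a fuzzywuzzy scorer, chosen here only to keep the sketch dependency-free).
    """
    pairs = []
    # combinations() visits each unordered pair once and never compares
    # an address with itself.
    for (id_a, addr_a), (id_b, addr_b) in combinations(zip(ids, addresses), 2):
        ratio = SequenceMatcher(None, addr_a.lower(), addr_b.lower()).ratio()
        score = round(100 * ratio)
        if score >= cutoff:
            pairs.append((id_a, id_b, score))
    return pairs

# Made-up sample data for illustration only.
ids = [1, 2, 3]
addresses = ["12 Main Street", "12 Main St", "99 Ocean Drive"]

pairs = similar_pairs(ids, addresses, cutoff=80)
print(pairs)
```

Comparing pairs once avoids scoring every address against the full list for every ID, which is what the nested loops in the question do.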