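I have a data frame with the columns Cust_Id, Match Ratio, Search String, and Match_ID. I want to iterate over all Cust_Id values and generate the matching address records for each of them. Currently, the data for the last Cust_Id is repeated for every Cust_Id.
I wrote the following script for the text matching; it also builds the data frame: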
import pandas as pd
from fuzzywuzzy import process, fuzz
pd.set_option('display.width', 1000)
pd.set_option('display.max_columns', 10)
data = pd.read_csv(r"address_details.csv", skiprows=0)
id = data['COD_CUST_ID'].values.tolist()
address = data['ADDRESS'].values.tolist()
dict_list=[]
global dict_
dict_ = {}
for i in range(0,len(id)):
    for add in range(0,len(address)):
        score=process.extractBests(address[add], address, limit=len(address), score_cutoff=70)
        for sc in score:
            dict_.update({"Cust_Id": id[i]})
            for scr in sc:
                dict_.update({"Match Ratio":[sc]})
                dict_.update({"Search String":[sc]})
            dict_list.append(dict_)
df=pd.DataFrame(dict_list)
#print(df)
str_replace = df['Search String'] = df['Search String'].apply(lambda x: x[0][0])
#print(str_replace)
matches = df['Match Ratio'].tolist()
#print(matches)
matches = [x[0][0] for x in matches]
#print(matches)
found = []
#for s in df['Search String']:
for s in str_replace:
    data_list=[]
    if s in matches:
        index=[i for i, x in enumerate(matches) if x == s]
        Cust_Id = list([df['Cust_Id'][i]] for i in index)
        data_list.append(s)
        data_list.append(Cust_Id)
        found.append(data_list)
#print(found)
new_df= pd.DataFrame({"Match_ID":found})
#df['Match Ratio'] = df['Match Ratio'].apply(lambda x: x[0][1])
new_df['Match_ID'] = new_df['Match_ID'].apply(lambda x:x[1])
dataf=df.join(new_df)
print(dataf)
sd=dataf.to_csv("match_score.csv",sep=',',index=None)
Currently I get this output:
Cust_Id  Match Ratio      Search String  Match_ID
6        [["abc",100]]    "abc"          [[6,6,6,6,6,6]]
6        [["abc",100]]    "abc"          [[6,6,6,6,6,6]]
6        [["abc",100]]    "abc"          [[6,6,6,6,6,6]]
6        [["abc",100]]    "abc"          [[6,6,6,6,6,6]]
6        [["abc",100]]    "abc"          [[6,6,6,6,6,6]]
6        [["abc",100]]    "abc"          [[6,6,6,6,6,6]]
6        [["abc",100]]    "abc"          [[6,6,6,6,6,6]]
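A possible explanation for every row showing the last Cust_Id is that the script appends the same dict_ object to dict_list on each pass, so all list entries point at one dictionary that holds only the latest values. The minimal sketch below reproduces that effect without pandas or fuzzywuzzy; the names shared, aliased_rows, and independent_rows are made up for illustration and do not come from the script.

```python
# Appending one dict object repeatedly: every list entry aliases the same dict,
# so the most recent update is visible in all "rows".
shared = {}
aliased_rows = []
for cust_id in [1, 2, 3]:
    shared.update({"Cust_Id": cust_id})
    aliased_rows.append(shared)  # same object appended each time

# Building a fresh dict per iteration keeps each row independent.
independent_rows = []
for cust_id in [1, 2, 3]:
    independent_rows.append({"Cust_Id": cust_id})

print(aliased_rows)      # [{'Cust_Id': 3}, {'Cust_Id': 3}, {'Cust_Id': 3}]
print(independent_rows)  # [{'Cust_Id': 1}, {'Cust_Id': 2}, {'Cust_Id': 3}]
```

If this is indeed the cause, creating a new dictionary inside the loop (or appending dict_.copy()) should give each record its own Cust_Id.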
I would like it to look like this:
Cust_Id  Match Ratio      Search String  Match_ID
1        [("def",100)]    "def"          [[1,2,3]]
2        [("def",100)]    "def"          [[1,2,3]]
3        [("def",100)]    "def"          [[1,2,3]]
4        [("pqr",100)]    "pqr"          [[4]]
5        [("abc",100)]    "abc"          [[5,6]]
6        [("abc",100)]    "abc"          [[5,6]]
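This is only a sample output. Since I have 6 Cust_Id values and each Cust_Id is compared against all the others, each Cust_Id record should be printed 6 times, so 36 records in total should be output for the matched addresses.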
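To make the desired grouping concrete, here is a rough sketch of the one-row-per-Cust_Id shape shown in the sample above (not the 36-row pairwise listing). It is an illustration under assumptions: the records list is hypothetical sample data standing in for address_details.csv, and the standard-library difflib.SequenceMatcher is swapped in for the fuzzywuzzy scorer so the snippet runs on its own. It omits the Match Ratio column and stores Match_ID as a flat list rather than a nested one.

```python
from difflib import SequenceMatcher

# Hypothetical sample data: (Cust_Id, address) pairs.
records = [
    (1, "def"), (2, "def"), (3, "def"),
    (4, "pqr"), (5, "abc"), (6, "abc"),
]

def similarity(a, b):
    """Return a 0-100 similarity score (stand-in for a fuzzywuzzy scorer)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def match_rows(records, cutoff=70):
    """Build one row per customer listing the ids whose address scores >= cutoff."""
    rows = []
    for cust_id, address in records:
        matched_ids = [
            other_id
            for other_id, other_address in records
            if similarity(address, other_address) >= cutoff
        ]
        # A new dict is created for every customer, so rows do not share state.
        rows.append({
            "Cust_Id": cust_id,
            "Search String": address,
            "Match_ID": matched_ids,
        })
    return rows

rows = match_rows(records)
for row in rows:
    print(row)
```

The resulting list of dictionaries could be passed to pd.DataFrame(rows) to obtain a data frame with these three columns.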