Question

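I have a data frame with the columns Cust_Id, Match Ratio, Search String and Match_ID. I want to iterate over all Cust_Ids and generate the matching address records for each of them. Currently, the data for the last Cust_Id is repeated for every Cust_Id.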

I wrote the following script to do the text matching and build the data frame:

import pandas as pd
from fuzzywuzzy import process, fuzz

pd.set_option('display.width', 1000)
pd.set_option('display.max_columns', 10)

data = pd.read_csv(r"address_details.csv", skiprows=0)
id = data['COD_CUST_ID'].values.tolist()
address = data['ADDRESS'].values.tolist()

dict_list=[]

global dict_
dict_ = {}

for i in range(0,len(id)):
    for add in range(0,len(address)):
        score=process.extractBests(address[add], address, limit=len(address), score_cutoff=70)

        for sc in score:
            dict_.update({"Cust_Id": id[i]})
            for scr in sc:

                dict_.update({"Match Ratio":[sc]})
                dict_.update({"Search String":[sc]})
                dict_list.append(dict_)


df=pd.DataFrame(dict_list)
#print(df)

str_replace = df['Search String'] = df['Search String'].apply(lambda x: x[0][0])
#print(str_replace)



matches = df['Match Ratio'].tolist()
#print(matches)
matches = [x[0][0] for x in matches]
#print(matches)


found = []
#for s in df['Search String']:
for s in str_replace:
    data_list=[]

    if s in matches:
        index=[i for i, x in enumerate(matches) if x == s]
        Cust_Id = list([df['Cust_Id'][i]] for i in index)
        data_list.append(s)
        data_list.append(Cust_Id)
        found.append(data_list)
#print(found)

new_df= pd.DataFrame({"Match_ID":found})


#df['Match Ratio'] = df['Match Ratio'].apply(lambda x: x[0][1])
new_df['Match_ID'] = new_df['Match_ID'].apply(lambda x:x[1])


dataf=df.join(new_df)
print(dataf)

sd=dataf.to_csv("match_score.csv",sep=',',index=None)

This is the output I currently get:

Cust_Id Match Ratio     Search String   Match_ID
6       [["abc",100]]   "abc"           [[6,6,6,6,6,6]]
6       [["abc",100]]   "abc"           [[6,6,6,6,6,6]]
6       [["abc",100]]   "abc"           [[6,6,6,6,6,6]]
6       [["abc",100]]   "abc"           [[6,6,6,6,6,6]]
6       [["abc",100]]   "abc"           [[6,6,6,6,6,6]]
6       [["abc",100]]   "abc"           [[6,6,6,6,6,6]]
6       [["abc",100]]   "abc"           [[6,6,6,6,6,6]]
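The repetition above is consistent with one likely cause, stated here as an assumption because the input file is not shown: the loop appends the same `dict_` object on every iteration, so every entry in `dict_list` is a reference to one dictionary and reflects only the last `update`. A minimal sketch with made-up ids (not the original data) that reproduces the effect and shows one possible alternative:

```python
# Hypothetical ids standing in for the COD_CUST_ID column.
ids = [1, 2, 3]

# Pattern used in the script: one shared dict, mutated and appended repeatedly.
shared = {}
aliased_rows = []
for cust_id in ids:
    shared.update({"Cust_Id": cust_id})
    aliased_rows.append(shared)  # appends a reference, not a copy

# Alternative: build a fresh dict on every iteration.
fresh_rows = []
for cust_id in ids:
    fresh_rows.append({"Cust_Id": cust_id})

print([row["Cust_Id"] for row in aliased_rows])  # [3, 3, 3]
print([row["Cust_Id"] for row in fresh_rows])    # [1, 2, 3]
```

If this is indeed the cause, creating the dictionary inside the innermost loop (or appending `dict_.copy()`) would keep each row independent.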

I would like it to look like this:

Cust_Id Match Ratio     Search String   Match_ID
1       [("def",100)]   "def"           [[1,2,3]]
2       [("def",100)]   "def"           [[1,2,3]]
3       [("def",100)]   "def"           [[1,2,3]]
4       [("pqr",100)]   "pqr"           [[4]]
5       [("abc",100)]   "abc"           [[5,6]]
6       [("abc",100)]   "abc"           [[5,6]]

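This is only a sample output. Since I have 6 Cust_Ids and each Cust_Id is matched against every other Cust_Id, each Cust_Id record should be printed 6 times, so 36 records in total should be output for the matched addresses.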
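The desired table can be sketched as follows. This is only an illustration under stated assumptions: the sample ids and addresses are made up to mirror the table above, and `difflib.SequenceMatcher` from the standard library stands in for the fuzzywuzzy scorer so the example runs without extra packages. The `similarity` helper and the `customers` list are hypothetical names, not part of the original script.

```python
from difflib import SequenceMatcher

# Hypothetical sample data mirroring the desired output table.
customers = [(1, "def"), (2, "def"), (3, "def"), (4, "pqr"), (5, "abc"), (6, "abc")]
SCORE_CUTOFF = 70  # same cutoff value as in the script


def similarity(a, b):
    """Return a 0-100 similarity score (difflib stand-in, not fuzzywuzzy's scorer)."""
    return round(SequenceMatcher(None, a, b).ratio() * 100)


rows = []
for cust_id, address in customers:
    # Collect the ids of every customer whose address scores at or above the cutoff.
    match_ids = [
        other_id
        for other_id, other_address in customers
        if similarity(address, other_address) >= SCORE_CUTOFF
    ]
    # A new dict per customer, so no row overwrites another.
    rows.append({"Cust_Id": cust_id, "Search String": address, "Match_ID": match_ids})

for row in rows:
    print(row)
```

This produces one row per customer. Passing `rows` to `pd.DataFrame` would give a data frame with these columns; emitting one row per (customer, candidate) pair instead would yield the 36 records mentioned above.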

Iterating over a data frame

0 answers: