我有两个相同的列表。我想从列表1中获取第一个元素,并将其与列表2中的每个元素进行比较,完成后,我想从列表1中获取第二个元素,并重复进行直到直到两个列表中的每个元素都相互比较为止。
我创建了一个Levenshtein距离模型,并且能够成功地在我的第二个列表中循环1个字符串(我对此进行了硬编码)。但是,我将需要使其更加实用,并将目标字符串作为列表,并在将前一个元素与第二个列表进行比较之后将其切换到下一个元素。然后,我只希望它返回大于特定阈值ex的值。 80.00
my_list = address['Street'].tolist()
my_list
# Import numpy to perform the matrix algebra necessary to calculate the fuzzy match
import numpy as np
# Define a function that will become the fuzzy match
# I decided to use Levenshtein Distance due to the formulas ability to handle string comparisons of two unique lengths
def string_match(seq1, seq2, ratio_calc = False):
""" levenshtein_ratio_and_distance:
Calculates levenshtein distance between two strings.
If ratio_calc = True, the function computes the
levenshtein distance ratio of similarity between two strings
For all i and j, distance[i,j] will contain the Levenshtein
distance between the first i characters of seq1 and the
first j characters of seq2
"""
# Initialize matrix of zeros
rows = len(seq1)+1
cols = len(seq2)+1
distance = np.zeros((rows,cols),dtype = int)
# Populate matrix of zeros with the indeces of each character of both strings
for i in range(1, rows):
for k in range(1,cols):
distance[i][0] = i
distance[0][k] = k
# loop through the matrix to compute the cost of deletions,insertions and/or substitutions
for col in range(1, cols):
for row in range(1, rows):
if seq1[row-1] == seq2[col-1]:
cost = 0 # If the characters are the same in the two strings in a given position [i,j] then the cost is 0
else:
# In order to align the results with those of the Python Levenshtein package, if we choose to calculate the ratio
# the cost of a substitution is 2. If we calculate just distance, then the cost of a substitution is 1.
if ratio_calc == True:
cost = 2
else:
cost = 1
distance[row][col] = min(distance[row-1][col] + 1, # Cost of deletions
distance[row][col-1] + 1, # Cost of insertions
distance[row-1][col-1] + cost) # Cost of substitutions
if ratio_calc == True:
# Computation of the Levenshtein Distance Ratio
Ratio = round(((len(seq1)+len(seq2)) - distance[row][col]) / (len(seq1)+len(seq2)) * 100, 2)
return Ratio
else:
# print(distance) # Uncomment if you want to see the matrix showing how the algorithm computes the cost of deletions,
# insertions and/or substitutions
# This is the minimum number of edits needed to convert seq1 to seq2
return distance[row][col]
Prev_addrs = my_list
target_addr = "830 Amsterdam ave"
for addr in Prev_addrs:
distance = string_match(target_addr, addr, ratio_calc = True)
print(distance)
答案 0 :(得分:1)
忽略我认为您的问题中所有不相关的代码,这是从标题和第一段中如何完成我认为是您问题本质的方法。
import itertools
from pprint import pprint
def compare(a, b):
print('compare({}, {}) called'.format(a, b))
list1 = list('ABCD')
list2 = list('EFGH')
for a, b in itertools.product(list1, list2):
compare(a, b)
输出:
compare(A, E) called
compare(A, F) called
compare(A, G) called
compare(A, H) called
compare(B, E) called
compare(B, F) called
compare(B, G) called
compare(B, H) called
compare(C, E) called
compare(C, F) called
compare(C, G) called
compare(C, H) called
compare(D, E) called
compare(D, F) called
compare(D, G) called
compare(D, H) called