How to find the similarity between strings in lists in Python

Time: 2019-05-07 10:51:50

Tags: python string matching similarity

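I am comparing two data frame columns in Python, with the goal of finding, for each element of the first column, its best match in the second column. The first column contains 19,000 rows, and I need to check what the best match in the second column is for each of its strings. So 19,000 rows have to be checked 19,000 times each, keeping in mind that the match must be a different string, not the identical one.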

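I started with a simple comparison, looking up a single string in a list, and that worked. I then applied the same call to a list of strings, but since that passes a list where a string is expected, I got the error "TypeError: expected string or bytes-like object". Finally, I tried writing a loop, but the error was the same. Is there a way to build a list of the expected results? Maybe there is a better way using another library, but so far I have not found anything. This is the current code: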

#simple example
from fuzzywuzzy import process
string = "appl"
compare = ["adfad.","apple","asple","tab"]
Ratios = process.extract(string,compare)
print(Ratios)
[('apple', 89), ('asple', 67), ('tab', 29), ('adfad.', 22)]

highest = process.extractOne(string,compare)
print(highest)
('apple', 89)

#data frame
from fuzzywuzzy import process
dataframecolumn = ["appl","tb"]
compare = ["adfad.","apple","asple","tab"]
Ratios = process.extract(dataframecolumn,compare)
TypeError: expected string or bytes-like object

#expected (but I need a list)
highest = process.extractOne(dataframecolumn[0],compare)
print(highest)
('apple', 89)
highest = process.extractOne(dataframecolumn[1],compare)
print(highest)
('tab', 80)

#Result expected
results = ["apple, 89","tab, 80"]

#Error
myl = ["appl","tb"]
compare = ["adfad.","apple","asple","tab"]
results = []
for x in myl:
    results.append(process.extractOne(myl,compare)[1])
TypeError: expected string or bytes-like object

1 Answer:

Answer 0 (score: 1)

from operator import itemgetter

from fuzzywuzzy import process

dataframecolumn = ["appl", "tb"]
compare = ["adfad.", "apple", "asple", "tab"]
# Score every candidate for each query string, then keep the highest-scoring pair
Ratios = [process.extract(x, compare) for x in dataframecolumn]
print([max(ratios, key=itemgetter(1)) for ratios in Ratios])

# Or as a one-liner
# Ratios = [max(process.extract(x, compare), key=itemgetter(1)) for x in dataframecolumn]

If extract always returns its results sorted by score in descending order, we can skip the call to max and take the first element:

Ratios = [process.extract(x, compare)[0] for x in dataframecolumn]

Output:

[('apple', 89), ('tab', 80)]
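
The question asked for a list of strings such as "apple, 89" rather than tuples. A minimal sketch of that conversion follows, assuming the tuples have the (match, score) shape shown in the output above; the name `matches` is only illustrative and the values are copied from that output.

```python
# Best matches as (string, score) tuples, copied from the output shown above.
matches = [("apple", 89), ("tab", 80)]

# Format each tuple as "match, score", the shape the question asked for.
results = [f"{name}, {score}" for name, score in matches]
print(results)
```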

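If you want to skip exact matches and keep only fuzzy ones, ignore any match with a score of 100 and take the first match below 100, since the results are already sorted.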

dataframecolumn = ["apple","tb"]
compare = ["adfad","apple","asple","tab"]
Ratios = [process.extract(x,compare) for x in dataframecolumn]
result = list()
for ratio in Ratios:
    for match in ratio:
        if match[1] != 100:
            result.append(match)
            break
print (result)
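
The question also asked whether another library could do this. As a rough sketch only, the same "best match that is not the string itself" idea can be written with the standard library's difflib instead of fuzzywuzzy. Note that this swaps in a different scorer (SequenceMatcher.ratio(), scaled to 0-100), so the scores will not generally equal fuzzywuzzy's. The helper name `best_fuzzy_match` is hypothetical, and it skips candidates by comparing the strings directly rather than by checking for a score of 100.

```python
from difflib import SequenceMatcher


def best_fuzzy_match(query, choices):
    """Return (choice, score) for the most similar choice that differs from query.

    The score is SequenceMatcher.ratio() scaled to 0-100 and rounded.
    Returns None when every candidate is identical to the query.
    """
    best = None
    for choice in choices:
        if choice == query:
            continue  # skip the string itself, only fuzzy matches are wanted
        score = round(100 * SequenceMatcher(None, query, choice).ratio())
        if best is None or score > best[1]:
            best = (choice, score)
    return best


dataframecolumn = ["apple", "tb"]
compare = ["adfad", "apple", "asple", "tab"]
result = [best_fuzzy_match(x, compare) for x in dataframecolumn]
print(result)
```

Like the loop above, this compares every query against every candidate, so for two columns of 19,000 rows it still performs on the order of 19,000 × 19,000 comparisons.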