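I am comparing two data frame columns in Python, with the goal of finding, for each element of the first column, its best match in the second column. The first column contains 19,000 rows, and for each of its strings I need to find the best match in the second column. So each of the 19,000 rows has to be checked 19,000 times, with the constraint that the match must be a different string, not the string itself.
I started with a simple comparison, looking up a single string in a list, and that worked. I then applied the same call to a list in order to compare the two lists, but since that compares a list with a list rather than a string with a list, it raises the error "TypeError: expected string or bytes-like object". Finally, I tried writing a loop, but got the same error. Is there a way to build a list with the expected results? Maybe there is a better way using another library, but I have not found anything so far. Here is the current code: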
#simple example
from fuzzywuzzy import process
string = "appl"
compare = ["adfad.","apple","asple","tab"]
Ratios = process.extract(string,compare)
print(Ratios)
[('apple', 89), ('asple', 67), ('tab', 29), ('adfad.', 22)]
highest = process.extractOne(string,compare)
print(highest)
('apple', 89)
#data frame
from fuzzywuzzy import process
dataframecolumn = ["appl","tb"]
compare = ["adfad.","apple","asple","tab"]
Ratios = process.extract(dataframecolumn,compare)
TypeError: expected string or bytes-like object
#expected (but I need a list)
highest = process.extractOne(dataframecolumn[0],compare)
print(highest)
('apple', 89)
highest = process.extractOne(dataframecolumn[1],compare)
print(highest)
('tab', 80)
#Result expected
results = ["apple, 89","tab, 80"]
#Error
myl = ["appl","tb"]
compare = ["adfad.","apple","asple","tab"]
results = []
for x in myl:
    results.append(process.extractOne(myl,compare)[1])
TypeError: expected string or bytes-like object
Answer 0 (score: 1)
from operator import itemgetter
from fuzzywuzzy import process
dataframecolumn = ["appl","tb"]
compare = ["adfad.","apple","asple","tab"]
Ratios = [process.extract(x,compare) for x in dataframecolumn]
print ([max(ratios, key = itemgetter(1)) for ratios in Ratios])
# Or oneliner
#Ratios = [max(process.extract(x,compare),key = itemgetter(1)) for x in dataframecolumn]
If extract always returns its results sorted by score, we can avoid calling max:
Ratios = [process.extract(x, compare)[0] for x in dataframecolumn]
Output:
[('apple', 89), ('tab', 80)]
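If you want to skip exact matches and keep only fuzzy matches, skip any match with a score of 100 and take the first match below 100, since the results are already sorted.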
dataframecolumn = ["apple","tb"]
compare = ["adfad","apple","asple","tab"]
Ratios = [process.extract(x,compare) for x in dataframecolumn]
result = list()
for ratio in Ratios:
    for match in ratio:
        if match[1] != 100:
            result.append(match)
            break
print (result)
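The same "best match that is not the string itself" idea can also be sketched with only the standard library. This is a rough illustration, not the fuzzywuzzy approach above: it swaps in difflib.SequenceMatcher as the scorer, so the scores will generally differ from fuzzywuzzy's, and the helper name best_other_match is made up for this example. Unlike the score != 100 filter, it skips a candidate only when it is literally the same string as the query.

```python
from difflib import SequenceMatcher


def best_other_match(query, choices):
    """Return (choice, score) for the most similar choice that differs from query.

    Hypothetical helper: the score is difflib's similarity ratio scaled to 0-100,
    which is not the same scorer fuzzywuzzy uses.
    """
    best = None
    for choice in choices:
        if choice == query:
            continue  # skip the string itself
        score = round(100 * SequenceMatcher(None, query, choice).ratio())
        # Keep the first candidate with the strictly highest score
        if best is None or score > best[1]:
            best = (choice, score)
    return best


dataframecolumn = ["apple", "tb"]
compare = ["adfad", "apple", "asple", "tab"]

# One (match, score) tuple per element of the first column
results = [best_other_match(s, compare) for s in dataframecolumn]
print(results)
```

With real data frame columns you would pass the column values (for example, converted to plain lists) instead of the hard-coded lists. Note that this still does one comparison per pair, so 19,000 x 19,000 strings will take a while.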