How to group the values in an array that fuzzy-match each other at 80% or more
combined_list = ['magic', 'simple power', 'matrix', 'simple aa', 'madness', 'magics', 'mgcsa', 'simple pws', 'seek', 'dour', 'softy']
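should yield:
['magic', 'magics'], ['simple pws', 'simple aa'], ['simple power'], ['matrix']
This is what I have so far, but it is quite different from my goal. It also only handles a small number of values, and I intend to run it on about 50,000 records.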
from difflib import SequenceMatcher as sm
combined_list = ['magic', 'simple power', 'matrix', 'madness', 'magics', 'mgcsa', 'simple pws', 'seek', 'sour', 'soft']
result = list()
result_group = list()
for x in combined_list:
    for name in combined_list:
        if(sm(None, x, name).ratio() >= 0.80):
            result_group.append(name)
        else:
            pass
    result.append(result_group)
    print(result)
    del result_group[:]
print(result)
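The print outside the loop shows only empty lists, while the print inside the loop contains the values I need, although even that output differs from what I want:
[['magic', 'magics']]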
[['simple power', 'simple pws'], ['simple power', 'simple pws']]
[['matrix'], ['matrix'], ['matrix']]
[['madness'], ['madness'], ['madness'], ['madness']]
[['magic', 'magics'], ['magic', 'magics'], ['magic', 'magics'], ['magic', 'magics'], ['magic', 'magics']]
[['mgcsa'], ['mgcsa'], ['mgcsa'], ['mgcsa'], ['mgcsa'], ['mgcsa']]
[['simple power', 'simple pws'], ['simple power', 'simple pws'], ['simple power', 'simple pws'], ['simple power', 'simple pws'], ['simple power', 'simple pws'], ['simple power', 'simple pws'], ['simple power', 'simple pws']]
[['seek'], ['seek'], ['seek'], ['seek'], ['seek'], ['seek'], ['seek'], ['seek']]
[['sour'], ['sour'], ['sour'], ['sour'], ['sour'], ['sour'], ['sour'], ['sour'], ['sour']]
[['soft'], ['soft'], ['soft'], ['soft'], ['soft'], ['soft'], ['soft'], ['soft'], ['soft'], ['soft']]
[['simple aa'], ['simple aa'], ['simple aa'], ['simple aa'], ['simple aa'], ['simple aa'], ['simple aa'], ['simple aa'], ['simple aa'], ['simple aa'], ['simple aa']]
[[], [], [], [], [], [], [], [], [], [], []]
Answer 0 (score: 2)
The problem is here:
result.append(result_group)
print(result)
del result_group[:]
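You append a list to result, but because lists are mutable, only a reference to it is stored. So when you change the original list (result_group), you also change what the reference in result points to, which in your case deletes all of its elements. Instead, append a copy, like this: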
result.append(result_group[:])
print(result)
del result_group[:]
Or don't delete the list elements at all, and create a new list on each iteration instead:
for x in combined_list:
    result_group = []
    for name in combined_list:
        ...
    result.append(result_group)
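To see the difference between storing a reference and storing a copy, here is a minimal sketch. The names group, aliased and copied are made up for this illustration and do not appear in the question's code.

```python
# Illustrative names only; this is not the question's code.
group = ['magic', 'magics']

aliased = []
aliased.append(group)      # stores a reference to the very same list object

copied = []
copied.append(group[:])    # stores a shallow copy taken at this moment

del group[:]               # empties the original list in place

print(aliased)  # [[]]
print(copied)   # [['magic', 'magics']]
```

Creating a fresh list on every iteration avoids the problem for the same reason: each entry in result then refers to its own list object, and nothing empties it later.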
Edit: if you want to get rid of the duplicates, try using a set instead of a list:
# result = list()
result = set([])
...
# result.append(result_group)
result.add(tuple(result_group))
A set always contains unique members. However, since lists are not hashable, you need to convert them to tuples first.
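A short sketch of why the conversion is needed, and why it removes the duplicates. The variable names groups and error are only illustrative.

```python
groups = set()
error = None

try:
    groups.add(['magic', 'magics'])     # a list is mutable, so it cannot be hashed
except TypeError as exc:
    error = str(exc)
    print(error)

# Tuples built from equal lists compare equal and hash equal,
# so the set keeps only one of them.
groups.add(tuple(['magic', 'magics']))
groups.add(tuple(['magic', 'magics']))
print(groups)  # {('magic', 'magics')}
```

This works for the loop in the question because the inner loop always walks combined_list in the same order, so the group found for 'magic' and the group found for 'magics' come out as the same tuple.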
Edit 2: putting it all together and keeping only real groups of 2 or more members:
from difflib import SequenceMatcher as sm
combined_list = ['magic', 'simple power', 'matrix', 'madness',
                 'magics', 'mgcsa', 'simple pws', 'seek', 'sour', 'soft']
# using a set ensures there are no duplicates
result = set([])
for x in combined_list:
    result_group = []
    for name in combined_list:
        if(sm(None, x, name).ratio() >= 0.80):
            result_group.append(name)
    if len(result_group) > 1:  # this gets rid of single-word groups
        result.add(tuple(result_group))
print(result)
Answer 1 (score: 1)
from difflib import SequenceMatcher as sm
combined_list = ['magic', 'simple power', 'matrix', 'madness', 'magics',
                 'mgcsa', 'simple pws', 'seek', 'sour', 'soft']
result = list()
result_group = list()
usedElements = list()
skip = False
for firstName in combined_list:
    skip = False
    # skip names that already belong to an earlier group
    for x in usedElements:
        if x == firstName:
            skip = True
    if skip == True:
        continue
    for secondName in combined_list:
        if(sm(None, firstName, secondName).ratio() >= 0.80):
            result_group.append(secondName)
            usedElements.append(secondName)
        else:
            pass
    result.append(result_group[:])
    del result_group[:]
print(result)
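I added a way to remove the duplicates: every element that has already been placed in a group is put into the usedElements list, and such elements are skipped later.
This still keeps the groups that contain only one element. If you don't want those, you can change the last part of the code to:
    if len(result_group) > 1:
        result.append(result_group[:])
    del result_group[:]
print(result)
Hope this helps.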
Answer 2 (score: 0)
from difflib import SequenceMatcher as sm
combined_list = ['magic', 'simple power', 'matrix', 'madness', 'magics', 'mgcsa', 'simple pws', 'seek', 'sour', 'soft']
combined_list.sort()
def getPairs(combined_list):
    results = list()
    grouped = set()  # names that already belong to a group
    for x in combined_list:
        result_group = list()
        if x in grouped:
            continue
        for name in combined_list:
            if sm(None, x, name).ratio() >= 0.80:
                result_group.append(name)
                grouped.add(name)
        results.append(result_group)
    return results
print(getPairs(combined_list))
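For the roughly 50,000 records mentioned in the question, every approach above calls ratio() on each pair, which is the expensive part. The sketch below is one possible way to cut that cost, not a tested solution for that data size. It reuses a single SequenceMatcher, sets the seed word once with set_seq2() (difflib caches information about the second sequence), and checks the cheap upper bounds real_quick_ratio() and quick_ratio() before calling ratio(). The function name group_similar and the choice to skip candidates that are already grouped are assumptions of this sketch. Note also that the argument order is swapped compared with sm(None, x, name), and ratio() can occasionally depend on that order.

```python
from difflib import SequenceMatcher


def group_similar(words, cutoff=0.80):
    """Greedy grouping: each unused word seeds a group of its unused matches."""
    groups = []
    used = set()
    matcher = SequenceMatcher()
    for seed in words:
        if seed in used:
            continue
        matcher.set_seq2(seed)           # cached for the whole inner loop
        group = []
        for candidate in words:
            if candidate in used:
                continue
            matcher.set_seq1(candidate)
            # The two quick ratios are upper bounds on ratio(), so a pair
            # that fails them can be rejected without the full comparison.
            if (matcher.real_quick_ratio() >= cutoff
                    and matcher.quick_ratio() >= cutoff
                    and matcher.ratio() >= cutoff):
                group.append(candidate)
                used.add(candidate)
        groups.append(group)
    return groups


combined_list = ['magic', 'simple power', 'matrix', 'madness', 'magics',
                 'mgcsa', 'simple pws', 'seek', 'sour', 'soft']
groups = group_similar(combined_list)
print(groups)
```

The number of pairs is still quadratic in the worst case, so this only lowers the cost of each comparison. For very large inputs, some form of pre-bucketing (for example by string length) would probably be needed on top of it.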