Question

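I have a list of tuples, each holding a hash and a file path. I want to find all duplicates, as well as similar items based on Hamming distance. I have a function that computes a Hamming distance score: I pass in two values and get a score back.

I'm stuck on looping through the list and finding the matching items.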

list = [('94ff39ad', '/path/to/file.jpg'), ('94ff39ad', '/path/to/file2.jpg'), ('94ff40ad', '/path/to/file3.jpg'), ('cab91acf', '/path/to/file4.jpg')]
score = hamming_dist(h1, h2)
# score_for_similar > 0.4

I need a dictionary with the original (path) as the key and a list of possible similar or duplicate paths as the value.

Like:

result = {'/path/to/file.jpg': ['/path/to/file2.jpg', '/path/to/file3.jpg'], '/path/to/file4.jpg': []}

The second key-value pair, {'/path/to/file4.jpg': []}, is not required, but it would be helpful.

Currently I loop over the list twice and compare the values with each other, but I get every match twice.
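One way to avoid the double results is to visit each unordered pair only once, for example with itertools.combinations. This is only a sketch: similar_pairs and differing_fraction are hypothetical names, the scorer is a stand-in that assumes equal-length hashes, and treating a score below 0.4 as "similar" is an assumption about how the score is oriented.

```python
from itertools import combinations


def similar_pairs(items, score, threshold=0.4):
    """Yield (path_a, path_b, score) once per unordered pair below threshold."""
    for (hash_a, path_a), (hash_b, path_b) in combinations(items, 2):
        s = score(hash_a, hash_b)
        if s < threshold:
            yield path_a, path_b, s


def differing_fraction(h1, h2):
    """Stand-in scorer: fraction of positions where the hashes differ."""
    return sum(a != b for a, b in zip(h1, h2)) / len(h1)


items = [('94ff39ad', '/path/to/file.jpg'), ('94ff39ad', '/path/to/file2.jpg'),
         ('94ff40ad', '/path/to/file3.jpg'), ('cab91acf', '/path/to/file4.jpg')]

pairs = list(similar_pairs(items, differing_fraction))
for pair in pairs:
    print(pair)
```

Because combinations never yields both (a, b) and (b, a), and never pairs an item with itself, each match shows up exactly once.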

I'd really appreciate your help.

P.S. This is what I use to compute the Hamming distance score:

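import scipy.spatial.distance

def hamming_dist(h1, h2):
    # Compare the hashes character by character.
    h1 = list(h1)
    h2 = list(h2)
    # Fraction of positions at which the two hashes differ (0.0 = identical).
    score = scipy.spatial.distance.hamming(h1, h2)
    return score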
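For reference, scipy.spatial.distance.hamming returns the proportion of positions that disagree, so a lower score means more similar. A dependency-free sketch of the same idea, with hamming_fraction as a hypothetical name and the equal-length check as an added assumption, makes the scale easier to see:

```python
def hamming_fraction(h1, h2):
    """Fraction of positions at which two equal-length strings differ."""
    if len(h1) != len(h2):
        raise ValueError("hashes must have the same length")
    return sum(a != b for a, b in zip(h1, h2)) / len(h1)


print(hamming_fraction('94ff39ad', '94ff39ad'))  # identical hashes
print(hamming_fraction('94ff39ad', '94ff40ad'))  # two of eight characters differ
print(hamming_fraction('94ff39ad', 'cab91acf'))  # every character differs
```

With the sample data, exact duplicates score 0.0 and the near-duplicate scores 0.25, so a cutoff of 0.4 would group the first three files together.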

Answer 1

import Levenshtein as leven

# This is your list
xlist = [('94ff39ad', '/path/to/file.jpg'), ('512asasd', '/somepath/to/file.jpg'), ('94ff39ad', '/path/to/file2.jpg'), ('94ff40ad', '/path/to/file3.jpg'), ('cab91acf', '/path/to/file4.jpg')]

# Here's what you'll base the difference upon:
simalarity_threshhold = 80 # 80%
results = {}

for item in xlist:
    path = item[1]
    print('Examining path:', path)

    max_similarity = None
    similarity_key = None
    for key in results:
        diff = leven.distance(path, key)
        diff_percentage = 100 / max(len(path), len(key)) * (min(len(path), len(key)) - diff)

        print('    {} vs {} = {}%'.format(path, key, int(diff_percentage)))
        if diff_percentage > simalarity_threshhold:
            if not max_similarity or diff_percentage > max_similarity:
                max_similarity = diff_percentage
                similarity_key = key

    if not max_similarity:
        results[path] = {}
    else:
        results[similarity_key][path] = max_similarity

print(results)

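If two paths are more than 80% similar according to their distance value, they are paired as potential matches. If another path is even more similar, the item is added under that path instead.

If a path is less than 80% similar to everything seen so far, it gets its own entry in the results.

Again, this is just one example of how you could solve it. There are many ways to do it, but I prefer Levenshtein because it's easy to use and quite accurate.

I left some debug code in there so you can see the way of thinking and what values get passed around. Again, it all depends on your own rule set for deciding what counts as a match.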

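Oh, and I also store the values as sub-dictionaries. That way each potential candidate keeps the score it got when it was checked. You could also save them as lists, but looking an item up in a list is much slower than in a dictionary.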
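To illustrate the sub-dictionary layout, here is a small sketch. The scores are made-up illustrative values and ranked is a hypothetical helper. It shows how keeping the score per candidate lets you sort the matches, and how to flatten the structure into plain lists afterwards:

```python
# original path -> {candidate path: similarity score in percent}
# NOTE: the scores below are illustrative only, not computed values.
results = {
    '/path/to/file.jpg': {'/path/to/file3.jpg': 75.0, '/path/to/file2.jpg': 100.0},
    '/path/to/file4.jpg': {},
}


def ranked(candidates):
    """Return candidate paths ordered from most to least similar."""
    return sorted(candidates, key=candidates.get, reverse=True)


# Flatten to {original: [candidates...]} once the scores are no longer needed.
as_lists = {original: ranked(candidates) for original, candidates in results.items()}
print(as_lists)
```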

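Second "oh": this code hasn't been regression tested, and there are bound to be some logic problems in it, especially in the diff_percentage calculation. I'm no math wizard. But you get my drift :).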
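Since the percentage calculation is the shaky part, here is one common way to normalise an edit distance, sketched without the Levenshtein package. Both function names are hypothetical, and dividing by the longer string's length is an assumption, not the formula used in the code above:

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def similarity_percent(a, b):
    """100 for identical strings, 0 when every character must change."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0  # two empty strings are treated as identical
    return 100.0 * (1 - levenshtein(a, b) / longest)


print(round(similarity_percent('/path/to/file.jpg', '/path/to/file2.jpg'), 1))
```

The edit distance can never exceed the length of the longer string, so this value always stays between 0 and 100.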

Answer 2

To document how I solved the problem, and in case it helps others, here is my code:

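# Uses hamming_dist() from the question.
test = [('94ff39ad', '/path/to/file.jpg'), ('94ff39ad', '/path/to/file2.jpg'), ('94ff40ad', '/path/to/file3.jpg'), ('cab91acf', '/path/to/file4.jpg')]
seen = {}  # hash of each original -> (original path, list of similar paths)
for file_hash, path in test:
    added = False
    for k, (original, similar) in seen.items():
        if file_hash == k or hamming_dist(file_hash, k) < .4:
            similar.append(path)
            added = True

    if not added:
        # Nothing close enough yet, so this file becomes a new original.
        seen[file_hash] = (path, [])

# Re-key the groups by the original's path.
result = {original: similar for original, similar in seen.values()}
print(result)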

>> {'/path/to/file.jpg': ['/path/to/file2.jpg', '/path/to/file3.jpg'], '/path/to/file4.jpg': []}

Loop through a list of tuples and find similar values in Python
