Question

Suppose I have a Person model:

from django.db import models

class Person(models.Model):
    name = models.CharField(max_length=256)

Instances of this model have already been populated from an existing data source, and I now want to merge people with similar names together.

Finding names similar to a given name is simple (here, "Bob"):

from django.contrib.postgres.search import TrigramSimilarity

similar_to_bob = (
    Person.objects.all()
    # Annotate each item with its unaccented similarity ranking with the given name
    .annotate(similarity=TrigramSimilarity("name__unaccent", "Bob"))
    # And filter out anything below the given threshold
    .filter(similarity__gt=0.9)
)

This generates something close to the following:

SELECT "cases_person"."id",
       "cases_person"."name",
       SIMILARITY("cases_person"."name", 'Bob') AS "similarity"
  FROM "cases_person"
 WHERE SIMILARITY("cases_person"."name", 'Bob') > 0.9
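
For intuition about what the SIMILARITY call measures, here is a rough pure-Python approximation of pg_trgm's trigram similarity. It is only a sketch based on the documented behaviour (lowercase the text, split it into alphanumeric words, pad each word with two leading spaces and one trailing space, then divide the number of shared trigrams by the number of distinct trigrams overall). The function names `trigrams` and `similarity` are made up for illustration, and the accent stripping done by `unaccent` is not reproduced.

```python
import re


def trigrams(text: str) -> set:
    """Approximate the set of trigrams pg_trgm would extract from text."""
    result = set()
    # Lowercase, then treat every run of letters/digits as one word.
    for word in re.findall(r"[^\W_]+", text.lower()):
        padded = f"  {word} "  # two leading spaces, one trailing space
        result.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return result


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


print(sorted(trigrams("Bob")))
print(similarity("Bob", "bob"))    # 1.0, identical after lowercasing
print(similarity("Bob", "Bobby"))  # 3 shared trigrams out of 7 distinct
```

With a threshold as high as 0.9, only names that are almost identical after lowercasing (and, in the real query, unaccenting) will match.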

What I actually want, though, is groups of similar people. This can be done in Python with a few modifications to the above:

from django.contrib.postgres.search import TrigramSimilarity 
from .models import Person

people = Person.objects.all()
threshold = 0.9
people_ids_to_merge = []
processed = set()
for name in people.values_list("name", flat=True):
    similar_people = (
        # We must exclude any people that have already been processed, because
        # this means they are already in people_ids_to_merge. If we didn't
        # exclude them here, we would get duplicates in people_ids_to_merge
        people.exclude(id__in=processed)
        # Annotate each item with its unaccented similarity ranking with the current name
        .annotate(similarity=TrigramSimilarity("name__unaccent", name))
        # And filter out anything below the given threshold
        .filter(similarity__gt=threshold)
    )
    num_similar_people = similar_people.count()
    if num_similar_people > 1:
        print(f"Found {num_similar_people} names similar to {name!r}")
        ids = list(similar_people.values_list("id", flat=True))
        people_ids_to_merge.append(ids)
        processed.update(ids)

print("Groups of IDs of similar people:")
print(people_ids_to_merge)

Example output:

Groups of IDs of similar people:
[[3, 8], [9, 17, 21]]

However, this obviously runs a separate query for every name. Is there a way to do this natively in PostgreSQL? Or is solving it in Python land the best approach?
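
For comparison, a minimal sketch of one Python-side alternative: fetch the (id, name) pairs in a single query, compare the names in memory, and group them with a small union-find. Everything here is an assumption for illustration. `group_similar` is a hypothetical helper, the similarity function is only a pure-Python approximation of pg_trgm (no unaccenting), and the grouping differs from the loop above in that names linked through a chain of similar names end up in the same group. It also compares every pair of names, so the cost grows quadratically with the number of rows.

```python
import re
from itertools import combinations


def trigrams(text):
    """Approximate pg_trgm: lowercase, split into words, pad, take 3-grams."""
    result = set()
    for word in re.findall(r"[^\W_]+", text.lower()):
        padded = f"  {word} "
        result.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return result


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb) if ta and tb else 0.0


def group_similar(rows, threshold=0.9):
    """Group ids whose names are connected by similarity > threshold.

    rows is an iterable of (id, name) pairs, for example the result of
    Person.objects.values_list("id", "name") fetched in one query.
    """
    rows = list(rows)
    parent = {pk: pk for pk, _ in rows}

    def find(pk):
        # Follow parent links to the group's root, shortening the path.
        while parent[pk] != pk:
            parent[pk] = parent[parent[pk]]
            pk = parent[pk]
        return pk

    for (id_a, name_a), (id_b, name_b) in combinations(rows, 2):
        if similarity(name_a, name_b) > threshold:
            parent[find(id_a)] = find(id_b)  # merge the two groups

    groups = {}
    for pk, _ in rows:
        groups.setdefault(find(pk), []).append(pk)
    # Only groups with more than one member are merge candidates.
    return [ids for ids in groups.values() if len(ids) > 1]


rows = [(3, "Bob Smith"), (8, "bob smith"), (9, "Alice Jones"), (12, "Carol")]
print(group_similar(rows))  # [[3, 8]]
```

This trades the per-name round trips for in-memory work, which may or may not be acceptable depending on how many rows the table holds.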

Grouping models by field similarity in Django / PostgreSQL

0 answers: