Grouping models by field similarity in Django / PostgreSQL

Time: 2019-01-16 20:33:15

Tags: python django postgresql postgresql-9.6 django-2.1

Suppose I have a Person model:

from django.db import models

class Person(models.Model):
    name = models.CharField(max_length=256)

Instances of this model have already been populated from an existing data source, and I now want to merge People with similar names.

Finding names similar to a given name is simple (here, "Bob"):

from django.contrib.postgres.search import TrigramSimilarity

similar_to_bob = (
    Person.objects.all()
    # Annotate each item with its unaccented similarity ranking with the given name
    .annotate(similarity=TrigramSimilarity("name__unaccent", "Bob"))
    # And filter out anything below the given threshold
    .filter(similarity__gt=0.9)
)

This generates something close to the following:

SELECT "cases_person"."id",
       "cases_person"."name",
       SIMILARITY("cases_person"."name", 'Bob') AS "similarity"
  FROM "cases_person"
 WHERE SIMILARITY("cases_person"."name", 'Bob') > 0.9
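
For intuition about what a threshold of 0.9 means, here is a rough pure-Python sketch of how pg_trgm scores two strings: each word is lowercased, padded with two leading spaces and one trailing space, split into three-character windows, and the score is the size of the shared trigram set divided by the size of the combined set. The helper names `trigrams` and `similarity` are made up for this illustration, and the sketch is only an approximation of the extension's behaviour, not its actual implementation.

```python
import re


def trigrams(text):
    """Return an approximation of the pg_trgm trigram set for a string."""
    grams = set()
    # Lowercase, then treat every run of non-alphanumeric characters as a word break.
    for word in re.findall(r"[^\W_]+", text.lower()):
        padded = f"  {word} "  # two spaces before the word, one after
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    for other in ("Bob", "bob", "Bobby"):
        print(other, round(similarity("Bob", other), 3))
```

Under this scoring, "Bob" and "Bobby" share only 3 of 7 distinct trigrams, so a threshold of 0.9 keeps little beyond near-identical names.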

What I want, however, is groups of similar people. This can be done in Python with a few modifications to the above:

from django.contrib.postgres.search import TrigramSimilarity 
from .models import Person

people = Person.objects.all()
threshold = 0.9
people_ids_to_merge = []
processed = set()
for name in people.values_list("name", flat=True):
    similar_people = (
        # We must exclude any people that have already been processed, because
        # this means they are already in people_ids_to_merge. If we didn't
        # exclude them here, we would get duplicates in people_ids_to_merge
        people.exclude(id__in=processed)
        # Annotate each item with its unaccented similarity ranking with the current name
        .annotate(similarity=TrigramSimilarity("name__unaccent", name))
        # And filter out anything below the given threshold
        .filter(similarity__gt=threshold)
    )
    num_similar_people = similar_people.count()
    if num_similar_people > 1:
        print(f"Found {num_similar_people} names similar to {name!r}")
        ids = list(similar_people.values_list("id", flat=True))
        people_ids_to_merge.append(ids)
        processed.update(ids)

print("Groups of IDs of similar people:")
print(people_ids_to_merge)

Example output:

Groups of IDs of similar people:
[[3, 8], [9, 17, 21]]

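However, this obviously runs a separate query for every name, and therefore for every grouping. Is there a way to do this natively in PostgreSQL? Or is this problem best solved in Python-land?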
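
One possible direction, offered as a sketch rather than a verified answer: fetch every similar pair with a single self-join, then merge the pairs into groups in Python. The SQL string below is an assumption. It reuses the `cases_person` table name from the generated SQL above, applies `unaccent()` to both sides, and would need the pg_trgm and unaccent extensions installed. `PAIRS_SQL` and `group_pairs` are hypothetical names. This grouping is also transitive (if A matches B and B matches C, all three end up together), which differs from the greedy loop above.

```python
from collections import defaultdict

# Hypothetical query, not executed here: one row per pair of similar people.
# Under Django it could be run with something like:
#     with connection.cursor() as cursor:
#         cursor.execute(PAIRS_SQL, [0.9])
#         pairs = cursor.fetchall()
PAIRS_SQL = """
SELECT a.id, b.id
  FROM cases_person AS a
  JOIN cases_person AS b
    ON a.id < b.id
   AND similarity(unaccent(a.name), unaccent(b.name)) > %s
"""


def group_pairs(pairs):
    """Merge (id_a, id_b) pairs into sorted groups of connected IDs (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(list)
    for member in list(parent):
        groups[find(member)].append(member)
    return sorted(sorted(group) for group in groups.values())


if __name__ == "__main__":
    print(group_pairs([(3, 8), (9, 17), (17, 21)]))
```

The self-join still compares every name against every other name, so on a large table it may need a trigram index or some pre-filtering to be practical.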

0 answers:

No answers yet