Suppose I have a Person model:

    from django.db import models


    class Person(models.Model):
        name = models.CharField(max_length=256)

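Instances of this model have already been populated from an existing data source, and I now want to merge People with similar names.
Finding names similar to a given name is easy (here, "Bob"):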

    from django.contrib.postgres.search import TrigramSimilarity

    similar_to_bob = (
        Person.objects.all()
        # Annotate each item with its unaccented similarity ranking with the given name
        .annotate(similarity=TrigramSimilarity("name__unaccent", "Bob"))
        # And filter out anything at or below the given threshold
        .filter(similarity__gt=0.9)
    )

This generates something close to the following:

    SELECT "cases_person"."id",
           "cases_person"."name",
           SIMILARITY("cases_person"."name", 'Bob') AS "similarity"
    FROM "cases_person"
    WHERE SIMILARITY("cases_person"."name", 'Bob') > 0.9

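For intuition about what SIMILARITY measures and why 0.9 is a strict threshold, here is a minimal pure-Python sketch of the trigram comparison as I understand it from the pg_trgm documentation: each word is lowercased, padded with two leading spaces and one trailing space, split into three-character windows, and the score is the number of shared trigrams divided by the number of distinct trigrams across both strings. The helper names `trigrams` and `similarity` are my own, and this is an approximation, not the extension's actual implementation.

```python
import re


def trigrams(text):
    """Return the set of trigrams for text, padding each word pg_trgm-style."""
    grams = set()
    # Non-alphanumeric characters act as word separators.
    for word in re.findall(r"[^\W_]+", text.lower()):
        padded = f"  {word} "  # two spaces before, one after
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


print(sorted(trigrams("cat")))               # ['  c', ' ca', 'at ', 'cat']
print(similarity("Bob", "bob"))              # 1.0
print(round(similarity("Bob", "Bobby"), 3))  # 0.429
```

Under this sketch, "Bob" and "Bobby" share only 3 of 7 distinct trigrams, so a threshold of 0.9 effectively matches only near-identical names (differences in case, punctuation, or, with unaccent, diacritics).
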
However, what I want is groups of similar people. This can be done in Python with a few modifications to the above:

    from django.contrib.postgres.search import TrigramSimilarity

    from .models import Person

    people = Person.objects.all()
    threshold = 0.9
    people_ids_to_merge = []
    processed = set()

    for name in people.values_list("name", flat=True):
        similar_people = (
            # We must exclude any people that have already been processed, because
            # this means they are already in people_ids_to_merge. If we didn't
            # exclude them here, we would get duplicates in people_ids_to_merge
            people.exclude(id__in=processed)
            # Annotate each item with its unaccented similarity ranking with the current name
            .annotate(similarity=TrigramSimilarity("name__unaccent", name))
            # And filter out anything at or below the given threshold
            .filter(similarity__gt=threshold)
        )
        num_similar_people = similar_people.count()
        if num_similar_people > 1:
            print(f"Found {num_similar_people} names similar to {name!r}")
            ids = list(similar_people.values_list("id", flat=True))
            people_ids_to_merge.append(ids)
            processed.update(ids)

    print("Groups of IDs of similar people:")
    print(people_ids_to_merge)

Example output:

    Groups of IDs of similar people:
    [[3, 8], [9, 17, 21]]

However, this obviously results in one query per grouping. Is there a way to do this natively in PostgreSQL? Or is solving it in Python-land the best approach?
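
For reference, the Python-land alternative I have in mind would fetch every (id, name) pair in a single query and do the grouping in memory. The sketch below is only that, a sketch: `unaccent`, `trigrams`, `similarity`, and `group_similar` are hypothetical helpers of mine, the similarity is a pure-Python approximation of pg_trgm rather than the real thing, and stripping combining marks is only a rough stand-in for PostgreSQL's unaccent. It also does O(n²) comparisons, so it trades queries for CPU time.

```python
import re
import unicodedata


def unaccent(text):
    """Rough stand-in for PostgreSQL's unaccent(): drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def trigrams(text):
    """Set of trigrams, with each word padded by two spaces before, one after."""
    grams = set()
    for word in re.findall(r"[^\W_]+", unaccent(text).lower()):
        padded = f"  {word} "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def group_similar(rows, threshold=0.9):
    """Group ids whose names are similar; rows is an iterable of (id, name).

    Mirrors the loop above: people already placed in a group are skipped,
    and only groups with more than one member are kept.
    """
    rows = list(rows)
    processed = set()
    groups = []
    for _, name in rows:
        ids = [
            pk
            for pk, other in rows
            if pk not in processed and similarity(other, name) > threshold
        ]
        if len(ids) > 1:
            groups.append(ids)
            processed.update(ids)
    return groups


# In Django this would be: rows = Person.objects.values_list("id", "name")
rows = [(1, "Bob"), (2, "Alice"), (3, "José Garcia"), (8, "Jose Garcia"), (9, "bob")]
print(group_similar(rows))  # [[1, 9], [3, 8]]
```

This keeps the database round trips down to one, but I would still prefer a solution that lets PostgreSQL do the comparison itself if one exists.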