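I'm trying to build a full-text search for address autocomplete using Django (v2.1) and Postgres (9.5), but the performance is currently not good enough for autocomplete, and I don't understand the logic behind the timings I'm getting. For reference, the table is fairly large: 14 million rows.
My model: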
from django.db import models
from postgres_copy import CopyManager
from django.contrib.postgres.indexes import GinIndex


class Addresses(models.Model):
    date_update = models.DateTimeField(auto_now=True, null=True)
    longitude = models.DecimalField(max_digits=9, decimal_places=6, null=True)
    latitude = models.DecimalField(max_digits=9, decimal_places=6, null=True)
    number = models.CharField(max_length=16, null=True, default='')
    street = models.CharField(max_length=60, null=True, default='')
    unit = models.CharField(max_length=50, null=True, default='')
    city = models.CharField(max_length=50, null=True, default='')
    district = models.CharField(max_length=10, null=True, default='')
    region = models.CharField(max_length=5, null=True, default='')
    postcode = models.CharField(max_length=5, null=True, default='')
    addr_id = models.CharField(max_length=20, unique=True)
    addr_hash = models.CharField(max_length=20, unique=True)
    objects = CopyManager()

    class Meta:
        indexes = [
            GinIndex(fields=['number', 'street', 'unit', 'city', 'region', 'postcode'], name='search_idx')
        ]
I wrote a small test to check performance against the number of words in the search text:
import time
from datetime import timedelta

from django.contrib.postgres.search import SearchVector

search_vector = SearchVector('number', 'street', 'unit', 'city', 'region', 'postcode')
searchtext1 = "north"
searchtext2 = "north bondi"
searchtext3 = "north bondi blair"
searchtext4 = "north bondi blair street 2026"
print('Test1: 1 word')
start_time = time.time()
result = AddressesAustralia.objects.annotate(search=search_vector).filter(search=searchtext1)[:10]
#print(len(result))
time_exec = str(timedelta(seconds=time.time() - start_time))
print(time_exec)
print(' ')
#print(AddressesAustralia.objects.annotate(search=search_vector).explain(verbose=True))
print('Test2: 2 words')
start_time = time.time()
result = AddressesAustralia.objects.annotate(search=search_vector).filter(search=searchtext2)[:10]
#print(len(result))
time_exec = str(timedelta(seconds=time.time() - start_time))
print(time_exec)
print(' ')
print('Test3: 3 words')
start_time = time.time()
result = AddressesAustralia.objects.annotate(search=search_vector).filter(search=searchtext3)[:10]
#print(len(result))
time_exec = str(timedelta(seconds=time.time() - start_time))
print(time_exec)
print(' ')
print('Test4: 5 words')
start_time = time.time()
result = AddressesAustralia.objects.annotate(search=search_vector).filter(search=searchtext4)[:10]
#print(len(result))
time_exec = str(timedelta(seconds=time.time() - start_time))
print(time_exec)
print(' ')
I get the following results, which look fine:
Test1: 1 word
0:00:00.001841
Test2: 2 words
0:00:00.001422
Test3: 3 words
0:00:00.001574
Test4: 5 words
0:00:00.001360
However, if I uncomment the print(len(result)) lines, I get the following results:
Test1: 1 word
10
0:00:00.046392
Test2: 2 words
10
0:00:06.544732
Test3: 3 words
10
0:01:12.367157
Test4: 5 words
10
0:01:17.786596
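This is clearly not suitable for autocomplete.
Can someone explain why it takes so much longer once I perform an operation on the queryset result? It looks as if the database retrieval is always fast but going through the results takes time, which makes no sense to me: I limit the results to 10, so the returned queryset is always the same size.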
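A note on the timings above: Django QuerySets are lazy, so building, filtering and slicing one does not touch the database, and the query only runs when the rows are actually needed (for example by len()). The first set of timings therefore most likely measures query construction only, and the second set measures the real query. The following is a toy sketch of that behaviour in plain Python; LazyQuery and slow_search are hypothetical stand-ins, not Django classes, and the 200 ms delay is an arbitrary assumption.

```python
import time


class LazyQuery:
    """Toy stand-in for a lazy queryset: no work happens until evaluation."""

    def __init__(self, run_query):
        self._run_query = run_query  # callable standing in for the database round trip
        self._cache = None

    def __getitem__(self, key):
        # Slicing only records the limit and returns another unevaluated object.
        return LazyQuery(lambda: self._run_query()[key])

    def __len__(self):
        # The first operation that needs rows triggers the (possibly slow) query.
        if self._cache is None:
            self._cache = self._run_query()
        return len(self._cache)


def slow_search():
    time.sleep(0.2)  # pretend the database needs about 200 ms
    return list(range(100))


start = time.perf_counter()
result = LazyQuery(slow_search)[:10]
build_seconds = time.perf_counter() - start  # nothing has been executed yet

start = time.perf_counter()
row_count = len(result)  # evaluation happens here
eval_seconds = time.perf_counter() - start
```

With a real queryset, wrapping the call in list(...) or len(...) inside the timed section is one way to make sure the measured time includes the database work.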
Also, although I created a GIN index, it does not seem to be used. It looks like it was created correctly:
=# \d public_data_au_addresses
                          Table "public.public_data_au_addresses"
   Column    |           Type           | Collation | Nullable |                       Default
-------------+--------------------------+-----------+----------+------------------------------------------------------
 id          | integer                  |           | not null | nextval('public_data_au_addresses_id_seq'::regclass)
 date_update | timestamp with time zone |           |          |
 longitude   | numeric(9,6)             |           |          |
 latitude    | numeric(9,6)             |           |          |
 number      | character varying(16)    |           |          |
 street      | character varying(60)    |           |          |
 unit        | character varying(50)    |           |          |
 city        | character varying(50)    |           |          |
 district    | character varying(10)    |           |          |
 region      | character varying(5)     |           |          |
 postcode    | character varying(5)     |           |          |
 addr_id     | character varying(20)    |           | not null |
 addr_hash   | character varying(20)    |           | not null |
Indexes:
    "public_data_au_addresses_pkey" PRIMARY KEY, btree (id)
    "public_data_au_addresses_addr_hash_key" UNIQUE CONSTRAINT, btree (addr_hash)
    "public_data_au_addresses_addr_id_key" UNIQUE CONSTRAINT, btree (addr_id)
    "public_data_au_addresses_addr_hash_e8c67a89_like" btree (addr_hash varchar_pattern_ops)
    "public_data_au_addresses_addr_id_9ee00c76_like" btree (addr_id varchar_pattern_ops)
    "search_idx" gin (number, street, unit, city, region, postcode)
When I run the explain() method on the query, I get:
Test1: 1 word
Limit  (cost=0.00..1110.60 rows=10 width=140)
  ->  Seq Scan on public_data_au_addresses  (cost=0.00..8081472.41 rows=72767 width=140)
        Filter: (to_tsvector((((((((((((COALESCE(number, ''::character varying))::text || ' '::text) || (COALESCE(street, ''::character varying))::text) || ' '::text) || (COALESCE(unit, ''::character varying))::text) || ' '::text) || (COALESCE(city, ''::character varying))::text) || ' '::text) || (COALESCE(region, ''::character varying))::text) || ' '::text) || (COALESCE(postcode, ''::character varying))::text)) @@ plainto_tsquery('north'::text))
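So it still shows a sequential scan instead of an index scan. Does anyone know how to fix or debug this?
Is a GIN index even effective when searching across this many fields?
Finally, does anyone know how I could improve the code to boost performance further?
Thanks! Regards
I tried creating a search vector field as Paolo suggests below, but the search still appears to be sequential and does not use the GIN index.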
class AddressesQuerySet(CopyQuerySet):

    def update_search_vector(self):
        return self.update(search_vector=SearchVector('number', 'street', 'unit', 'city', 'region', 'postcode', config='english'))


class AddressesAustralia(models.Model):
    date_update = models.DateTimeField(auto_now=True, null=True)
    longitude = models.DecimalField(max_digits=9, decimal_places=6, null=True)
    latitude = models.DecimalField(max_digits=9, decimal_places=6, null=True)
    number = models.CharField(max_length=16, null=True, default='')
    street = models.CharField(max_length=60, null=True, default='')
    unit = models.CharField(max_length=50, null=True, default='')
    city = models.CharField(max_length=50, null=True, default='')
    district = models.CharField(max_length=10, null=True, default='')
    region = models.CharField(max_length=5, null=True, default='')
    postcode = models.CharField(max_length=5, null=True, default='')
    addr_id = models.CharField(max_length=20, unique=True)
    addr_hash = models.CharField(max_length=20, unique=True)
    search_vector = SearchVectorField(null=True, editable=False)
    objects = AddressesQuerySet.as_manager()

    class Meta:
        indexes = [
            GinIndex(fields=['search_vector'], name='search_vector_idx')
        ]
I then populated the search_vector field with the update command:
AddressesAustralia.objects.update_search_vector()
Then I ran a query to test it, using the same search vector:
class Command(BaseCommand):

    def handle(self, *args, **options):
        search_vector = SearchVector('number', 'street', 'unit', 'city', 'region', 'postcode', config='english')
        searchtext1 = "north"
        print('Test1: 1 word')
        start_time = time.time()
        result = AddressesAustralia.objects.filter(search_vector=searchtext1)[:10].explain(verbose=True)
        print(len(result))
        print(result)
        time_exec = str(timedelta(seconds=time.time() - start_time))
        print(time_exec)
I get the following result, which still shows a sequential scan:
Test1: 1 word
532
Limit  (cost=0.00..120.89 rows=10 width=235)
  Output: id, date_update, longitude, latitude, number, street, unit, city, district, region, postcode, addr_id, addr_hash, search_vector
  ->  Seq Scan on public.public_data_au_addressesaustralia  (cost=0.00..5061078.91 rows=418651 width=235)
        Output: id, date_update, longitude, latitude, number, street, unit, city, district, region, postcode, addr_id, addr_hash, search_vector
        Filter: (public_data_au_addressesaustralia.search_vector @@ plainto_tsquery('north'::text))
0:00:00.075262
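I have also tried:
With and without config='english' in the search vector (both in the update and in the query)
Dropping the GIN index, recreating it, and then re-running update_search_vector
But the result is still the same. Any idea what I'm doing wrong, or how I can troubleshoot this further?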
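Answer 0 (score: 1)
As @knbk already suggested, to improve performance you should read the Full-text search Performance section of the Django documentation.
"If this approach becomes too slow, you can add a SearchVectorField to your model."
In your code, you can add a search vector field with an associated GIN index to the model, plus a new queryset method that updates that field: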
from django.contrib.postgres.indexes import GinIndex
from django.contrib.postgres.search import SearchVector, SearchVectorField
from django.db import models
from postgres_copy import CopyQuerySet


class AddressesQuerySet(CopyQuerySet):

    def update_search_vector(self):
        return self.update(search_vector=SearchVector(
            'number', 'street', 'unit', 'city', 'region', 'postcode'
        ))


class Addresses(models.Model):
    date_update = models.DateTimeField(auto_now=True, null=True)
    longitude = models.DecimalField(max_digits=9, decimal_places=6, null=True)
    latitude = models.DecimalField(max_digits=9, decimal_places=6, null=True)
    number = models.CharField(max_length=16, null=True, default='')
    street = models.CharField(max_length=60, null=True, default='')
    unit = models.CharField(max_length=50, null=True, default='')
    city = models.CharField(max_length=50, null=True, default='')
    district = models.CharField(max_length=10, null=True, default='')
    region = models.CharField(max_length=5, null=True, default='')
    postcode = models.CharField(max_length=5, null=True, default='')
    addr_id = models.CharField(max_length=20, unique=True)
    addr_hash = models.CharField(max_length=20, unique=True)
    search_vector = SearchVectorField(null=True, editable=False)
    objects = AddressesQuerySet.as_manager()

    class Meta:
        indexes = [
            GinIndex(fields=['search_vector'], name='search_vector_idx')
        ]
You can populate the new search vector field with the new queryset method:
>>> Addresses.objects.update_search_vector()
UPDATE "addresses_addresses"
SET "search_vector" = to_tsvector(
    COALESCE("addresses_addresses"."number", '') || ' ' ||
    COALESCE("addresses_addresses"."street", '') || ' ' ||
    COALESCE("addresses_addresses"."unit", '') || ' ' ||
    COALESCE("addresses_addresses"."city", '') || ' ' ||
    COALESCE("addresses_addresses"."region", '') || ' ' ||
    COALESCE("addresses_addresses"."postcode", '')
)
If you run a query and read its explain output, you can see the GIN index being used:
>>> print(Addresses.objects.filter(search_vector='north').values('id').explain(verbose=True))
EXPLAIN (VERBOSE true)
SELECT "addresses_addresses"."id"
FROM "addresses_addresses"
WHERE "addresses_addresses"."search_vector" @@ (plainto_tsquery('north')) = true [0.80ms]
Bitmap Heap Scan on public.addresses_addresses  (cost=12.25..16.52 rows=1 width=4)
  Output: id
  Recheck Cond: (addresses_addresses.search_vector @@ plainto_tsquery('north'::text))
  ->  Bitmap Index Scan on search_vector_idx  (cost=0.00..12.25 rows=1 width=0)
        Index Cond: (addresses_addresses.search_vector @@ plainto_tsquery('north'::text))
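If you want to check plans like this one from a script rather than by eye, a small parser over the text that explain() returns may help. This is only a sketch: scan_nodes and uses_sequential_scan are hypothetical helper names, the regular expression assumes EXPLAIN's default text format, and the two sample plans are abbreviated copies of the plans shown in this thread.

```python
import re

# Matches scan nodes such as "Seq Scan on t", "Bitmap Index Scan on idx",
# or "Index Scan using idx on t" in EXPLAIN's default text format.
SCAN_RE = re.compile(r"([A-Z][A-Za-z ]*Scan)(?: using (\S+))? on (\S+)")


def scan_nodes(plan_text):
    """Return (node_type, target) pairs for every scan node in the plan."""
    return [(m.group(1), m.group(3)) for m in SCAN_RE.finditer(plan_text)]


def uses_sequential_scan(plan_text):
    """True if any node in the plan is a sequential scan."""
    return any(node == "Seq Scan" for node, _ in scan_nodes(plan_text))


# Abbreviated plan from the question (sequential scan).
seq_plan = """\
Limit  (cost=0.00..120.89 rows=10 width=235)
  ->  Seq Scan on public.public_data_au_addressesaustralia  (cost=0.00..5061078.91 rows=418651 width=235)
        Filter: (public_data_au_addressesaustralia.search_vector @@ plainto_tsquery('north'::text))
"""

# Abbreviated plan from this answer (GIN index used).
gin_plan = """\
Bitmap Heap Scan on public.addresses_addresses  (cost=12.25..16.52 rows=1 width=4)
  Recheck Cond: (addresses_addresses.search_vector @@ plainto_tsquery('north'::text))
  ->  Bitmap Index Scan on search_vector_idx  (cost=0.00..12.25 rows=1 width=0)
        Index Cond: (addresses_addresses.search_vector @@ plainto_tsquery('north'::text))
"""

print(scan_nodes(seq_plan))
print(scan_nodes(gin_plan))
```

In a Django shell you could pass the string returned by queryset.explain() to uses_sequential_scan() in the same way.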
If you want to dig deeper, you can read the article I wrote on the topic:
"Full-Text Search in Django with PostgreSQL"
I also tried executing the SQL generated by the Django ORM: http://sqlfiddle.com/#!17/f9aa9/1
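Answer 1 (score: 0)
You need to create a functional index on the search vector. Right now you have an index on the underlying fields, but Postgres still has to build the search vector for every row before it can filter the results. That is why it does a sequential scan.
Django does not currently support functional indexes in Meta.indexes, so you need to create it manually, for example with a RunSQL operation.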
RunSQL(
    """
    CREATE INDEX ON public_data_au_addresses USING GIN
    (to_tsvector(...))
    """
)
The to_tsvector() expression must match the one used in your query. Be sure to read through the Postgres docs for all the details.