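As far as I can tell, if you need to run full-text search on a table in a PostgreSQL database with a large number of rows (on the order of 1.2M+), the recommended approach is to create an index on that table (in our case, a GIN index), which should allow you to run queries like this:
SELECT * FROM speech WHERE speech_tsv @@ plainto_tsquery('a text string')
Queries like this, whose results sometimes contain nothing relevant to the search string, usually take 8 to 10 seconds.
The database is deployed on a fairly large multi-core EC2 instance, so I'm wondering: is there anything else I can do to the database to help these queries run faster?
Or, given the large number of documents and the amount of text we are searching (even with the index), is this a reasonable query execution time?
Here is what the table looks like:
Table "public.speech"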
    Column     |            Type             |                      Modifiers
---------------+-----------------------------+-----------------------------------------------------
 speech_id     | integer                     | not null default nextval('speech_id_seq'::regclass)
 speechtype_id | smallint                    | not null
 title         | character varying           | not null default ''::character varying
 speechdate    | date                        | default now()
 location      | character varying           | not null default ''::character varying
 source        | character varying           | not null default ''::character varying
 speechtext    | text                        | not null
 url           | character varying           | not null default ''::character varying
 release_id    | smallint                    |
 created       | timestamp without time zone |
 modified      | timestamp without time zone |
 speech_tsv    | tsvector                    |
 key           | boolean                     |
 summary       | text                        |
 quote         | text                        |
Indexes:
    "speech_pk" PRIMARY KEY, btree (speech_id)
    "speech__release_id" btree (release_id)
    "speech__speech_tsv" gin (speech_tsv)
    "speech__speechdate" btree (speechdate)
    "speech__speechtype_id" btree (speechtype_id)
(Obviously, speech_tsv is the column that contains all of the text to be searched.)
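For anyone trying to reproduce this, here is a minimal sketch of how the index in the listing might have been defined and how the slow query could be inspected. It assumes PostgreSQL 9.5 or later (for `IF NOT EXISTS`) and a session connected to the database that holds `speech`; how `speech_tsv` is actually populated is not stated in this post, so that part is left out.

```sql
-- Equivalent definition of the GIN index shown in the listing above.
-- IF NOT EXISTS makes this a no-op when the index is already present.
CREATE INDEX IF NOT EXISTS speech__speech_tsv
    ON speech USING gin (speech_tsv);

-- Run the query from the question and report the actual plan, timings,
-- and buffer usage. A "Bitmap Index Scan on speech__speech_tsv" node
-- indicates the GIN index is being used; a "Seq Scan on speech" node
-- indicates the whole table is being read instead.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM speech
WHERE speech_tsv @@ plainto_tsquery('a text string');
```

Note that `EXPLAIN ANALYZE` actually executes the statement, so it will take roughly as long as the query itself. Its output would show whether the 8 to 10 seconds is spent in the index scan or in fetching the matching rows.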