Optimizing a query in Postgres

Time: 2016-02-15 16:11:34

Tags: sql postgresql query-optimization

I have a table (doc (varchar), sentenceId (int), bow (varchar[])). Into the bow column I import a bag of words. There are 4623 rows in total. The table (nlptable) looks like this (for vectors of 10 words):

doc    | sentenceId | bow  
-------------------------------------------
corpus |      1     | {1,0,0,1,0,0,0,1,0,1}  
corpus |      2     | {0,1,1,1,0,1,0,0,0,0}
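
For reference, a minimal sketch of the schema described above (the DDL itself is an assumption; only the column names, types, and sample rows come from the question):

CREATE TABLE nlptable (
    doc        varchar,
    sentenceid int,
    bow        varchar[]   -- '1'/'0' flags, one per frequent word
);

INSERT INTO nlptable VALUES
    ('corpus', 1, '{1,0,0,1,0,0,0,1,0,1}'),
    ('corpus', 2, '{0,1,1,1,0,1,0,0,0,0}');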

The query I run (comparing the word representations of two sentences):

select a.doc, a.sentenceid, b.doc, b.sentenceid,
       cardinality(array(select unnest(array_positions(a.bow, '1'))
                         intersect
                         select unnest(array_positions(b.bow, '1'))))::float
         / cardinality(a.bow)::float
from
  nlptable a, nlptable b
where
  a.sentenceid < b.sentenceid;

The problem is that for vectors of 10 (the 10 most frequent words) the execution time is about 3 minutes, for 100 it is about 25 minutes, and for 500 it is 80 minutes. I have to compute this for the 10,000 most frequent words, which would probably take a day. The query is far too slow and I want to optimize it, but I have no idea how. Is there an option to use some kind of boolean type instead of varchar[] to reduce the data size? I also ran EXPLAIN ANALYZE:

Nested Loop  (cost=0.00..25486703.85 rows=7124043 width=232) (actual time=2.028..1126759.206 rows=10683753 loops=1)
  Join Filter: (a.sentenceid < b.sentenceid)
  Rows Removed by Join Filter: 10688376
  ->  Seq Scan on nlpdata a  (cost=0.00..214.23 rows=4623 width=116) (actual time=0.261..9.486 rows=4623 loops=1)
  ->  Materialize  (cost=0.00..237.35 rows=4623 width=116) (actual time=0.001..3.509 rows=4623 loops=4623)
        ->  Seq Scan on nlpdata b  (cost=0.00..214.23 rows=4623 width=116) (actual time=0.207..6.968 rows=4623 loops=1)
  SubPlan 1
    ->  HashSetOp Intersect  (cost=0.00..3.52 rows=1 width=0) (actual time=0.079..0.081 rows=2 loops=10683753)
          ->  Append  (cost=0.00..3.02 rows=200 width=0) (actual time=0.017..0.062 rows=8 loops=10683753)
                ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..1.51 rows=100 width=0) (actual time=0.014..0.025 rows=4 loops=10683753)
                      ->  Result  (cost=0.00..0.51 rows=100 width=0) (actual time=0.012..0.018 rows=4 loops=10683753)
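
What the plan makes visible is that the HashSetOp Intersect subplan runs once per surviving row pair (loops=10683753), so array_positions() is recomputed millions of times. One sketch of a fix is to precompute the positions of the 1s once per sentence and intersect those directly; this assumes the intarray extension, and nlpvec is a hypothetical helper table:

CREATE EXTENSION IF NOT EXISTS intarray;

-- Compute the positions of the 1s once per row, not once per row pair.
CREATE TABLE nlpvec AS
SELECT doc,
       sentenceid,
       array_positions(bow, '1') AS ones,
       cardinality(bow)          AS dim
FROM nlptable;

-- intarray's & operator intersects int[] arrays; icount() counts elements.
SELECT a.doc, a.sentenceid, b.doc, b.sentenceid,
       icount(a.ones & b.ones)::float / a.dim AS overlap
FROM nlpvec a
JOIN nlpvec b ON a.sentenceid < b.sentenceid;

This keeps the same pairwise join but turns the per-pair work into a single array intersection instead of two unnest/INTERSECT subqueries.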

I am fairly new to Postgres, so I don't know much about optimization. I have also heard that arrays in Postgres are expensive. I therefore searched for alternative ways to store the bag of words in the table, but could not find anything other than arrays.
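
On the boolean-type idea above: PostgreSQL's bit-string types can hold exactly this kind of 0/1 vector much more compactly than varchar[]. A sketch under that assumption (nlpbits and the B'...' literals are illustrative; bit_count() exists only on PostgreSQL 14+, so a zero-stripping trick is shown instead):

CREATE TABLE nlpbits (
    doc        varchar,
    sentenceid int,
    bow        bit(10)   -- one bit per vocabulary word; use bit varying for other sizes
);

INSERT INTO nlpbits VALUES
    ('corpus', 1, B'1001000101'),
    ('corpus', 2, B'0111010000');

-- Overlap via bitwise AND; set bits are counted by deleting the zeros
-- from the text form (on PostgreSQL 14+, bit_count(a.bow & b.bow) is simpler).
SELECT a.doc, a.sentenceid, b.doc, b.sentenceid,
       length(replace((a.bow & b.bow)::text, '0', ''))::float
         / length(a.bow) AS overlap
FROM nlpbits a
JOIN nlpbits b ON a.sentenceid < b.sentenceid;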

0 Answers:

No answers yet.