I have a table (doc(varchar), sentenceId(int), bow(varchar[])). In the bow column I store a bag-of-words vector. There are 4623 rows in total. The table (nlptable) looks like this (here for vectors of length 10):
doc | sentenceId | bow
-------------------------------------------
corpus | 1 | {1,0,0,1,0,0,0,1,0,1}
corpus | 2 | {0,1,1,1,0,1,0,0,0,0}
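For reference, this is roughly how the table is defined (a sketch of the DDL based on the column types above; the real definition may differ slightly):

-- sketch of the table as described above
CREATE TABLE nlptable (
    doc        varchar,
    sentenceId int,
    bow        varchar[]   -- one '0'/'1' entry per word in the vocabulary
);

-- example rows for a 10-word vocabulary
INSERT INTO nlptable (doc, sentenceId, bow) VALUES
    ('corpus', 1, '{1,0,0,1,0,0,0,1,0,1}'),
    ('corpus', 2, '{0,1,1,1,0,1,0,0,0,0}');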
The query I run (it compares the word representations of two sentences):
select a.doc, a.sentenceid, b.doc, b.sentenceid,
cardinality(array(select unnest(array_positions(a.bow, '1')) intersect select unnest(array_positions(b.bow, '1'))))::float / cardinality(a.bow)::float
from
nlptable a, nlptable b
where
a.sentenceid < b.sentenceid;
The problem is that for vectors of length 10 (the 10 most frequent words) the execution time is about 3 minutes, for 100 it is about 25 minutes, and for 500 it is 80 minutes. I have to compute this for the 10,000 most frequent words, which would probably take a whole day. The query is far too slow and I want to optimize it, but I have no idea how. Is there a way to use some kind of boolean type instead of varchar[] to reduce the data size? (I put a sketch of what I mean after the EXPLAIN output below.) I also ran EXPLAIN ANALYZE:
Nested Loop  (cost=0.00..25486703.85 rows=7124043 width=232) (actual time=2.028..1126759.206 rows=10683753 loops=1)
  Join Filter: (a.sentenceid < b.sentenceid)
  Rows Removed by Join Filter: 10688376
  ->  Seq Scan on nlpdata a  (cost=0.00..214.23 rows=4623 width=116) (actual time=0.261..9.486 rows=4623 loops=1)
  ->  Materialize  (cost=0.00..237.35 rows=4623 width=116) (actual time=0.001..3.509 rows=4623 loops=4623)
        ->  Seq Scan on nlpdata b  (cost=0.00..214.23 rows=4623 width=116) (actual time=0.207..6.968 rows=4623 loops=1)
  SubPlan 1
    ->  HashSetOp Intersect  (cost=0.00..3.52 rows=1 width=0) (actual time=0.079..0.081 rows=2 loops=10683753)
          ->  Append  (cost=0.00..3.02 rows=200 width=0) (actual time=0.017..0.062 rows=8 loops=10683753)
                ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..1.51 rows=100 width=0) (actual time=0.014..0.025 rows=4 loops=10683753)
                      ->  Result  (cost=0.00..0.51 rows=100 width=0) (actual time=0.012..0.018 rows=4 loops=10683753)
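Regarding the boolean-type idea above, this untested sketch is roughly what I have in mind; bow_bits is a hypothetical new bit-string column, and bit_count() only exists from PostgreSQL 14 on (on older versions one could count the '1' characters in the text form of the AND result instead):

-- hypothetical conversion of the '0'/'1' varchar[] into a bit string
ALTER TABLE nlptable ADD COLUMN bow_bits varbit;
UPDATE nlptable SET bow_bits = array_to_string(bow, '')::varbit;

-- overlap via bitwise AND (all vectors have the same length,
-- so the & operator on bit strings is applicable)
SELECT a.sentenceid, b.sentenceid,
       bit_count(a.bow_bits & b.bow_bits)::float / length(a.bow_bits)
FROM   nlptable a
JOIN   nlptable b ON a.sentenceid < b.sentenceid;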
I am fairly new to Postgres, so I don't know much about query optimization. I have also heard that arrays in Postgres are heavyweight. So I searched for alternative ways of storing the bag of words in a table, but could not find anything other than arrays.
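One alternative I have been considering is a sparse representation that keeps only the positions of the 1-entries as an integer array, roughly like this (untested sketch; word_ids is a hypothetical new column):

-- hypothetical sparse column: only the indices of the words that occur
ALTER TABLE nlptable ADD COLUMN word_ids int[];
UPDATE nlptable SET word_ids = array_positions(bow, '1');

-- the INTERSECT then runs on much smaller arrays
SELECT a.sentenceid, b.sentenceid,
       cardinality(ARRAY(SELECT unnest(a.word_ids)
                         INTERSECT
                         SELECT unnest(b.word_ids)))::float
         / cardinality(a.bow)::float   -- same denominator as the original query
FROM   nlptable a
JOIN   nlptable b ON a.sentenceid < b.sentenceid;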