I have a table (doc(varchar), sentenceId(int), bow(varchar[])). In the bow column I store a bag-of-words vector. There are 4623 rows in total. The table (nlptable) looks like this (here for vectors of length 10):
doc | sentenceId | bow
-------------------------------------------
corpus | 1 | {1,0,0,1,0,0,0,1,0,1}
corpus | 2 | {0,1,1,1,0,1,0,0,0,0}
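For reference, this is roughly how the table is defined (a sketch of the DDL based on the column types above; the real definition may differ slightly):

-- sketch of the table as described above
CREATE TABLE nlptable (
    doc        varchar,
    sentenceId int,
    bow        varchar[]   -- one '0'/'1' entry per word in the vocabulary
);

-- example rows for a 10-word vocabulary
INSERT INTO nlptable (doc, sentenceId, bow) VALUES
    ('corpus', 1, '{1,0,0,1,0,0,0,1,0,1}'),
    ('corpus', 2, '{0,1,1,1,0,1,0,0,0,0}');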
The query I run (it compares the word representations of two sentences):
select a.doc, a.sentenceid, b.doc, b.sentenceid,
cardinality(array(select unnest(array_positions(a.bow, '1')) intersect select unnest(array_positions(b.bow, '1'))))::float / cardinality(a.bow)::float
from
nlptable a, nlptable b
where
a.sentenceid < b.sentenceid;
The problem is that for vectors of length 10 (the 10 most frequent words) the execution time is about 3 minutes, for 100 it is about 25 minutes, and for 500 it is 80 minutes. I have to compute this for the 10,000 most frequent words, which would probably take a whole day. The query is far too slow and I want to optimize it, but I have no idea how. Is there a way to use some kind of boolean type instead of varchar[] to reduce the data size? (I put a sketch of what I mean after the EXPLAIN output below.) I also ran EXPLAIN ANALYZE:
Nested Loop  (cost=0.00..25486703.85 rows=7124043 width=232) (actual time=2.028..1126759.206 rows=10683753 loops=1)
  Join Filter: (a.sentenceid < b.sentenceid)
  Rows Removed by Join Filter: 10688376
  ->  Seq Scan on nlpdata a  (cost=0.00..214.23 rows=4623 width=116) (actual time=0.261..9.486 rows=4623 loops=1)
  ->  Materialize  (cost=0.00..237.35 rows=4623 width=116) (actual time=0.001..3.509 rows=4623 loops=4623)
        ->  Seq Scan on nlpdata b  (cost=0.00..214.23 rows=4623 width=116) (actual time=0.207..6.968 rows=4623 loops=1)
  SubPlan 1
    ->  HashSetOp Intersect  (cost=0.00..3.52 rows=1 width=0) (actual time=0.079..0.081 rows=2 loops=10683753)
          ->  Append  (cost=0.00..3.02 rows=200 width=0) (actual time=0.017..0.062 rows=8 loops=10683753)
                ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..1.51 rows=100 width=0) (actual time=0.014..0.025 rows=4 loops=10683753)
                      ->  Result  (cost=0.00..0.51 rows=100 width=0) (actual time=0.012..0.018 rows=4 loops=10683753)
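Regarding the boolean-type idea above, this untested sketch is roughly what I have in mind; bow_bits is a hypothetical new bit-string column, and bit_count() only exists from PostgreSQL 14 on (on older versions one could count the '1' characters in the text form of the AND result instead):

-- hypothetical conversion of the '0'/'1' varchar[] into a bit string
ALTER TABLE nlptable ADD COLUMN bow_bits varbit;
UPDATE nlptable SET bow_bits = array_to_string(bow, '')::varbit;

-- overlap via bitwise AND (all vectors have the same length,
-- so the & operator on bit strings is applicable)
SELECT a.sentenceid, b.sentenceid,
       bit_count(a.bow_bits & b.bow_bits)::float / length(a.bow_bits)
FROM   nlptable a
JOIN   nlptable b ON a.sentenceid < b.sentenceid;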
I am fairly new to Postgres, so I don't know much about query optimization. I have also heard that arrays in Postgres are heavyweight. So I searched for alternative ways of storing the bag of words in a table, but could not find anything other than arrays.
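One alternative I have been considering is a sparse representation that keeps only the positions of the 1-entries as an integer array, roughly like this (untested sketch; word_ids is a hypothetical new column):

-- hypothetical sparse column: only the indices of the words that occur
ALTER TABLE nlptable ADD COLUMN word_ids int[];
UPDATE nlptable SET word_ids = array_positions(bow, '1');

-- the INTERSECT then runs on much smaller arrays
SELECT a.sentenceid, b.sentenceid,
       cardinality(ARRAY(SELECT unnest(a.word_ids)
                         INTERSECT
                         SELECT unnest(b.word_ids)))::float
         / cardinality(a.bow)::float   -- same denominator as the original query
FROM   nlptable a
JOIN   nlptable b ON a.sentenceid < b.sentenceid;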