Slow Postgres query using WHERE on many adjacent rows

Time: 2017-11-26 10:18:39

Tags: sql postgresql nlp query-performance part-of-speech

I have the following psql table. It has about 2 billion rows in total.

 id  word      lemma     pos              textid  source     
 1  Stuffing   stuff      vvg             190568  AN         
 2  her        her        appge           190568  AN         
 3  key        key        nn1             190568  AN         
 4  into       into       ii              190568  AN         
 5  the        the        at              190568  AN         
 6  lock       lock       nn1             190568  AN         
 7  she        she        appge           190568  AN         
 8  pushed     push       vvd             190568  AN         
 9  her        her        appge           190568  AN         
10  way        way        nn1             190568  AN         
11  into       into       ii              190568  AN         
12  the        the        appge           190568  AN         
13  house      house      nn1             190568  AN         
14  .                     .               190568  AN         
15  She        she        appge           190568  AN         
16  had        have       vhd             190568  AN         
17  also       also       rr              190568  AN         
18  cajoled    cajole     vvd             190568  AN         
19  her        her        appge           190568  AN         
20  way        way        nn1             190568  AN         
21  into       into       ii              190568  AN         
22  the        the        at              190568  AN         
23  home       home       nn1             190568  AN         
24  .                     .               190568  AN         
..  ...        ...        ..              ...     ..

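I would like to create the following table, which shows every "way" construction with its neighbouring words side by side, plus some data from the "source", "lemma" and "pos" columns.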

source     word   word       word       lemma      pos        word       word     word       word       word       lemma      pos        word       word       
AN         lock   she        pushed     push       vvd        her        way      into       the        house      house      nn1        .          she
AN         had    also       cajoled    cajole     vvd        her        way      into       the        home       home       nn1        .          A          
AN         tried  to         force      force      vvi        her        way      into       the        palace     palace     nn1        ,          officials  

Here is the code I am using:

copy(
SELECT   c1.source, c1.word,  c2.word, c3.word,  c4.word, c4.lemma, c4.pos, c5.word, c6.word, c7.word, c8.word, c9.word, c9.lemma, c9.pos, c10.word, c11.word

FROM 

orderedflatcorpus AS c1, orderedflatcorpus AS c2, orderedflatcorpus AS c3, orderedflatcorpus AS c4, orderedflatcorpus AS c5, orderedflatcorpus AS c6, orderedflatcorpus AS c7, orderedflatcorpus AS c8, orderedflatcorpus AS c9, orderedflatcorpus AS c10, orderedflatcorpus AS c11

WHERE

c1.word LIKE '%' AND
c2.word LIKE '%' AND
c3.word LIKE '%' AND
c4.pos LIKE 'v%' AND
c5.pos = 'appge' AND
c6.lemma = 'way' AND
c7.pos LIKE 'i%' AND
c8.word = 'the' AND
c9.pos LIKE 'n%' AND
c10.word LIKE '%' AND
c11.word LIKE '%' 

AND 

c1.id + 1 = c2.id AND c1.id + 2 = c3.id AND c1.id + 3 = c4.id AND c1.id + 4 = c5.id AND c1.id + 5 = c6.id AND c1.id + 6 = c7.id AND c1.id + 7 = c8.id AND c1.id + 8 = c9.id AND c1.id + 9 = c10.id AND c1.id + 10 = c11.id

ORDER BY c1.id
)
TO 
'/home/postgres/Results/OUTPUT.csv'
DELIMITER E'\t'
csv header;

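Running the query against the 2 billion rows takes about 9 hours (the result has about 19,000 rows).

What can I do to improve performance?

The word, pos and lemma columns already have btree indexes.

Should I stick with my code and simply use a more powerful server with more cores, a faster CPU and more RAM (mine has only 8 GB of RAM, just 2 cores and 2.8 GHz)? Or would you recommend a different, more efficient SQL query?

Thanks!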

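1 Answer:

Answer 0 (score: 0)

Let's try reformatting your query and see what we can see. The first thing to do is to change it to use ANSI-style joins so that we can clearly see what the relationships are: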

SELECT c1.source, c1.word,  c2.word, c3.word, c4.word,
       c4.lemma, c4.pos, c5.word, c6.word, c7.word,
       c8.word, c9.word, c9.lemma, c9.pos, c10.word, c11.word
  FROM orderedflatcorpus c1
  INNER JOIN orderedflatcorpus c2
    ON c2.ID = c1.ID + 1 AND
       c2.WORD LIKE '%'
  INNER JOIN orderedflatcorpus c3
    ON c3.ID = c1.ID + 2 AND
       c3.WORD LIKE '%'
  INNER JOIN orderedflatcorpus c4
    ON c4.ID = c1.ID + 3 AND
       c4.pos LIKE 'v%'
  INNER JOIN orderedflatcorpus c5
    ON c5.ID = c1.ID + 4 AND
       c5.pos = 'appge'
  INNER JOIN orderedflatcorpus c6
    ON c6.ID = c1.ID + 5 AND
       c6.lemma = 'way'
  INNER JOIN orderedflatcorpus c7
    ON c7.ID = c1.ID + 6 AND
       c7.pos LIKE 'i%'
  INNER JOIN orderedflatcorpus c8
    ON c8.ID = c1.ID + 7 AND
       c8.word = 'the'
  INNER JOIN orderedflatcorpus c9
    ON c9.ID = c1.ID + 8 AND
       c9.pos LIKE 'n%'
  INNER JOIN orderedflatcorpus c10
    ON c10.ID = c1.ID + 9 AND
       c10.WORD LIKE '%'
  INNER JOIN orderedflatcorpus c11
    ON c11.ID = c1.ID + 10 AND
       c11.WORD LIKE '%'
WHERE c1.WORD LIKE '%'
ORDER BY c1.id

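OK, first off: all those LIKEs are hurting this query. Let's eliminate as many of them as we can. A LIKE '%' condition matches every non-NULL value, so it filters nothing unless the column contains NULLs. I'm going to assume here that WORD cannot be NULL in ORDEREDFLATCORPUS, so all of the LIKE '%' conditions can be eliminated: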

SELECT c1.source, c1.word,  c2.word, c3.word, c4.word,
       c4.lemma, c4.pos, c5.word, c6.word, c7.word,
       c8.word, c9.word, c9.lemma, c9.pos, c10.word, c11.word
  FROM orderedflatcorpus c1
  INNER JOIN orderedflatcorpus c2
    ON c2.ID = c1.ID + 1
  INNER JOIN orderedflatcorpus c3
    ON c3.ID = c1.ID + 2
  INNER JOIN orderedflatcorpus c4
    ON c4.ID = c1.ID + 3 AND
       c4.pos LIKE 'v%'
  INNER JOIN orderedflatcorpus c5
    ON c5.ID = c1.ID + 4 AND
       c5.pos = 'appge'
  INNER JOIN orderedflatcorpus c6
    ON c6.ID = c1.ID + 5 AND
       c6.lemma = 'way'
  INNER JOIN orderedflatcorpus c7
    ON c7.ID = c1.ID + 6 AND
       c7.pos LIKE 'i%'
  INNER JOIN orderedflatcorpus c8
    ON c8.ID = c1.ID + 7 AND
       c8.word = 'the'
  INNER JOIN orderedflatcorpus c9
    ON c9.ID = c1.ID + 8 AND
       c9.pos LIKE 'n%'
  INNER JOIN orderedflatcorpus c10
    ON c10.ID = c1.ID + 9
  INNER JOIN orderedflatcorpus c11
    ON c11.ID = c1.ID + 10
ORDER BY c1.id

If WORD can be NULL, then you may want to use the following instead:

SELECT c1.source, c1.word,  c2.word, c3.word, c4.word,
       c4.lemma, c4.pos, c5.word, c6.word, c7.word,
       c8.word, c9.word, c9.lemma, c9.pos, c10.word, c11.word
  FROM orderedflatcorpus c1
  INNER JOIN orderedflatcorpus c2
    ON c2.ID = c1.ID + 1 AND
       c2.WORD IS NOT NULL
  INNER JOIN orderedflatcorpus c3
    ON c3.ID = c1.ID + 2 AND
       c3.WORD IS NOT NULL
  INNER JOIN orderedflatcorpus c4
    ON c4.ID = c1.ID + 3 AND
       c4.pos LIKE 'v%'
  INNER JOIN orderedflatcorpus c5
    ON c5.ID = c1.ID + 4 AND
       c5.pos = 'appge'
  INNER JOIN orderedflatcorpus c6
    ON c6.ID = c1.ID + 5 AND
       c6.lemma = 'way'
  INNER JOIN orderedflatcorpus c7
    ON c7.ID = c1.ID + 6 AND
       c7.pos LIKE 'i%'
  INNER JOIN orderedflatcorpus c8
    ON c8.ID = c1.ID + 7 AND
       c8.word = 'the'
  INNER JOIN orderedflatcorpus c9
    ON c9.ID = c1.ID + 8 AND
       c9.pos LIKE 'n%'
  INNER JOIN orderedflatcorpus c10
    ON c10.ID = c1.ID + 9 AND
       c10.WORD IS NOT NULL
  INNER JOIN orderedflatcorpus c11
    ON c11.ID = c1.ID + 10 AND
       c11.WORD IS NOT NULL
WHERE c1.WORD IS NOT NULL
ORDER BY c1.id
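
The equivalence relied on above (LIKE '%' keeps exactly the rows where the column IS NOT NULL, including empty strings) can be checked on a toy table. This is only a sketch: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL so that it runs anywhere, and the four rows are made up for illustration.

```python
import sqlite3

# Toy stand-in for orderedflatcorpus; SQLite is used only so the sketch is runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orderedflatcorpus (id INTEGER PRIMARY KEY, word TEXT)")
conn.executemany(
    "INSERT INTO orderedflatcorpus (id, word) VALUES (?, ?)",
    [(1, "her"), (2, ""), (3, None), (4, "way")],  # made-up rows: one empty, one NULL
)


def ids(where_clause):
    """Return the ids matching the given WHERE clause, in id order."""
    sql = f"SELECT id FROM orderedflatcorpus WHERE {where_clause} ORDER BY id"
    return [row[0] for row in conn.execute(sql)]


like_ids = ids("word LIKE '%'")         # drops only the NULL row
not_null_ids = ids("word IS NOT NULL")  # same rows, stated directly
all_ids = ids("1 = 1")                  # no filter at all

print(like_ids, not_null_ids, all_ids)
```

If the column is declared NOT NULL, both conditions keep every row, which is why they can simply be dropped.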

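Next, this query needs to do as much filtering as possible, as early as possible. The database's query optimizer may be able to work this out on its own, but let's give it some help by putting the joins that have equality filters first in the join list, and then adjusting the ID calculations to reflect which rows we fetch first: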

SELECT c1.source, c1.word,  c2.word, c3.word, c4.word,
       c4.lemma, c4.pos, c5.word, c6.word, c7.word,
       c8.word, c9.word, c9.lemma, c9.pos, c10.word, c11.word
  -- Start from c5 (pos = 'appge'); every other row is located by its ID offset from c5.
  FROM orderedflatcorpus c5
  INNER JOIN orderedflatcorpus c6
    ON c6.ID = c5.ID + 1 AND
       c6.lemma = 'way'
  INNER JOIN orderedflatcorpus c8
    ON c8.ID = c5.ID + 3 AND
       c8.word = 'the'
  INNER JOIN orderedflatcorpus c1
    ON c1.ID = c5.ID - 4
  INNER JOIN orderedflatcorpus c2
    ON c2.ID = c5.ID - 3
  INNER JOIN orderedflatcorpus c3
    ON c3.ID = c5.ID - 2
  INNER JOIN orderedflatcorpus c4
    ON c4.ID = c5.ID - 1 AND
       c4.pos LIKE 'v%'
  INNER JOIN orderedflatcorpus c7
    ON c7.ID = c5.ID + 2 AND
       c7.pos LIKE 'i%'
  INNER JOIN orderedflatcorpus c9
    ON c9.ID = c5.ID + 4 AND
       c9.pos LIKE 'n%'
  INNER JOIN orderedflatcorpus c10
    ON c10.ID = c5.ID + 5
  INNER JOIN orderedflatcorpus c11
    ON c11.ID = c5.ID + 6
WHERE c5.pos = 'appge'
ORDER BY c1.id

Next, we need to think about which indexes would be most useful. I think the following indexes would be worth having:

(ID)
(ID, WORD)
(ID, LEMMA)
(ID, POS)
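
Written out as statements, the list above might look like the sketch below. The index names (ofc_id and so on) are hypothetical, and the statements are run against an empty SQLite table through Python's sqlite3 module purely so the sketch is runnable; the CREATE INDEX lines themselves use syntax that PostgreSQL also accepts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Empty stand-in table with the columns shown in the question.
conn.execute(
    "CREATE TABLE orderedflatcorpus ("
    "id INTEGER, word TEXT, lemma TEXT, pos TEXT, textid INTEGER, source TEXT)"
)

# Index names are hypothetical; the column lists are the ones suggested above.
index_ddl = [
    "CREATE INDEX ofc_id ON orderedflatcorpus (id)",
    "CREATE INDEX ofc_id_word ON orderedflatcorpus (id, word)",
    "CREATE INDEX ofc_id_lemma ON orderedflatcorpus (id, lemma)",
    "CREATE INDEX ofc_id_pos ON orderedflatcorpus (id, pos)",
]
for ddl in index_ddl:
    conn.execute(ddl)

# List the indexes that now exist on the table.
created = sorted(
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'index' AND tbl_name = 'orderedflatcorpus'"
    )
)
print(created)
```

On the real table, prefixing the query with EXPLAIN (ANALYZE, BUFFERS) in psql shows which of these indexes the planner actually picks.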

Create these indexes, run this query, and see if it helps. Also, check the ID calculations; I think I got them right, but... :-)

Good luck.
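
One way to check the ID calculations is to run both forms of the query on a tiny copy of the sample data and compare the results. The sketch below does that with Python's built-in sqlite3 module standing in for PostgreSQL (an assumption, made only so the check is runnable). It loads the 24 sample rows from the question plus a made-up row 25 so that the second match has a complete right-hand context, and it selects fewer columns for brevity.

```python
import sqlite3

# (id, word, lemma, pos) from the question; textid and source are omitted here.
ROWS = [
    (1, "Stuffing", "stuff", "vvg"), (2, "her", "her", "appge"),
    (3, "key", "key", "nn1"), (4, "into", "into", "ii"),
    (5, "the", "the", "at"), (6, "lock", "lock", "nn1"),
    (7, "she", "she", "appge"), (8, "pushed", "push", "vvd"),
    (9, "her", "her", "appge"), (10, "way", "way", "nn1"),
    (11, "into", "into", "ii"), (12, "the", "the", "appge"),
    (13, "house", "house", "nn1"), (14, ".", "", "."),
    (15, "She", "she", "appge"), (16, "had", "have", "vhd"),
    (17, "also", "also", "rr"), (18, "cajoled", "cajole", "vvd"),
    (19, "her", "her", "appge"), (20, "way", "way", "nn1"),
    (21, "into", "into", "ii"), (22, "the", "the", "at"),
    (23, "home", "home", "nn1"), (24, ".", "", "."),
    (25, "A", "a", "at1"),  # made-up filler row, not from the question
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orderedflatcorpus "
    "(id INTEGER PRIMARY KEY, word TEXT, lemma TEXT, pos TEXT)"
)
conn.executemany("INSERT INTO orderedflatcorpus VALUES (?, ?, ?, ?)", ROWS)

# Offsets counted from c1, as in the original query.
ANCHORED_ON_C1 = """
SELECT c1.id, c4.word, c6.word, c9.word
  FROM orderedflatcorpus c1
  JOIN orderedflatcorpus c2  ON c2.id  = c1.id + 1
  JOIN orderedflatcorpus c3  ON c3.id  = c1.id + 2
  JOIN orderedflatcorpus c4  ON c4.id  = c1.id + 3 AND c4.pos LIKE 'v%'
  JOIN orderedflatcorpus c5  ON c5.id  = c1.id + 4 AND c5.pos = 'appge'
  JOIN orderedflatcorpus c6  ON c6.id  = c1.id + 5 AND c6.lemma = 'way'
  JOIN orderedflatcorpus c7  ON c7.id  = c1.id + 6 AND c7.pos LIKE 'i%'
  JOIN orderedflatcorpus c8  ON c8.id  = c1.id + 7 AND c8.word = 'the'
  JOIN orderedflatcorpus c9  ON c9.id  = c1.id + 8 AND c9.pos LIKE 'n%'
  JOIN orderedflatcorpus c10 ON c10.id = c1.id + 9
  JOIN orderedflatcorpus c11 ON c11.id = c1.id + 10
 ORDER BY c1.id
"""

# Offsets counted from c5, as in the reordered query.
ANCHORED_ON_C5 = """
SELECT c1.id, c4.word, c6.word, c9.word
  FROM orderedflatcorpus c5
  JOIN orderedflatcorpus c6  ON c6.id  = c5.id + 1 AND c6.lemma = 'way'
  JOIN orderedflatcorpus c8  ON c8.id  = c5.id + 3 AND c8.word = 'the'
  JOIN orderedflatcorpus c1  ON c1.id  = c5.id - 4
  JOIN orderedflatcorpus c2  ON c2.id  = c5.id - 3
  JOIN orderedflatcorpus c3  ON c3.id  = c5.id - 2
  JOIN orderedflatcorpus c4  ON c4.id  = c5.id - 1 AND c4.pos LIKE 'v%'
  JOIN orderedflatcorpus c7  ON c7.id  = c5.id + 2 AND c7.pos LIKE 'i%'
  JOIN orderedflatcorpus c9  ON c9.id  = c5.id + 4 AND c9.pos LIKE 'n%'
  JOIN orderedflatcorpus c10 ON c10.id = c5.id + 5
  JOIN orderedflatcorpus c11 ON c11.id = c5.id + 6
 WHERE c5.pos = 'appge'
 ORDER BY c1.id
"""

original_rows = conn.execute(ANCHORED_ON_C1).fetchall()
reordered_rows = conn.execute(ANCHORED_ON_C5).fetchall()
print(original_rows == reordered_rows, reordered_rows)
```

If the two result lists differ on the sample, one of the offsets is off by one. Agreement on a toy sample is of course only a sanity check, not a proof for the full table.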