I have the following PostgreSQL table. It has about 2 billion rows in total.
id word lemma pos textid source
1 Stuffing stuff vvg 190568 AN
2 her her appge 190568 AN
3 key key nn1 190568 AN
4 into into ii 190568 AN
5 the the at 190568 AN
6 lock lock nn1 190568 AN
7 she she appge 190568 AN
8 pushed push vvd 190568 AN
9 her her appge 190568 AN
10 way way nn1 190568 AN
11 into into ii 190568 AN
12 the the appge 190568 AN
13 house house nn1 190568 AN
14 . . 190568 AN
15 She she appge 190568 AN
16 had have vhd 190568 AN
17 also also rr 190568 AN
18 cajoled cajole vvd 190568 AN
19 her her appge 190568 AN
20 way way nn1 190568 AN
21 into into ii 190568 AN
22 the the at 190568 AN
23 home home nn1 190568 AN
24 . . 190568 AN
.. ... ... .. ... ..
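I want to create the following table, which shows every occurrence of the lemma "way" with its neighbouring words side by side, plus some data from the "source", "lemma" and "pos" columns.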
source word word word lemma pos word word word word word lemma pos word word
AN lock she pushed push vvd her way into the house house nn1 . she
AN had also cajoled cajole vvd her way into the home home nn1 . A
AN tried to force force vvi her way into the palace palace nn1 , officials
Here is the code I am using:
copy(
SELECT c1.source, c1.word, c2.word, c3.word, c4.word, c4.lemma, c4.pos, c5.word, c6.word, c7.word, c8.word, c9.word, c9.lemma, c9.pos, c10.word, c11.word
FROM
orderedflatcorpus AS c1, orderedflatcorpus AS c2, orderedflatcorpus AS c3, orderedflatcorpus AS c4, orderedflatcorpus AS c5, orderedflatcorpus AS c6, orderedflatcorpus AS c7, orderedflatcorpus AS c8, orderedflatcorpus AS c9, orderedflatcorpus AS c10, orderedflatcorpus AS c11
WHERE
c1.word LIKE '%' AND
c2.word LIKE '%' AND
c3.word LIKE '%' AND
c4.pos LIKE 'v%' AND
c5.pos = 'appge' AND
c6.lemma = 'way' AND
c7.pos LIKE 'i%' AND
c8.word = 'the' AND
c9.pos LIKE 'n%' AND
c10.word LIKE '%' AND
c11.word LIKE '%'
AND
c1.id + 1 = c2.id AND c1.id + 2 = c3.id AND c1.id + 3 = c4.id AND c1.id + 4 = c5.id AND c1.id + 5 = c6.id AND c1.id + 6 = c7.id AND c1.id + 7 = c8.id AND c1.id + 8 = c9.id AND c1.id + 9 = c10.id AND c1.id + 10 = c11.id
ORDER BY c1.id
)
TO
'/home/postgres/Results/OUTPUT.csv'
DELIMITER E'\t'
csv header;
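The query takes about 9 hours to run against the 2 billion rows (the result has about 19,000 rows).
What can I do to improve performance?
The word, pos and lemma columns already have btree indexes.
Should I stick with my code and simply use a more powerful server with more cores, a faster CPU and more RAM (mine has only 8 GB of RAM and 2 cores at 2.8 GHz)? Or would you recommend a different, more efficient SQL query?
Thanks!
Answer 0 (score: 0)
Let's reformat your query and see what we can find. The first thing to do is to change it to ANSI-style joins, so that the relationships between the table aliases are clearly visible: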
SELECT c1.source, c1.word, c2.word, c3.word, c4.word,
c4.lemma, c4.pos, c5.word, c6.word, c7.word,
c8.word, c9.word, c9.lemma, c9.pos, c10.word, c11.word
FROM orderedflatcorpus c1
INNER JOIN orderedflatcorpus c2
ON c2.ID = c1.ID + 1 AND
c2.WORD LIKE '%'
INNER JOIN orderedflatcorpus c3
ON c3.ID = c1.ID + 2 AND
c3.WORD LIKE '%'
INNER JOIN orderedflatcorpus c4
ON c4.ID = c1.ID + 3 AND
c4.pos LIKE 'v%'
INNER JOIN orderedflatcorpus c5
ON c5.ID = c1.ID + 4 AND
c5.pos = 'appge'
INNER JOIN orderedflatcorpus c6
ON c6.ID = c1.ID + 5 AND
c6.lemma = 'way'
INNER JOIN orderedflatcorpus c7
ON c7.ID = c1.ID + 6 AND
c7.pos LIKE 'i%'
INNER JOIN orderedflatcorpus c8
ON c8.ID = c1.ID + 7 AND
c8.word = 'the'
INNER JOIN orderedflatcorpus c9
ON c9.ID = c1.ID + 8 AND
c9.pos LIKE 'n%'
INNER JOIN orderedflatcorpus c10
ON c10.ID = c1.ID + 9 AND
c10.WORD LIKE '%'
INNER JOIN orderedflatcorpus c11
ON c11.ID = c1.ID + 10 AND
c11.WORD LIKE '%'
WHERE c1.WORD LIKE '%'
ORDER BY c1.id
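OK, first off: all those LIKE conditions are hurting this query. Let's eliminate as many of them as we can. LIKE '%' matches every non-NULL string, so it filters out nothing except NULLs. I will assume here that WORD cannot be NULL in ORDEREDFLATCORPUS, so all of the LIKE '%' conditions can be dropped: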
SELECT c1.source, c1.word, c2.word, c3.word, c4.word,
c4.lemma, c4.pos, c5.word, c6.word, c7.word,
c8.word, c9.word, c9.lemma, c9.pos, c10.word, c11.word
FROM orderedflatcorpus c1
INNER JOIN orderedflatcorpus c2
ON c2.ID = c1.ID + 1
INNER JOIN orderedflatcorpus c3
ON c3.ID = c1.ID + 2
INNER JOIN orderedflatcorpus c4
ON c4.ID = c1.ID + 3 AND
c4.pos LIKE 'v%'
INNER JOIN orderedflatcorpus c5
ON c5.ID = c1.ID + 4 AND
c5.pos = 'appge'
INNER JOIN orderedflatcorpus c6
ON c6.ID = c1.ID + 5 AND
c6.lemma = 'way'
INNER JOIN orderedflatcorpus c7
ON c7.ID = c1.ID + 6 AND
c7.pos LIKE 'i%'
INNER JOIN orderedflatcorpus c8
ON c8.ID = c1.ID + 7 AND
c8.word = 'the'
INNER JOIN orderedflatcorpus c9
ON c9.ID = c1.ID + 8 AND
c9.pos LIKE 'n%'
INNER JOIN orderedflatcorpus c10
ON c10.ID = c1.ID + 9
INNER JOIN orderedflatcorpus c11
ON c11.ID = c1.ID + 10
ORDER BY c1.id
If WORD can be NULL, you may want to use this instead:
SELECT c1.source, c1.word, c2.word, c3.word, c4.word,
c4.lemma, c4.pos, c5.word, c6.word, c7.word,
c8.word, c9.word, c9.lemma, c9.pos, c10.word, c11.word
FROM orderedflatcorpus c1
INNER JOIN orderedflatcorpus c2
ON c2.ID = c1.ID + 1 AND
c2.WORD IS NOT NULL
INNER JOIN orderedflatcorpus c3
ON c3.ID = c1.ID + 2 AND
c3.WORD IS NOT NULL
INNER JOIN orderedflatcorpus c4
ON c4.ID = c1.ID + 3 AND
c4.pos LIKE 'v%'
INNER JOIN orderedflatcorpus c5
ON c5.ID = c1.ID + 4 AND
c5.pos = 'appge'
INNER JOIN orderedflatcorpus c6
ON c6.ID = c1.ID + 5 AND
c6.lemma = 'way'
INNER JOIN orderedflatcorpus c7
ON c7.ID = c1.ID + 6 AND
c7.pos LIKE 'i%'
INNER JOIN orderedflatcorpus c8
ON c8.ID = c1.ID + 7 AND
c8.word = 'the'
INNER JOIN orderedflatcorpus c9
ON c9.ID = c1.ID + 8 AND
c9.pos LIKE 'n%'
INNER JOIN orderedflatcorpus c10
ON c10.ID = c1.ID + 9 AND
c10.WORD IS NOT NULL
INNER JOIN orderedflatcorpus c11
ON c11.ID = c1.ID + 10 AND
c11.WORD IS NOT NULL
WHERE c1.WORD IS NOT NULL
ORDER BY c1.id
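Which of the two variants applies depends on whether WORD can actually contain NULLs. The following is a minimal sketch for checking that in PostgreSQL; it assumes the table and column are named orderedflatcorpus and word as in the question, and it only looks for a declared NOT NULL constraint rather than scanning the data.

```sql
-- LIKE '%' is true for any non-NULL string (even an empty one)
-- and yields NULL, which a WHERE or ON clause treats as "no match", for NULL input.
SELECT 'way'      LIKE '%' AS matches_word,
       ''         LIKE '%' AS matches_empty,
       NULL::text LIKE '%' AS null_input;

-- Check whether the column is declared NOT NULL (reads the catalog only).
SELECT attnotnull
FROM pg_attribute
WHERE attrelid = 'orderedflatcorpus'::regclass
  AND attname  = 'word';
```

If attnotnull is true, the version without any WORD conditions returns the same rows as the original. If it is false, the column may still contain no NULLs in practice, but confirming that would need a full count of rows where word IS NULL, which is slow on a table of this size.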
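Next, this query needs to do as much filtering as possible, as early as possible. The database query optimizer may be able to work this out on its own, but let's give it some help by putting the joins with equality filters (pos = 'appge', lemma = 'way', word = 'the') first in the join list, and then adjusting the ID arithmetic so that every offset is relative to c5, the row we fetch first: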
SELECT c1.source, c1.word, c2.word, c3.word, c4.word,
c4.lemma, c4.pos, c5.word, c6.word, c7.word,
c8.word, c9.word, c9.lemma, c9.pos, c10.word, c11.word
FROM orderedflatcorpus c5 -- PostgreSQL has no DUAL table, so start from c5 directly
INNER JOIN orderedflatcorpus c6
ON c6.ID = c5.ID + 1 AND
c6.lemma = 'way'
INNER JOIN orderedflatcorpus c8
ON c8.ID = c5.ID + 3 AND
c8.word = 'the'
INNER JOIN orderedflatcorpus c1
ON c1.ID = c5.ID - 4
INNER JOIN orderedflatcorpus c2
ON c2.ID = c5.ID - 3
INNER JOIN orderedflatcorpus c3
ON c3.ID = c5.ID - 2
INNER JOIN orderedflatcorpus c4
ON c4.ID = c5.ID - 1 AND
c4.pos LIKE 'v%'
INNER JOIN orderedflatcorpus c7
ON c7.ID = c5.ID + 2 AND
c7.pos LIKE 'i%'
INNER JOIN orderedflatcorpus c9
ON c9.ID = c5.ID + 4 AND
c9.pos LIKE 'n%'
INNER JOIN orderedflatcorpus c10
ON c10.ID = c5.ID + 5
INNER JOIN orderedflatcorpus c11
ON c11.ID = c5.ID + 6
WHERE c5.pos = 'appge' -- equality filter on the driving row
ORDER BY c1.id
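Whether the reordering changes anything depends on the plan PostgreSQL actually picks, and one way to find out is to look at the plan for the selective core of the query. The sketch below uses only the three equality-filtered aliases from the query above. Plain EXPLAIN shows the estimated plan without executing the query, so it should return quickly even on this table.

```sql
-- Sketch: inspect the plan for the equality-filtered core of the query.
-- EXPLAIN alone does not run the query; it only prints the chosen plan.
EXPLAIN
SELECT c5.id
FROM orderedflatcorpus c5
INNER JOIN orderedflatcorpus c6
        ON c6.id = c5.id + 1 AND c6.lemma = 'way'
INNER JOIN orderedflatcorpus c8
        ON c8.id = c5.id + 3 AND c8.word = 'the'
WHERE c5.pos = 'appge';
```

If the output shows sequential scans over orderedflatcorpus rather than index scans for the lookups on id, the indexes are probably the first thing to fix. As far as I know, the written join order can also matter here: the full query joins 11 relations, which is more than PostgreSQL's default join_collapse_limit of 8, so the planner does not consider every possible ordering of the explicit joins.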
Next, we need to think about which indexes would be most useful. I think the following indexes are worth having:
(ID)
(ID, WORD)
(ID, LEMMA)
(ID, POS)
Create these indexes, run this query, and see if it helps. Also, double-check the ID arithmetic: I think I got it right, but... :-)
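As a sketch, the four indexes listed above could be created as follows. The index names are made up for illustration. If id is already the primary key, an index on (id) exists and the first statement can be skipped.

```sql
-- Sketch only: index names are hypothetical, columns are those from the question.
CREATE INDEX orderedflatcorpus_id_idx       ON orderedflatcorpus (id);
CREATE INDEX orderedflatcorpus_id_word_idx  ON orderedflatcorpus (id, word);
CREATE INDEX orderedflatcorpus_id_lemma_idx ON orderedflatcorpus (id, lemma);
CREATE INDEX orderedflatcorpus_id_pos_idx   ON orderedflatcorpus (id, pos);

-- Refresh planner statistics so the new indexes are costed sensibly.
ANALYZE orderedflatcorpus;
```

Building each index on about 2 billion rows will take a while and needs extra disk space, so it may be worth creating them one at a time and re-checking the query plan after each.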
Good luck.