我有两个表,其中第一个非常大(> 50M行):
CREATE CACHED TABLE Alldistances (
word1 VARCHAR(70),
word2 VARCHAR(70),
distance INTEGER,
distcount INTEGER
);
和第二个也可能非常大(> 5M行):
CREATE CACHED TABLE tempcach (
word1 VARCHAR(70),
word2 VARCHAR(70),
distance INTEGER,
distcount INTEGER
);
两个表都有索引:
CREATE INDEX mulalldis ON Alldistances (word1, word2, distance);
CREATE INDEX multem ON tempcach (word1, word2, distance);
在我的java程序中,我使用预准备语句填充/预先组织tempcach表中的数据,然后将表合并到alldistances中:
MERGE INTO Alldistances alld USING (
SELECT word1,
word2,
distance,
distcount FROM tempcach
) AS src (
newword1,
newword2,
newdistance,
newcount
) ON (
alld.word1 = src.newword1
AND alld.word2 = src.newword2
AND alld.distance = src.newdistance
) WHEN MATCHED THEN
UPDATE SET alld.distcount = alld.distcount+src.newcount
WHEN NOT MATCHED THEN
INSERT (
word1,
word2,
distance,
distcount
) VALUES (
newword1,
newword2,
newdistance,
newcount
);
然后删除或截断tempchach表并填充新数据。 在合并期间,我得到了OOM,我猜是因为整个表在合并期间被加载到内存中。所以我将不得不批量合并,但我可以在SQL中执行此操作,还是在我的java程序中执行此操作。或者在合并时是否有一种避免OOM的聪明方法?
答案 0 :(得分:0)
可以在SQL中以块(批处理)合并。你需要
SELECT语句应使用ORDER BY和LIMIT
SELECT word1,
word2,
distance,
distcount FROM tempcach
ORDER BY primary key or unique columns
LIMIT 1000
) AS src (
合并后,delete语句将选择要删除的相同行
DELETE FROM tempcach WHERE primary key or unique columns IN
(SELECT primary key or unique columns FROM tempcach
ORDER BY primary key or unique columns LIMIT 1000)
答案 1 :(得分:0)
首先,只是因为这种事情让我烦恼,为什么要在子选择中选择临时表的所有字段?为什么不是更简单的SQL:
MERGE INTO Alldistances alld USING tempcach AS src (
newword1,
newword2,
newdistance,
newcount
) ON (
alld.word1 = src.newword1
AND alld.word2 = src.newword2
AND alld.distance = src.newdistance
) WHEN MATCHED THEN
UPDATE SET alld.distcount = alld.distcount+src.newcount
WHEN NOT MATCHED THEN
INSERT (
word1,
word2,
distance,
distcount
) VALUES (
newword1,
newword2,
newdistance,
newcount
);
让数据库避免将整个表加载到内存中所需要的是对两个表进行索引。
CREATE INDEX all_data ON Alldistances (word1, word2, distance);
CREATE INDEX tempcach_data ON tempcach (word1, word2, distance);