How can I avoid an OOM in HSQLDB when doing a merge?

Asked: 2013-03-10 20:17:46

Tags: java sql merge hsqldb

I have two tables. The first one is very large (> 50M rows):

CREATE CACHED TABLE Alldistances (
    word1 VARCHAR(70), 
    word2 VARCHAR(70), 
    distance INTEGER, 
    distcount INTEGER
);

The second one can also be quite large (> 5M rows):

CREATE CACHED TABLE tempcach (
    word1 VARCHAR(70), 
    word2 VARCHAR(70), 
    distance INTEGER, 
    distcount INTEGER
);

Both tables are indexed:

CREATE INDEX mulalldis ON Alldistances (word1, word2, distance);
CREATE INDEX multem ON tempcach (word1, word2, distance);

In my Java program I use prepared statements to fill and pre-organize the data in the tempcach table, and then I merge that table into Alldistances:

MERGE INTO Alldistances alld USING ( 

    SELECT word1, 
           word2, 
           distance, 
           distcount FROM tempcach 

    ) AS src (

        newword1, 
        newword2, 
        newdistance, 
        newcount

    ) ON (

            alld.word1 = src.newword1 
        AND alld.word2 = src.newword2 
        AND alld.distance = src.newdistance 

    ) WHEN MATCHED THEN 

        UPDATE SET alld.distcount = alld.distcount+src.newcount 

    WHEN NOT MATCHED THEN 

        INSERT (

            word1, 
            word2, 
            distance, 
            distcount

        ) VALUES (

            newword1, 
            newword2, 
            newdistance, 
            newcount
        );

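Afterwards the tempcach table is dropped or truncated and filled with new data. During the merge I get an OOM, I guess because the whole table is loaded into memory while merging. So it seems I will have to merge in batches, but can I do that in SQL, or do I have to do it in my Java program? Or is there some clever way to avoid the OOM during the merge altogether?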

2 answers:

Answer 0: (score: 0)

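It is possible to merge in chunks (batches) in SQL. You need to

  • limit the number of rows taken from the temp table in each chunk
  • delete those same rows from the temp table
  • repeat until the temp table is empty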

The SELECT statement inside the USING clause should use ORDER BY and LIMIT:

SELECT word1, 
       word2, 
       distance, 
       distcount FROM tempcach
       ORDER BY primary key or unique columns 
       LIMIT 1000

) AS src (

After the merge, a DELETE statement selects the same rows, using the same ORDER BY and LIMIT, and removes them:

DELETE FROM tempcach WHERE primary key or unique columns IN
      (SELECT primary key or unique columns FROM tempcach 
       ORDER BY primary key or unique columns LIMIT 1000)
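
The merge, delete, repeat loop above could be driven from the Java side roughly as follows. This is only a sketch under assumptions that the question does not confirm: that (word1, word2, distance) is unique and non-null within tempcach (the tables as posted declare no primary key), and that the HSQLDB version in use accepts a row-value IN predicate and ORDER BY ... LIMIT inside a subquery. The class name ChunkedMerge and its methods are made up for illustration.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

class ChunkedMerge {

    // Assumption: these columns identify a tempcach row uniquely and are never NULL.
    static final String KEY = "word1, word2, distance";

    /** MERGE that reads only the first chunkSize rows of tempcach, in key order. */
    static String mergeSql(int chunkSize) {
        return "MERGE INTO Alldistances alld USING ("
             + " SELECT word1, word2, distance, distcount FROM tempcach"
             + " ORDER BY " + KEY + " LIMIT " + chunkSize
             + ") AS src (newword1, newword2, newdistance, newcount)"
             + " ON (alld.word1 = src.newword1"
             + " AND alld.word2 = src.newword2"
             + " AND alld.distance = src.newdistance)"
             + " WHEN MATCHED THEN UPDATE SET alld.distcount = alld.distcount + src.newcount"
             + " WHEN NOT MATCHED THEN INSERT (word1, word2, distance, distcount)"
             + " VALUES (newword1, newword2, newdistance, newcount)";
    }

    /** DELETE that removes the same first chunkSize rows the MERGE just read. */
    static String deleteSql(int chunkSize) {
        return "DELETE FROM tempcach WHERE (" + KEY + ") IN ("
             + "SELECT " + KEY + " FROM tempcach"
             + " ORDER BY " + KEY + " LIMIT " + chunkSize + ")";
    }

    /** Repeats merge + delete until tempcach is empty; returns the number of chunks merged. */
    static int mergeInChunks(Connection conn, int chunkSize) throws SQLException {
        boolean oldAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false);
        int chunks = 0;
        try (Statement st = conn.createStatement()) {
            while (true) {
                st.executeUpdate(mergeSql(chunkSize));
                int deleted = st.executeUpdate(deleteSql(chunkSize));
                conn.commit();              // merge and delete of one chunk commit together
                if (deleted == 0) {
                    break;                  // nothing left in tempcach
                }
                chunks++;
            }
        } catch (SQLException e) {
            conn.rollback();                // undo the half-finished chunk
            throw e;
        } finally {
            conn.setAutoCommit(oldAutoCommit);
        }
        return chunks;
    }
}
```

Committing after each chunk keeps every transaction small, and running the MERGE and the DELETE in the same transaction means a failure cannot leave a chunk merged but not deleted. If the key columns were not unique in tempcach, the DELETE could remove more rows than the MERGE read, so the uniqueness assumption matters.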

Answer 1: (score: 0)

First, just because this kind of thing bugs me: why select all the fields of the temp table in a subselect? Why not use the simpler SQL:

MERGE INTO Alldistances alld USING tempcach AS src (
    newword1, 
    newword2, 
    newdistance, 
    newcount
) ON (
        alld.word1 = src.newword1 
    AND alld.word2 = src.newword2 
    AND alld.distance = src.newdistance 
) WHEN MATCHED THEN 
    UPDATE SET alld.distcount = alld.distcount+src.newcount 
WHEN NOT MATCHED THEN 
    INSERT (
        word1, 
        word2, 
        distance, 
        distcount
    ) VALUES (
        newword1, 
        newword2, 
        newdistance, 
        newcount
    );

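What the database needs in order to avoid loading the whole table into memory is an index on both tables, covering the columns used in the ON clause: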

CREATE INDEX all_data ON Alldistances (word1, word2, distance);
CREATE INDEX tempcach_data ON tempcach (word1, word2, distance);