Optimizing nested FOR loops in Oracle to find similar titles

Date: 2017-01-10 09:14:03

Tags: sql oracle plsql oracle11g

I have two tables that hold the same values, but they come from different sources.

Table 1
------------
ID  Title
1   Introduction to Science
2   Introduction to C
3   Let is C
4   C
5   Java

Table 2
------------------------
ID  Title
a   Intro to Science
b   Intro to C
c   Let is C
d   C
e   Java

I want to compare every title in Table 1 against the titles in Table 2 and find the similar matches.

I used Oracle's built-in function UTL_MATCH.edit_distance_similarity(LS_Title, LSO_Title);

Script:

DECLARE
LS_count      NUMBER;
LSO_count     NUMBER;
percentage    NUMBER;
LS_Title      VARCHAR2 (4000);
LSO_Title     VARCHAR2 (4000);
LS_CPNT_ID    VARCHAR2 (64);
LSO_CPNT_ID   VARCHAR2 (64);
BEGIN
SELECT COUNT (*) INTO LS_count FROM tbl_zim_item;
SELECT COUNT (*) INTO LSO_count FROM tbl_zim_lso_item;
DBMS_OUTPUT.put_line ('value of a: ' || LS_count);
DBMS_OUTPUT.put_line ('value of a: ' || LSO_count);
FOR i IN 1 .. LS_count
LOOP
  SELECT cpnt_title
    INTO LS_Title
    FROM tbl_zim_item
   WHERE iden = i;

  SELECT cpnt_id
    INTO LS_CPNT_ID
    FROM tbl_zim_item
   WHERE iden = i;

  FOR j IN 1 .. lso_count
  LOOP
     SELECT cpnt_title
       INTO LSO_Title
       FROM tbl_zim_lso_item
      WHERE iden = j;

     SELECT cpnt_id
       INTO LSO_CPNT_ID
       FROM tbl_zim_lso_item
      WHERE iden = j;

     percentage :=
        UTL_MATCH.edit_distance_similarity (LS_Title, LSO_Title);

     IF percentage > 50
     THEN
        INSERT INTO title_sim
             VALUES (ls_cpnt_id,
                     ls_title,
                     lso_cpnt_id,
                     lso_title,
                     percentage);
     END IF;
  END LOOP;
END LOOP;
END;

This has been running for more than 15 hours. Please suggest a better solution. Note: Table 1 has 20,000 records and Table 2 has 10,000 records.

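3 answers:

Answer 0: (score: 2)

Unless I'm missing something, you don't need all the looping and row-by-row lookups, because SQL can do a cross join. So my first attempt would simply be: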

insert into title_sim
     ( columns... )
select ls_cpnt_id
     , ls_title
     , lso_cpnt_id
     , lso_title
     , percentage
from   ( select i.cpnt_id     as ls_cpnt_id
              , i.cpnt_title  as ls_title
              , li.cpnt_id    as lso_cpnt_id
              , li.cpnt_title as lso_title
              , case  -- Using Boneist's suggestion:
                    when i.cpnt_title = li.cpnt_title then 100
                    else utl_match.edit_distance_similarity(i.cpnt_title, li.cpnt_title)
                end as percentage
         from   tbl_zim_item i
                cross join tbl_zim_lso_item li )
where  percentage > 50;

If the titles contain many repeats, you may get some benefit from scalar subquery caching by wrapping the utl_match.edit_distance_similarity function call in a ( select ... from dual ).
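A minimal sketch of that wrapping, reusing the table and column names from the question. Only the percentage expression changes. Whether the cache actually helps depends on how often the same pair of titles recurs, so treat this as something to measure, not a guaranteed gain. The insert is left out here so the query can be tried on its own first.

```sql
-- Same cross join as above, but the function call sits inside a scalar
-- subquery so Oracle may reuse cached results for repeated title pairs.
select ls_cpnt_id
     , ls_title
     , lso_cpnt_id
     , lso_title
     , percentage
from   ( select i.cpnt_id     as ls_cpnt_id
              , i.cpnt_title  as ls_title
              , li.cpnt_id    as lso_cpnt_id
              , li.cpnt_title as lso_title
              , ( select utl_match.edit_distance_similarity(i.cpnt_title, li.cpnt_title)
                  from   dual ) as percentage
         from   tbl_zim_item i
                cross join tbl_zim_lso_item li )
where  percentage > 50;
```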

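If the titles are often exactly the same, and assuming the percentage should be 100% in those cases, you can avoid calling the function when the titles match exactly: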

declare
    -- Row counts, used only for the diagnostic output below.
    ls_count   number;
    lso_count  number;
begin
    select count(*) into ls_count from tbl_zim_item;
    select count(*) into lso_count from tbl_zim_lso_item;

    dbms_output.put_line('tbl_zim_item contains ' || ls_count || ' rows.');
    dbms_output.put_line('tbl_zim_lso_item contains ' || lso_count || ' rows.');

    for r in (
        select i.cpnt_id     as ls_cpnt_id
             , i.cpnt_title  as ls_title
             , li.cpnt_id    as lso_cpnt_id
             , li.cpnt_title as lso_title
             , case
                   when i.cpnt_title = li.cpnt_title then 100 else 0
               end as percentage
        from   tbl_zim_item i
               cross join tbl_zim_lso_item li
    )
    loop
        -- Only call the function when the titles are not identical.
        if r.percentage < 100 then
            r.percentage := utl_match.edit_distance_similarity(r.ls_title, r.lso_title);
        end if;

        if r.percentage > 50 then
            -- The values must come from the loop record r.
            insert into title_sim (columns...)
            values
            ( r.ls_cpnt_id
            , r.ls_title
            , r.lso_cpnt_id
            , r.lso_title
            , r.percentage );
        end if;
    end loop;
end;

Answer 1: (score: 1)

I would simply join the two tables together instead of looping over all the data, e.g.:

WITH t1 AS (SELECT 1 ID, 'Introduction to Science' title FROM dual UNION ALL
            SELECT 2 ID, 'Introduction to C' title FROM dual UNION ALL
            SELECT 3 ID, 'Let is C' title FROM dual UNION ALL
            SELECT 4 ID, 'C' title FROM dual UNION ALL
            SELECT 5 ID, 'Java' title FROM dual UNION ALL
            SELECT 6 ID, 'Oracle for Newbies' title FROM dual),
     t2 AS (SELECT 'a' ID, 'Intro to Science' title FROM dual UNION ALL
            SELECT 'b' ID, 'Intro to C' title FROM dual UNION ALL
            SELECT 'c' ID, 'Let is C' title FROM dual UNION ALL
            SELECT 'd' ID, 'C' title FROM dual UNION ALL
            SELECT 'e' ID, 'Java' title FROM dual UNION ALL
            SELECT 'f' ID, 'PL/SQL rocks!' title FROM dual)
SELECT t1.title t1_title,
       t2.title t2_title,
       UTL_MATCH.edit_distance_similarity(t1.title, t2.title)
FROM   t1
       INNER JOIN t2 ON UTL_MATCH.edit_distance_similarity(t1.title, t2.title) > 50;

T1_TITLE                T2_TITLE         UTL_MATCH.EDIT_DISTANCE_SIMILA
----------------------- ---------------- ------------------------------
Introduction to Science Intro to Science                             70
Introduction to C       Intro to C                                   59
Let is C                Let is C                                    100
C                       C                                           100
Java                    Java                                        100

By doing this, you can reduce the whole thing to a single DML statement, such as:

INSERT INTO title_sim (t1_id,
                       t1_title,
                       t2_id,
                       t2_title,
                       percentage)
SELECT t1.id t1_id,
       t1.title t1_title,
       t2.id t2_id,
       t2.title t2_title,
       UTL_MATCH.edit_distance_similarity(t1.title, t2.title) percentage
FROM   t1
       INNER JOIN t2 ON UTL_MATCH.edit_distance_similarity(t1.title, t2.title) > 50;

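This should be faster than your row-by-row attempt, especially since you are needlessly selecting from each table twice.

By the way, you do know that a single query can select several columns into several variables, right?

So instead of: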

SELECT cpnt_title
  INTO LS_Title
  FROM tbl_zim_item
 WHERE iden = i;

SELECT cpnt_id
  INTO LS_CPNT_ID
  FROM tbl_zim_item
 WHERE iden = i;

you could do:

SELECT cpnt_title, cpnt_id
  INTO LS_Title, LS_CPNT_ID
  FROM tbl_zim_item
 WHERE iden = i;

Answer 2: (score: -1)

https://www.techonthenet.com/oracle/intersect.php

This will give you the data that the two queries have in common (exact matches only):

 select title from table_1 
 intersect 
 select title from table_2