我有两个表具有相同的值,但bot来自不同的来源。
Table 1
------------
ID Title
1 Introduction to Science
2 Introduction to C
3 Let is C
4 C
5 Java
Table 2
------------------------
ID Title
a Intro to Science
b Intro to C
c Let is C
d C
e Java
我想将表1中的所有标题与表2中的标题进行比较,并找到相似性匹配。
我在orcale中使用了内置函数“ UTL_MATCH.edit_distance_similarity(LS_Title,LSO_Title);”
脚本:
DECLARE
LS_count NUMBER;
LSO_count NUMBER;
percentage NUMBER;
LS_Title VARCHAR2 (4000);
LSO_Title VARCHAR2 (4000);
LS_CPNT_ID VARCHAR2 (64);
LSO_CPNT_ID VARCHAR2 (64);
BEGIN
SELECT COUNT (*) INTO LS_count FROM tbl_zim_item;
SELECT COUNT (*) INTO LSO_count FROM tbl_zim_lso_item;
DBMS_OUTPUT.put_line ('value of a: ' || LS_count);
DBMS_OUTPUT.put_line ('value of a: ' || LSO_count);
FOR i IN 1 .. LS_count
LOOP
SELECT cpnt_title
INTO LS_Title
FROM tbl_zim_item
WHERE iden = i;
SELECT cpnt_id
INTO LS_CPNT_ID
FROM tbl_zim_item
WHERE iden = i;
FOR j IN 1 .. lso_count
LOOP
SELECT cpnt_title
INTO LSO_Title
FROM tbl_zim_lso_item
WHERE iden = j;
SELECT cpnt_id
INTO LSO_CPNT_ID
FROM tbl_zim_lso_item
WHERE iden = j;
percentage :=
UTL_MATCH.edit_distance_similarity (LS_Title, LSO_Title);
IF percentage > 50
THEN
INSERT INTO title_sim
VALUES (ls_cpnt_id,
ls_title,
lso_cpnt_id,
lso_title,
percentage);
END IF;
END LOOP;
END LOOP;
END;
运行超过15个小时。请提供更好的解决方案。 注意:我的表1有20000条记录,表2有10000条记录。
答案 0 :(得分:2)
除非我遗漏了某些东西,否则你不需要所有的循环和逐行查找,因为SQL可以进行交叉连接。因此,我的第一次尝试就是:
insert into title_sim
( columns... )
select ls_cpnt_id
, ls_title
, lso_cpnt_id
, lso_title
, percentage
from ( select i.cpnt_id as ls_cpnt_id
, i.cpnt_title as ls_title
, li.cpnt_id as lso_cpnt_id
, li.cpnt_title as lso_title
, case -- Using Boneist's suggestion:
when i.cpnt_title = li.cpnt_title then 100
else utl_match.edit_distance_similarity(i.cpnt_title, li.cpnt_title)
end as percentage
from tbl_zim_item i
cross join tbl_zim_lso_item li )
where percentage > 50;
如果标题中有多次重复,您可以通过将utl_match.edit_distance_similarity
函数包装在( select ... from dual )
中来从一些标量子查询缓存中受益。
如果标题通常完全相同,并假设在这些情况下百分比应该是100%,那么当标题完全匹配时,您可能会避免调用该函数:
begin
select count(*) into ls_count from tbl_zim_item;
select count(*) into lso_count from tbl_zim_lso_item;
dbms_output.put_line('tbl_zim_item contains ' || ls_count || ' rows.');
dbms_output.put_line('tbl_zim_lso_item contains ' || lso_count || ' rows.');
for r in (
select i.cpnt_id as ls_cpnt_id
, i.cpnt_title as ls_title
, li.cpnt_id as lso_cpnt_id
, li.cpnt_title as lso_title
, case
when i.cpnt_title = li.cpnt_title then 100 else 0
end as percentage
from tbl_zim_item i
cross join tbl_zim_lso_item li
)
loop
if r.percentage < 100 then
r.percentage := utl_match.edit_distance_similarity(r.ls_title, r.lso_title);
end if;
if r.percentage > 50 then
insert into title_sim (columns...)
values
( ls_cpnt_id
, ls_title
, lso_cpnt_id
, lso_title
, percentage );
end if;
end loop;
end;
答案 1 :(得分:1)
我只是将两个表连接在一起,而不是遍历所有数据,例如:
WITH t1 AS (SELECT 1 ID, 'Introduction to Science' title FROM dual UNION ALL
SELECT 2 ID, 'Introduction to C' title FROM dual UNION ALL
SELECT 3 ID, 'Let is C' title FROM dual UNION ALL
SELECT 4 ID, 'C' title FROM dual UNION ALL
SELECT 5 ID, 'Java' title FROM dual UNION ALL
SELECT 6 ID, 'Oracle for Newbies' title FROM dual),
t2 AS (SELECT 'a' ID, 'Intro to Science' title FROM dual UNION ALL
SELECT 'b' ID, 'Intro to C' title FROM dual UNION ALL
SELECT 'c' ID, 'Let is C' title FROM dual UNION ALL
SELECT 'd' ID, 'C' title FROM dual UNION ALL
SELECT 'e' ID, 'Java' title FROM dual UNION ALL
SELECT 'f' ID, 'PL/SQL rocks!' title FROM dual)
SELECT t1.title t1_title,
t2.title t2_title,
UTL_MATCH.edit_distance_similarity(t1.title, t2.title)
FROM t1
INNER JOIN t2 ON UTL_MATCH.edit_distance_similarity(t1.title, t2.title) > 50;
T1_TITLE T2_TITLE UTL_MATCH.EDIT_DISTANCE_SIMILA
----------------------- ---------------- ------------------------------
Introduction to Science Intro to Science 70
Introduction to C Intro to C 59
Let is C Let is C 100
C C 100
Java Java 100
通过这样做,您可以将整个事物简化为单个DML语句,例如:
INSERT INTO title_sim (t1_id,
t1_title,
t2_id,
t2_title,
percentage)
SELECT t1.id t1_id,
t1.title t1_title,
t2.id t2_id,
t2.title t2_title,
UTL_MATCH.edit_distance_similarity(t1.title, t2.title) percentage
FROM t1
INNER JOIN t2 ON UTL_MATCH.edit_distance_similarity(t1.title, t2.title) > 50;
这应该比你的逐行尝试更快,特别是因为你不必要地从每个表中选择两次。
顺便说一句,您知道可以在同一个查询中为多个变量选择多个列,对吧?
所以不要:
SELECT cpnt_title
INTO LS_Title
FROM tbl_zim_item
WHERE iden = i;
SELECT cpnt_id
INTO LS_CPNT_ID
FROM tbl_zim_item
WHERE iden = i;
你可以改为:
SELECT cpnt_title, cpnt_id
INTO LS_Title, LS_CPNT_ID
FROM tbl_zim_item
WHERE iden = i;
答案 2 :(得分:-1)
https://www.techonthenet.com/oracle/intersect.php
这将为您提供两个查询中相似的数据
select title from table_1
intersect
select title from table_2