Oracle MERGE INTO with text similarity

Time: 2018-01-09 23:53:39

Tags: oracle oracle11g

I have 2 tables: from_country and to_country. I want to bring new and updated records into to_country.

Definitions and data

-- Source table: incoming country names
CREATE TABLE from_country
(
  country_code varchar2(255) not null
);

-- Target table: country names to be updated or extended
CREATE TABLE to_country
(
  country_code varchar2(255) not null
);

-- Meaning match
INSERT INTO from_country
(country_code)
VALUES
('United States of America');

-- Match 100%
INSERT INTO from_country
(country_code)
VALUES
('UGANDA');

-- Meaning match, but with domain knowledge
INSERT INTO from_country
(country_code)
VALUES
('CON CORRECT');

-- Brand new country
INSERT INTO from_country
(country_code)
VALUES
('NEW');


-- Meaning match
INSERT INTO to_country
(country_code)
VALUES
('USA');

-- Match 100%
INSERT INTO to_country
(country_code)
VALUES
('UGANDA');

-- Meaning match, but with domain knowledge
INSERT INTO to_country
(country_code)
VALUES
('CON');

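I need to run a merge that brings the data from from_country into to_country.

This is my first attempt, but it only matches on equality, which is not good enough. I need something smarter that can match on meaning. If anyone knows how to do this, please share your solution.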

merge into 
  to_country to_t
using
  from_country from_t
on
  (to_t.country_code = from_t.country_code)
when not matched then insert (
  country_code
)
values (
  from_t.country_code
);

In short, this is what I want:

from_table:
United States of America
UGANDA
CON CORRECT
NEW


to_table:
USA
UGANDA
CON

After the Oracle MERGE INTO, the new to_country table:
United States of America
UGANDA
CON CORRECT
NEW

SQL Fiddle: http://sqlfiddle.com/#!4/f512d

Please note that this is a simplified example. My real data set is much larger.

1 answer:

Answer 0 (score: 1)

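Because a match is not guaranteed to be unique, you have to write a query that applies some decision rule and returns exactly one match per target row.

Here is a simplified case that uses a naive match and then picks one value when there are multiple matches: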

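-- Naive match: a from_country value matches a to_country value
-- when it starts with that value (prefix match via LIKE).
merge into to_country t
using (
  select * from (
    select t.rowid as trowid
          ,f.country_code as fcode
          ,t.country_code as tcode
          -- Source rows without any target match always get rank 1 so they
          -- are inserted; matched rows are ranked per target value so that
          -- only one source row updates each target row.
          ,case when t.country_code is null then 1 else
             row_number()
             over (partition by t.country_code
                   order by f.country_code)
           end as match_no
    from from_country f
    left join to_country t
    on f.country_code like t.country_code || '%'
  ) where match_no = 1
  ) s
-- Join back on ROWID: Oracle does not allow updating a column that is
-- referenced in the ON clause (ORA-38104), and country_code is updated below.
on (s.trowid = t.rowid)
when matched then update set country_code = s.fcode
when not matched then insert (country_code) values (s.fcode);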

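Resulting to_country (the outputs in this answer appear to come from a version of the sample data whose fourth from_country row is 'CON NEW' rather than 'NEW'; with the naive match, 'CON NEW' ranks behind 'CON CORRECT' and is discarded):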

USA
UGANDA
CON CORRECT
United States of America

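Now that this works, you only need to make the matching algorithm smarter. For that you have to look at your whole data set and see what kinds of errors it contains, for example typos.

You can try some of the functions in the Oracle-supplied UTL_MATCH package: https://docs.oracle.com/cd/E18283_01/appdev.112/e16760/u_match.htm, for example EDIT_DISTANCE or JARO_WINKLER.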
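As a quick way to get a feel for these functions, the following sketch calls them on two literals from the sample data. EDIT_DISTANCE returns the raw number of edits and JARO_WINKLER a score between 0 and 1, while the _SIMILARITY variants return an integer from 0 (no match) to 100 (perfect match), which is usually easier to compare against a cutoff. The column aliases are only illustrative.

```sql
-- Compare one pair of strings with the four UTL_MATCH functions.
select utl_match.edit_distance('UGANDA', 'USA')            as ed_raw  -- number of edits
      ,utl_match.edit_distance_similarity('UGANDA', 'USA') as ed_sim  -- 0..100
      ,utl_match.jaro_winkler('UGANDA', 'USA')             as jw_raw  -- 0..1
      ,utl_match.jaro_winkler_similarity('UGANDA', 'USA')  as jw_sim  -- 0..100
from dual;
```

All four are case-sensitive, so it may be worth wrapping both arguments in UPPER() if the data mixes cases.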

Here is an example using the Jaro-Winkler algorithm:

merge into to_country t
using (
  select * from (
    select t.rowid as trowid
          ,f.country_code as fcode
          ,t.country_code as tcode
          ,case when t.country_code is null then 1
           else row_number() over (
                partition by t.country_code
                order by utl_match.jaro_winkler_similarity(f.country_code,t.country_code) desc)
           end as match_no
    from from_country f
    left join to_country t
    on utl_match.jaro_winkler_similarity(f.country_code,t.country_code) > 70
  ) where match_no = 1
  ) s
on (s.trowid = t.rowid)
when matched then update set country_code = s.fcode
when not matched then insert (country_code) values (s.fcode);

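SQL Fiddle: http://sqlfiddle.com/#!4/f512d/23

Note that I chose an arbitrary cutoff of greater than 70. That is because UGANDA has a Jaro-Winkler similarity of 70 to USA, so a cutoff at or below that value would let UGANDA overwrite USA.

The merge gives the following result: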

United States of America
USA
UGANDA
CON NEW

To see how these algorithms behave on your data, run the following:

select f.country_code as fcode
      ,t.country_code as tcode
      ,utl_match.edit_distance_similarity(f.country_code,t.country_code) as ed
      ,utl_match.jaro_winkler_similarity(f.country_code,t.country_code) as jw
from from_country f
cross join to_country t
order by 2, 4 desc;

FCODE                     TCODE    ED   JW
========================  ======  ===  ===
CON NEW                   CON      43   86
CON CORRECT               CON      28   83
UGANDA                    CON      17   50
United States of America  CON       0    0

UGANDA                    UGANDA  100  100
United States of America  UGANDA    9   46
CON NEW                   UGANDA   15   43
CON CORRECT               UGANDA    0   41

UGANDA                    USA      34   70
United States of America  USA      13   62
CON CORRECT               USA       0    0
CON NEW                   USA       0    0

SQL Fiddle: http://sqlfiddle.com/#!4/f512d/22
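The score table shows 'United States of America' reaching only 62 against 'USA', which is below the cutoff, so string similarity alone does not cover matches that need domain knowledge. One possible way around this, sketched below, is to resolve known aliases with an explicit lookup table before the fuzzy merge runs. The country_alias table and its columns are hypothetical and not part of the original post, and the sketch assumes each alias and each from_country value appears only once.

```sql
-- Hypothetical lookup table (an assumption, not in the original post):
-- maps a short code already stored in to_country to its full name.
create table country_alias
(
  alias_code   varchar2(255) not null,
  country_code varchar2(255) not null
);

insert into country_alias (alias_code, country_code)
values ('USA', 'United States of America');

-- Rename target rows whose value is a known alias, but only when the
-- full name actually arrives in from_country.
merge into to_country t
using (
  select t2.rowid as trowid
        ,a.country_code as fcode
  from country_alias a
  join from_country f on f.country_code = a.country_code
  join to_country t2 on t2.country_code = a.alias_code
) s
-- Join on ROWID for the same reason as in the merges above:
-- the updated column may not appear in the ON clause.
on (s.trowid = t.rowid)
when matched then update set country_code = s.fcode;
```

Run first, this turns 'USA' into 'United States of America', so the fuzzy merge afterwards sees an exact match for that row instead of inserting a duplicate.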