I have two tables: from_country and to_country. I want to bring new and updated records into to_country.
Definitions and data:
--
CREATE TABLE from_country
(
country_code varchar2(255) not null
);
--
CREATE TABLE to_country
(
country_code varchar2(255) not null
);
-- Meaning match
INSERT INTO from_country
(country_code)
VALUES
('United States of America');
-- Match 100%
INSERT INTO from_country
(country_code)
VALUES
('UGANDA');
-- Meaning match, but with domain knowledge
INSERT INTO from_country
(country_code)
VALUES
('CON CORRECT');
-- Brand new country
INSERT INTO from_country
(country_code)
VALUES
('NEW');
--
INSERT INTO to_country
(country_code)
VALUES
('USA');
-- Match 100%
INSERT INTO to_country
(country_code)
VALUES
('UGANDA');
-- Meaning match, but with domain knowledge
INSERT INTO to_country
(country_code)
VALUES
('CON');
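I need to run a MERGE that brings the data from from_country into to_country.
This is my first attempt, but it only matches on equality, which is not good enough. I need something smarter that can match on meaning. If anyone knows how to do this, please post your solution.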
merge into
to_country to_t
using
from_country from_t
on
(to_t.country_code = from_t.country_code)
when not matched then insert (
country_code
)
values (
from_t.country_code
);
In short, this is what I want:
from_country:
United States of America
UGANDA
CON CORRECT
NEW
to_country:
USA
UGANDA
CON
After the Oracle MERGE INTO, the new to_country table:
United States of America
UGANDA
CON CORRECT
NEW
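SQL Fiddle: http://sqlfiddle.com/#!4/f512d
Note that this is a simplified example. My real data set is much larger.
Answer 0 (score: 1)
Because a match is not guaranteed to be unique, you have to write a query that applies some decision rule and returns only one match per target row.
Here is a simplified case that uses a naive prefix match and then picks one value when there are several matches: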
merge into to_country t
using (
select * from (
select t.rowid as trowid
,f.country_code as fcode
,t.country_code as tcode
,case when t.country_code is null then 1 else
row_number()
over (partition by t.country_code
order by f.country_code)
end as match_no
from from_country f
left join to_country t
on f.country_code like t.country_code || '%'
) where match_no = 1
) s
on (s.trowid = t.rowid)
when matched then update set country_code = s.fcode
when not matched then insert (country_code) values (s.fcode);
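This results in the following to_country. (The outputs in this answer appear to come from a from_country whose fourth row is 'CON NEW' rather than 'NEW', as the comparison table at the end shows.)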
USA
UGANDA
CON CORRECT
United States of America
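The uniqueness problem can be inspected before running the merge. The query below is a minimal sketch that lists every to_country value matched by more than one from_country value under the same naive prefix rule. The column aliases are illustrative, and LISTAGG assumes Oracle 11g Release 2 or later.

```sql
-- Sketch: find to_country values with more than one candidate match
-- under the naive prefix rule (f.country_code like t.country_code || '%').
select t.country_code as tcode
      ,count(*)       as candidate_count
      ,listagg(f.country_code, ' | ')
         within group (order by f.country_code) as candidates
from from_country f
join to_country t
  on f.country_code like t.country_code || '%'
group by t.country_code
having count(*) > 1;
```

Any row returned here is a case where the row_number() ordering in the merge decides which source value wins, so those are the rows worth checking by hand.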
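Now that this works, you only need to make the matching algorithm smarter. For that you have to look at the whole data set and see what kinds of errors occur, such as typos.
You can try some of the functions in the UTL_MATCH package that Oracle provides: https://docs.oracle.com/cd/E18283_01/appdev.112/e16760/u_match.htm (for example EDIT_DISTANCE or JARO_WINKLER).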
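To get a feel for what these functions return before using them in a merge, you can call them directly on a single pair of values. This is only a small sketch: the pair is taken from the sample data and the column aliases are illustrative.

```sql
-- Sketch: compare one pair of values with the four UTL_MATCH functions.
select utl_match.edit_distance('UGANDA', 'USA')            as ed_raw  -- number of edits
      ,utl_match.edit_distance_similarity('UGANDA', 'USA') as ed_sim  -- normalized, 0 to 100
      ,utl_match.jaro_winkler('UGANDA', 'USA')             as jw_raw  -- 0 to 1
      ,utl_match.jaro_winkler_similarity('UGANDA', 'USA')  as jw_sim  -- normalized, 0 to 100
from dual;
```

The *_SIMILARITY variants are usually easier to use in a join condition, because they are already normalized to the same 0 to 100 scale regardless of string length.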
Here is an example that uses the Jaro-Winkler algorithm:
merge into to_country t
using (
select * from (
select t.rowid as trowid
,f.country_code as fcode
,t.country_code as tcode
,case when t.country_code is null then 1
else row_number() over (
partition by t.country_code
order by utl_match.jaro_winkler_similarity(f.country_code,t.country_code) desc)
end as match_no
from from_country f
left join to_country t
on utl_match.jaro_winkler_similarity(f.country_code,t.country_code) > 70
) where match_no = 1
) s
on (s.trowid = t.rowid)
when matched then update set country_code = s.fcode
when not matched then insert (country_code) values (s.fcode);
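SQL Fiddle: http://sqlfiddle.com/#!4/f512d/23
Note that I chose an arbitrary cutoff of greater than 70. That is because the Jaro-Winkler similarity between UGANDA and USA is exactly 70, so a strict "> 70" keeps those two from being treated as a match.
This gives the following result: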
United States of America
USA
UGANDA
CON NEW
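Since the cutoff is arbitrary, it may be worth previewing what the merge would do before running it. The sketch below reuses the same subquery as a plain SELECT. The merge_action, old_code and new_code aliases are made up for illustration and are not part of the original answer.

```sql
-- Sketch: dry run of the Jaro-Winkler merge.
-- Rows that carry a to_country rowid would be updated, the rest inserted.
select case when s.trowid is null then 'INSERT' else 'UPDATE' end as merge_action
      ,s.tcode as old_code
      ,s.fcode as new_code
from (
  select t.rowid as trowid
        ,f.country_code as fcode
        ,t.country_code as tcode
        ,case when t.country_code is null then 1
              else row_number() over (
                     partition by t.country_code
                     order by utl_match.jaro_winkler_similarity(f.country_code,t.country_code) desc)
         end as match_no
  from from_country f
  left join to_country t
    on utl_match.jaro_winkler_similarity(f.country_code,t.country_code) > 70
) s
where s.match_no = 1
order by merge_action, new_code;
```

Source rows that lose the row_number() tie-break (match_no greater than 1) do not appear at all, which mirrors how the merge silently skips them.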
To see how these algorithms behave on your data, run the following:
select f.country_code as fcode
,t.country_code as tcode
,utl_match.edit_distance_similarity(f.country_code,t.country_code) as ed
,utl_match.jaro_winkler_similarity(f.country_code,t.country_code) as jw
from from_country f
cross join to_country t
order by 2, 4 desc;
FCODE                     TCODE    ED   JW
========================  ======  ===  ===
CON NEW                   CON      43   86
CON CORRECT               CON      28   83
UGANDA                    CON      17   50
United States of America  CON       0    0
UGANDA                    UGANDA  100  100
United States of America  UGANDA    9   46
CON NEW                   UGANDA   15   43
CON CORRECT               UGANDA    0   41
UGANDA                    USA      34   70
United States of America  USA      13   62
CON CORRECT               USA       0    0
CON NEW                   USA       0    0
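The table also shows that 'United States of America' versus 'USA' scores only 62, below the cutoff, so string similarity alone will not catch matches that depend on domain knowledge. One possible way to handle those, sketched here under assumptions, is an explicit lookup table that is consulted before falling back to equality. The country_alias table and its columns are hypothetical and not part of the original schema. The sketch also assumes that each to_country row is reached by at most one from_country value, otherwise the merge fails because the source rows are not stable.

```sql
-- Hypothetical lookup table holding known "meaning" matches.
create table country_alias
(
  from_code varchar2(255) not null,
  to_code   varchar2(255) not null
);

insert into country_alias (from_code, to_code)
values ('United States of America', 'USA');

-- Resolve each source value through the alias table first and fall back
-- to plain equality when no alias exists. The join to the target is done
-- by rowid, as in the merges above, so that country_code can be updated.
merge into to_country t
using (
  select t2.rowid       as trowid
        ,f.country_code as fcode
  from from_country f
  left join country_alias a
    on a.from_code = f.country_code
  left join to_country t2
    on t2.country_code = nvl(a.to_code, f.country_code)
) s
on (s.trowid = t.rowid)
when matched then update set t.country_code = s.fcode
when not matched then insert (country_code) values (s.fcode);
```

This could be combined with the similarity-based join, for example by using the alias when one exists and the Jaro-Winkler score otherwise, but the alias table then has to be maintained by someone who knows the data.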