SQL创建链接的模糊记录

时间:2017-12-06 14:39:40

标签: sql oracle

如果我有一组记录:

ID  DESCRIPTION     VALUE
1    HORSE JOCKEY     200      
2    HORSE JOCK       300
3    SOCKS            50
4    HORSE JOCKE      200

我已经运行了一个字符串距离函数,只包括一对高概率匹配

ID1  DESCRIPTION1    VALUE1  ID2  DESCRIPTION2     VALUE2  STRING_DISTANCE
 1    HORSE JOCKEY     200    2    HORSE JOCK        300        95
 1    HORSE JOCKEY     200    4    HORSE JOCKE       200        97
 2    HORSE JOCK       300    4    HORSE JOCKE       200        98

如何将这些记录链接起来创建一个唯一的ID(比如10匹马赛马),假设上述所有内容都应该是HORSE JOCKEY但是有拼写错误? Similair到下面:

ID1  DESCRIPTION1    VALUE1  ID2  DESCRIPTION2     VALUE2  STRING_DISTANCE UNIQUE_ID   
 1    HORSE JOCKEY     200    2    HORSE JOCK        300        95             10
 1    HORSE JOCKEY     200    4    HORSE JOCKE       200        97             10
 2    HORSE JOCK       300    4    HORSE JOCKE       200        98             10

最终寻找

ID  DESCRIPTION     VALUE   UNIQUE_ID
1    HORSE JOCKEY     200      10
2    HORSE JOCK       300      10
3    SOCKS            50       11
4    HORSE JOCKE      200      10

1 个答案:

答案 0 :(得分:2)

返回的唯一ID将是组中具有最长字符串长度的项目的ID(因为其他字符串看起来是该字符串的子字符串),并且如果有多个最长的描述,那么使用这些ID的最小值:

SQL Fiddle

Oracle 11g R2架构设置

CREATE TABLE your_table ( ID, DESCRIPTION, VALUE ) AS
SELECT 1, 'HORSE JOCK', 300 FROM DUAL UNION ALL
SELECT 2, 'HORSE JOCKEY', 200 FROM DUAL UNION ALL
SELECT 3, 'SOCKS', 50 FROM DUAL UNION ALL
SELECT 4, 'HORSE JOCKE', 200 FROM DUAL;

查询1

SELECT id,
       description,
       value,
       MIN( CONNECT_BY_ROOT( id ) )
         KEEP ( DENSE_RANK LAST ORDER BY LENGTH( CONNECT_BY_ROOT( description ) ) )
         AS unique_id
FROM   your_table t
CONNECT BY NOCYCLE
       UTL_MATCH.JARO_WINKLER ( PRIOR description,  description ) >= 0.95
GROUP BY
       id,
       description,
       value

<强> Results

| ID |  DESCRIPTION | VALUE | UNIQUE_ID |
|----|--------------|-------|-----------|
|  1 |   HORSE JOCK |   300 |         2 |
|  2 | HORSE JOCKEY |   200 |         2 |
|  3 |        SOCKS |    50 |         3 |
|  4 |  HORSE JOCKE |   200 |         2 |