如果我有一组记录:
ID DESCRIPTION VALUE
1 HORSE JOCKEY 200
2 HORSE JOCK 300
3 SOCKS 50
4 HORSE JOCKE 200
我已经运行了一个字符串距离函数,只包括一对高概率匹配
ID1 DESCRIPTION1 VALUE1 ID2 DESCRIPTION2 VALUE2 STRING_DISTANCE
1 HORSE JOCKEY 200 2 HORSE JOCK 300 95
1 HORSE JOCKEY 200 4 HORSE JOCKE 200 97
2 HORSE JOCK 300 4 HORSE JOCKE 200 98
如何将这些记录链接起来创建一个唯一的ID(比如10匹马赛马),假设上述所有内容都应该是HORSE JOCKEY但是有拼写错误? Similair到下面:
ID1 DESCRIPTION1 VALUE1 ID2 DESCRIPTION2 VALUE2 STRING_DISTANCE UNIQUE_ID
1 HORSE JOCKEY 200 2 HORSE JOCK 300 95 10
1 HORSE JOCKEY 200 4 HORSE JOCKE 200 97 10
2 HORSE JOCK 300 4 HORSE JOCKE 200 98 10
最终寻找
ID DESCRIPTION VALUE UNIQUE_ID
1 HORSE JOCKEY 200 10
2 HORSE JOCK 300 10
3 SOCKS 50 11
4 HORSE JOCKE 200 10
答案 0 :(得分:2)
返回的唯一ID将是组中具有最长字符串长度的项目的ID(因为其他字符串看起来是该字符串的子字符串),并且如果有多个最长的描述,那么使用这些ID的最小值:
Oracle 11g R2架构设置:
CREATE TABLE your_table ( ID, DESCRIPTION, VALUE ) AS
SELECT 1, 'HORSE JOCK', 300 FROM DUAL UNION ALL
SELECT 2, 'HORSE JOCKEY', 200 FROM DUAL UNION ALL
SELECT 3, 'SOCKS', 50 FROM DUAL UNION ALL
SELECT 4, 'HORSE JOCKE', 200 FROM DUAL;
查询1 :
SELECT id,
description,
value,
MIN( CONNECT_BY_ROOT( id ) )
KEEP ( DENSE_RANK LAST ORDER BY LENGTH( CONNECT_BY_ROOT( description ) ) )
AS unique_id
FROM your_table t
CONNECT BY NOCYCLE
UTL_MATCH.JARO_WINKLER ( PRIOR description, description ) >= 0.95
GROUP BY
id,
description,
value
<强> Results 强>:
| ID | DESCRIPTION | VALUE | UNIQUE_ID |
|----|--------------|-------|-----------|
| 1 | HORSE JOCK | 300 | 2 |
| 2 | HORSE JOCKEY | 200 | 2 |
| 3 | SOCKS | 50 | 3 |
| 4 | HORSE JOCKE | 200 | 2 |