Find an exemplar based on address similarity

Asked: 2019-11-07 13:50:57

Tags: sql sql-server tsql

I have a dataset with three fields: the address, the digits stripped from the address, and the letters stripped from the address.

IF OBJECT_ID ('tempdb..#addresses') IS NOT NULL
DROP TABLE #addresses

create table #addresses (
    address_numbers varchar(50),
    address_all varchar(100),
    address_letters varchar(100)
)

insert into #addresses
values ('12345678','123 Something Rd, Somewhere NY 45678', 'SOMETHINGRDSOMEWHERENY'),
       ('12345678','123 Something Road, Somewhere NY 45678', 'SOMETHINGROADSOMEWHERENY'),
       ('23445678','234 Something Road, Somewhere NY 45678', 'SOMETHINGROADSOMEWHERENY')

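I want to find groups of similar addresses among rows that share the same stripped digits. I know how to compute the similarity between two text strings...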

select *
from #addresses a
left outer join #addresses b on a.address_numbers = b.address_numbers and MDS_DB.MDQ.SIMILARITY(a.address_letters ,b.address_letters , 2, 0, .90) >= .90

...but I am not sure how to assign an exemplar/grouping code to each address in the original data. The desired result looks like this:

IF OBJECT_ID ('tempdb..#addresses_desired_result') IS NOT NULL
DROP TABLE #addresses_desired_result

create table #addresses_desired_result (
    address_numbers varchar(50),
    address_all varchar(100),
    address_letters varchar(100),
    address_group varchar(100)
)

insert into #addresses_desired_result
values ('12345678','123 Something Rd, Somewhere NY 45678', 'SOMETHINGRDSOMEWHERENY', '123 Something Rd, Somewhere NY 45678'),
       ('12345678','123 Something Road, Somewhere NY 45678', 'SOMETHINGROADSOMEWHERENY', '123 Something Rd, Somewhere NY 45678'),
       ('23445678','234 Something Road, Somewhere NY 45678', 'SOMETHINGROADSOMEWHERENY', '234 Something Road, Somewhere NY 45678')

select *
from #addresses_desired_result

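address_group can be one of the addresses in the group, or just an integer. The goal is to join the distinct list of addresses and exemplars back to a larger transactional table and group records by exemplar/group number.

How can I assign an exemplar address/group number to each group of similar addresses within the same stripped digits?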

1 Answer:

Answer 0: (score: 1)

To clarify, here is one approach. It adds an id column to the table:

IF OBJECT_ID ('tempdb..#addresses') IS NOT NULL
DROP TABLE #addresses

create table #addresses (
    id int identity(1,1),
    address_numbers varchar(50),
    address_all varchar(100),
    address_letters varchar(100)
)

insert into #addresses
values ('12345678','123 Something Rd, Somewhere NY 45678', 'SOMETHINGRDSOMEWHERENY'),
       ('12345678','123 Something Road, Somewhere NY 45678', 'SOMETHINGROADSOMEWHERENY'),
       ('23445678','234 Something Road, Somewhere NY 45678', 'SOMETHINGROADSOMEWHERENY')


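-- For each address A, rank every similar address B (A itself included)
-- and keep the top-ranked B as the exemplar of A's group.
select A.address_numbers, A.address_all, A.address_letters,
  isnull(B.address_all, A.address_all) as address_group
from #addresses A
left join
(
select A.id, B.address_all,
  -- partition by A.id so that every address gets its own RowNr = 1;
  -- commas are replaced with spaces so that 'Rd,' still matches '% rd %';
  -- B.id breaks ties so the choice is deterministic
  row_number() over(partition by A.id
    order by case when replace(B.address_all, ',', ' ') + ' ' like '% rd %' then 1 when replace(B.address_all, ',', ' ') + ' ' like '% road %' then 2 else 3 end,
      case when replace(B.address_all, ',', ' ') + ' ' like '% st %' then 1 when replace(B.address_all, ',', ' ') + ' ' like '% street %' then 2 else 3 end,
      B.id) AS RowNr
from #addresses A
  cross join #addresses B
  where left(A.address_all, 5) = left(B.address_all, 5)  --place similarity function here
) B on A.id = B.id and B.RowNr = 1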

Result:

address_numbers address_all                             address_letters             address_group
12345678        123 Something Rd, Somewhere NY 45678    SOMETHINGRDSOMEWHERENY      123 Something Rd, Somewhere NY 45678
12345678        123 Something Road, Somewhere NY 45678  SOMETHINGROADSOMEWHERENY    123 Something Rd, Somewhere NY 45678
23445678        234 Something Road, Somewhere NY 45678  SOMETHINGROADSOMEWHERENY    234 Something Road, Somewhere NY 45678   
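
If an integer group number is preferred over an exemplar string, one option is to rank the distinct exemplars. The following is only a sketch: it reads from the #addresses_desired_result table defined in the question as stand-in input, and the address_group_id column name is made up for illustration.

```sql
-- Sketch: derive an integer group number from the exemplar column.
-- Input is #addresses_desired_result from the question; address_group_id
-- is a hypothetical column name.
select address_numbers, address_all, address_letters, address_group,
       -- rows sharing the same (address_numbers, address_group) pair get the same number
       dense_rank() over (order by address_numbers, address_group) as address_group_id
from #addresses_desired_result
```

With the three sample rows, the two '12345678' addresses share one group number and the '23445678' address gets the next one.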

I used left(address_all, 5) as a stand-in for the similarity function, but you can substitute any calculation you like.
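
As a rough sketch of how the real function might be plugged in, the query below reuses the MDS_DB.MDQ.SIMILARITY call from the question with the same arguments and also requires matching address_numbers. It assumes Master Data Services is installed in a database named MDS_DB, that #addresses has the id column added above, and it picks the lowest id among the similar addresses as the exemplar. The names ranked, exemplar_id, exemplar_address and address_group_id are made up for illustration.

```sql
-- Sketch only: requires MDS_DB.MDQ.SIMILARITY (Master Data Services) and the
-- #addresses table with the id column from the answer above.
;with ranked as (
    select A.id,
           B.id          as exemplar_id,
           B.address_all as exemplar_address,
           -- the lowest id among the similar addresses becomes the exemplar
           row_number() over (partition by A.id order by B.id) as RowNr
    from #addresses A
    join #addresses B
      on A.address_numbers = B.address_numbers
     and MDS_DB.MDQ.SIMILARITY(A.address_letters, B.address_letters, 2, 0, .90) >= .90
)
select A.address_numbers, A.address_all, A.address_letters,
       isnull(R.exemplar_address, A.address_all) as address_group,
       isnull(R.exemplar_id, A.id)               as address_group_id
from #addresses A
left join ranked R on R.id = A.id and R.RowNr = 1
```

One caveat with this design: similarity is not necessarily transitive. If A is similar to B and B is similar to C, but A is not similar to C, the three rows may not all end up with the same exemplar, so the output is worth spot-checking on real data.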