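Hey, I have 2 tables with lots of columns, and I want to find the rows where table1.somecolumn contains the values from table2.someothercolumn. For example:
table1.somecolumn has Smith, Peter and
table2.someothercolumn has peter.smith
This should count as a match. How can I do a search like this?
Thanks :)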
Answer 0 (score: 2)
You can try the SOUNDEX or DIFFERENCE functions to help match string literals.
Example:
select difference('peter.green', 'Green, Peter')
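This returns 2, where:
The integer returned is the number of characters in the SOUNDEX values that are the same. The return value ranges from 0 through 4: 0 indicates weak or no similarity, and 4 indicates strong similarity or the same values.
See the SOUNDEX and DIFFERENCE topics on MSDN.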
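To make the 0 to 4 scale concrete, here is a minimal sketch. The first query compares two single words whose SOUNDEX codes are identical; the second uses DIFFERENCE as a join predicate with the table and column names from the question. The threshold of 3 is an assumption to tune, not a recommended value.

```sql
-- Both words encode to G650, so DIFFERENCE reports the maximum similarity of 4.
SELECT SOUNDEX('Green')              AS code_a,
       SOUNDEX('Greene')             AS code_b,
       DIFFERENCE('Green', 'Greene') AS similarity;

-- Hypothetical fuzzy join over the tables named in the question.
SELECT t1.somecolumn,
       t2.someothercolumn,
       DIFFERENCE(t1.somecolumn, t2.someothercolumn) AS similarity
FROM table1 AS t1
JOIN table2 AS t2
  ON DIFFERENCE(t1.somecolumn, t2.someothercolumn) >= 3;  -- assumed threshold
```

A join like this compares every row of one table with every row of the other, so it can be slow on large tables.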
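Update:
SOUNDEX and DIFFERENCE may not work well when the order of the words matters. However, if you have the full-text indexing feature installed, you can use the full-text engine's word-breaking and parsing capabilities without creating an index. Assuming you are using SQL Server 2008, the following function returns a list of normalized terms: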
SELECT * FROM sys.dm_fts_parser('"Peter Green"', 1033, 0, 0)
You can feed its result into the rest of your query with CROSS APPLY.
See the sys.dm_fts_parser topic and section K, "Using APPLY", in the FROM topic for more information.
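Before wiring the parser into a larger query, it can help to look at what it returns for a single value. The sketch below runs it on one name from the question; the columns are ones exposed by sys.dm_fts_parser, and which of them to show is only an illustrative choice.

```sql
-- Arguments: query string, LCID (1033 = US English),
-- stoplist id (0 = system stoplist), accent sensitivity (0 = insensitive).
SELECT occurrence, display_term, special_term
FROM sys.dm_fts_parser('"Smith, Peter"', 1033, 0, 0)
ORDER BY occurrence;
```

Each row is one term found by the word breaker; display_term is the readable form of that term, which is what a term-by-term comparison would use.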
Example (SQL Server 2008 Enterprise with the full-text engine enabled):
-- 'U' is the object type code for a user table
if OBJECT_ID('Names1', 'U') is not null drop table Names1
if OBJECT_ID('Names2', 'U') is not null drop table Names2
create table Names1
(
id int identity(0, 1),
name nvarchar(128)
)
insert into Names1 (name) values ('Green, Peter')
insert into Names1 (name) values ('Smith, Peter')
insert into Names1 (name) values ('Aadland, Beverly')
insert into Names1 (name) values ('Aalda, Mariann')
insert into Names1 (name) values ('Aaliyah')
insert into Names1 (name) values ('Aames, Angela')
insert into Names1 (name) values ('Aames, Willie')
insert into Names1 (name) values ('Aaron, Caroline')
insert into Names1 (name) values ('Aaron, Quinton')
insert into Names1 (name) values ('Aaron, Victor')
insert into Names1 (name) values ('Abbay, Peter')
insert into Names1 (name) values ('Abbott, Dorothy')
insert into Names1 (name) values ('Abbott, Bruce')
insert into Names1 (name) values ('Abbott, Bud')
insert into Names1 (name) values ('Abbott, Philip')
insert into Names1 (name) values ('Abdoo, Rose')
insert into Names1 (name) values ('Abdul, Paula')
insert into Names1 (name) values ('Abel, Jake')
insert into Names1 (name) values ('Abel, Walter')
insert into Names1 (name) values ('Abeles, Edward')
insert into Names1 (name) values ('Abell, Tim')
insert into Names1 (name) values ('Aber, Chuck')
create table Names2
(
id int identity(200, 1),
name nvarchar(128)
)
insert into Names2 (name) values (LOWER('Peter.Green'))
insert into Names2 (name) values (LOWER('Peter.Smith'))
insert into names2 (name) values (LOWER('Beverly.Aadland'))
insert into names2 (name) values (LOWER('Mariann.Aalda'))
insert into names2 (name) values (LOWER('Aaliyah'))
insert into names2 (name) values (LOWER('Angela.Aames'))
insert into names2 (name) values (LOWER('Willie.Aames'))
insert into names2 (name) values (LOWER('Caroline.Aaron'))
insert into names2 (name) values (LOWER('Quinton.Aaron'))
insert into names2 (name) values (LOWER('Victor.Aaron'))
insert into names2 (name) values (LOWER('Peter.Abbay'))
insert into names2 (name) values (LOWER('Dorothy.Abbott'))
insert into names2 (name) values (LOWER('Bruce.Abbott'))
insert into names2 (name) values (LOWER('Bud.Abbott'))
insert into names2 (name) values (LOWER('Philip.Abbott'))
insert into names2 (name) values (LOWER('Rose.Abdoo'))
insert into names2 (name) values (LOWER('Paula.Abdul'))
insert into names2 (name) values (LOWER('Jake.Abel'))
insert into names2 (name) values (LOWER('Walter.Abel'))
insert into names2 (name) values (LOWER('Edward.Abeles'))
insert into names2 (name) values (LOWER('Tim.Abell'))
insert into names2 (name) values (LOWER('Chuck.Aber'));
with ftsNamesFirst (id, term) as
(
select id, terms.display_term
from names1 cross apply sys.dm_fts_parser('"' + name + '"', 1033, 0, 0) terms
), ftsNamesSecond (id, term) as
(
select id, terms.display_term
from names2 cross apply sys.dm_fts_parser('"' + name + '"', 1033, 0, 0) terms
)
select * from
(
select
ROW_NUMBER() over (partition by nfirst.id order by sum(DIFFERENCE(ftsNamesFirst.term, ftsNamesSecond.term)) desc) ranking,
sum(DIFFERENCE(ftsNamesFirst.term, ftsNamesSecond.term)) Confidence,
nFirst.id Names1ID,
nFirst.name Names1Name,
nSecond.id Names2ID,
nSecond.name Names2Name
from
ftsNamesFirst cross join ftsNamesSecond
left outer join names1 nFirst on nFirst.id = ftsNamesFirst.id
left outer join names2 nSecond on nSecond.id = ftsNamesSecond.id
where DIFFERENCE(ftsNamesFirst.term, ftsNamesSecond.term) = 4
group by
nFirst.id, nFirst.name, nSecond.id, nSecond.name
) MatchedNames
where ranking = 1
Output:
For each name, only the match with the highest confidence is kept (a windowed ranking query filters out all other matches).
Confidence  Names1ID  Names1Name        Names2ID  Names2Name
8           0         Green, Peter      200       peter.green
8           1         Smith, Peter      201       peter.smith
8           2         Aadland, Beverly  202       beverly.aadland
8           3         Aalda, Mariann    203       mariann.aalda
4           4         Aaliyah           204       aaliyah
8           5         Aames, Angela     205       angela.aames
8           6         Aames, Willie     206       willie.aames
It isn't perfect, but it is a good starting point that can be tuned to improve the success rate.
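One possible tuning, sketched here as an assumption rather than as part of the original answer: keep a pair only when every distinct term of the Names1 row found a sound-alike partner in the Names2 row. It reuses the Names1 and Names2 tables and the same parser call from the example above.

```sql
-- Assumes Names1 and Names2 from the example above already exist.
-- Terminate any preceding statement with a semicolon before this WITH.
with ftsNamesFirst (id, term) as
(
    select id, terms.display_term
    from Names1 cross apply sys.dm_fts_parser('"' + name + '"', 1033, 0, 0) terms
), ftsNamesSecond (id, term) as
(
    select id, terms.display_term
    from Names2 cross apply sys.dm_fts_parser('"' + name + '"', 1033, 0, 0) terms
)
select f.id as Names1ID, s.id as Names2ID
from ftsNamesFirst f
join ftsNamesSecond s on DIFFERENCE(f.term, s.term) = 4
group by f.id, s.id
-- keep the pair only if all distinct terms of the Names1 row were matched
having count(distinct f.term) =
       (select count(distinct term) from ftsNamesFirst where id = f.id);
```

This trades recall for precision: a name with one misspelled or missing part no longer matches at all, so whether it helps depends on the data.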
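Answer 1 (score: 1)
Depending on your needs, there are several possible solutions: you can create an auxiliary table that stores the keywords for each record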
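As a rough illustration of the auxiliary-table idea, the sketch below stores one keyword per row for each source table and matches records by shared keywords. All table, column, and index names here are hypothetical, and how the keyword tables get populated is left open.

```sql
-- Hypothetical keyword tables; names and sizes are illustrative only.
CREATE TABLE Table1Keywords
(
    record_id int           NOT NULL,  -- key of the row in table1
    keyword   nvarchar(128) NOT NULL,  -- one normalized word from somecolumn
    PRIMARY KEY (record_id, keyword)
);

CREATE TABLE Table2Keywords
(
    record_id int           NOT NULL,  -- key of the row in table2
    keyword   nvarchar(128) NOT NULL,  -- one normalized word from someothercolumn
    PRIMARY KEY (record_id, keyword)
);

-- Lets the join below look keywords up directly.
CREATE INDEX IX_Table2Keywords_keyword ON Table2Keywords (keyword);

-- Pairs of records ranked by how many keywords they share.
SELECT k1.record_id AS table1_id,
       k2.record_id AS table2_id,
       COUNT(*)     AS shared_keywords
FROM Table1Keywords AS k1
JOIN Table2Keywords AS k2 ON k2.keyword = k1.keyword
GROUP BY k1.record_id, k2.record_id
ORDER BY shared_keywords DESC;
```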