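I have the following data:
Table 1: table1 contains only a few records.
Table 2: table2 contains 50 million rows.
Requirement: I need to match any string column of table1 against table2, for example the name column against the name column, and get the match percentage. (Note: the column can be any string column; it could be an address or any other column that holds multiple words in a single cell.)
Sample data: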
create table table1(id int, name varchar(100), address varchar(200));
insert into table1 values(1,'Mario Speedwagon','H No 10 High Street USA');
insert into table1 values(2,'Petey Cruiser Jack','#1 Church Street UK');
insert into table1 values(3,'Anna B Sthesia','#101 No 1 B Block UAE');
insert into table1 values(4,'Paul A Molive','Main Road 12th Cross H No 2 USA');
insert into table1 values(5,'Bob Frapples','H No 20 High Street USA');
create table table2(name varchar(100), address varchar(200), email varchar(100));
insert into table2 values('Speedwagon Mario ','USA, H No 10 High Street','mario@gmail.com');
insert into table2 values('Cruiser Petey Jack','UK #1 Church Street','jack@gmail.com');
insert into table2 values('Sthesia Anna','UAE #101 No 1 B Block','Aanna@gmail.com');
insert into table2 values('Molive Paul','USA Main Road 12th Cross H No 2','APaul@gmail.com');
insert into table2 values('Frapples Bob ','USA H No 20 High Street','BobF@gmail.com');
Expected result:
tbl1_Name            tbl2_Name            Percentage
--------------------------------------------------------
Mario Speedwagon     Speedwagon Mario     100
Petey Cruiser Jack   Cruiser Petey Jack   100
Anna B Sthesia       Sthesia Anna         around 80+
Paul A Molive        Molive Paul          around 80+
Bob Frapples         Frapples Bob         100
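Note: the data above is only a sample. In the real scenario, table1 has a few records and table2 has 50 million rows.
My attempt:
Step 1: Following Shnugo's suggestion, normalize the data and store it in the same table.
For table1: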
ALTER TABLE table1 ADD Name_Normal VARCHAR(1000);
GO
--00:00:00 (5 row(s) affected)
UPDATE table1
SET Name_Normal=CAST('<x>' + REPLACE((SELECT LOWER(name) AS [*] FOR XML PATH('')),' ','</x><x>') + '</x>' AS XML)
.query(N'
for $fragment in distinct-values(/x/text())
order by $fragment
return $fragment
').value('.','nvarchar(1000)');
GO
For table2:
ALTER TABLE table2 ADD Name_Normal VARCHAR(1000);
GO
--01:59:03 (50000000 row(s) affected)
UPDATE table2
SET Name_Normal=CAST('<x>' + REPLACE((SELECT LOWER(name) AS [*] FOR XML PATH('')),' ','</x><x>') + '</x>' AS XML)
.query(N'
for $fragment in distinct-values(/x/text())
order by $fragment
return $fragment
').value('.','nvarchar(1000)');
GO
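As a quick sanity check, the normalization expression from Step 1 can be previewed on a few literal values before running the two-hour UPDATE. This is only a sketch: it reuses the exact expression from the UPDATE statements above, and the three input strings are taken from the sample data. The derived table alias v is an illustrative assumption, not part of the original schema.

```sql
-- Preview the normalized form (lowercased, de-duplicated, alphabetically
-- sorted words) for a few sample names without altering any table.
SELECT v.name,
       CAST('<x>' + REPLACE((SELECT LOWER(v.name) AS [*] FOR XML PATH('')), ' ', '</x><x>') + '</x>' AS XML)
       .query(N'
           for $fragment in distinct-values(/x/text())
           order by $fragment
           return $fragment
       ').value('.', 'nvarchar(1000)') AS Name_Normal
FROM (VALUES ('Mario Speedwagon'),
             ('Speedwagon Mario '),
             ('Anna B Sthesia')) AS v(name);
```

If the first two rows come back with the same Name_Normal value, word order and the trailing space are being neutralized as intended.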
Step 2: Create a percentage-calculation function based on "Levenshtein distance in Microsoft SQL Server".
Step 3: Query to get the matching percentage.
--00:00:33 (23456 row(s) affected)
SELECT t.name AS [tbl1_Name],t1.name AS [tbl2_Name],
dbo.ufn_Levenshtein(t.Name_Normal,t1.Name_Normal) percentage
into #TempTable
FROM table2 t
INNER JOIN table1 t1
ON CHARINDEX(SOUNDEX(t.Name_Normal),SOUNDEX(t1.Name_Normal))>0
--00:00:00 (23456 row(s) affected)
SELECT *
FROM #TempTable
WHERE percentage >= 50
order by percentage desc;
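The query above relies on dbo.ufn_Levenshtein, whose body is not included in the post. For readers who want something to experiment with, here is a minimal sketch of what such a function might look like. It is not the asker's function: the name dbo.ufn_LevenshteinPercent_Sketch is hypothetical, and the percentage formula (100 minus the edit distance scaled by the longer string's length) is an assumption. Inline DECLARE initialization assumes SQL Server 2008 or later.

```sql
-- Hypothetical sketch, NOT the dbo.ufn_Levenshtein used in the question.
CREATE FUNCTION dbo.ufn_LevenshteinPercent_Sketch (@s NVARCHAR(1000), @t NVARCHAR(1000))
RETURNS INT
AS
BEGIN
    -- LEN ignores trailing spaces, so trailing spaces do not count as characters.
    DECLARE @sLen INT = LEN(@s), @tLen INT = LEN(@t);

    -- Empty inputs: two empty strings match fully, otherwise there is no match.
    IF @sLen = 0 OR @tLen = 0
        RETURN CASE WHEN @sLen = @tLen THEN 100 ELSE 0 END;

    DECLARE @i INT = 1, @j INT = 1, @c INT = 0, @cTemp INT, @sChar NCHAR(1);
    -- @prev is the previous row of the edit-distance matrix, packed as
    -- 2-byte integers: cost k lives at bytes 2k+1 .. 2k+2.
    DECLARE @prev VARBINARY(8000) = 0x0000, @cur VARBINARY(8000);

    -- First row: turning an empty prefix into the first j characters costs j.
    WHILE @j <= @tLen
    BEGIN
        SET @prev = @prev + CAST(@j AS BINARY(2));
        SET @j = @j + 1;
    END;

    WHILE @i <= @sLen
    BEGIN
        SET @sChar = SUBSTRING(@s, @i, 1);
        SET @c = @i;                          -- column 0 of the current row
        SET @cur = CAST(@i AS BINARY(2));
        SET @j = 1;
        WHILE @j <= @tLen
        BEGIN
            SET @c = @c + 1;                  -- insertion: left neighbor + 1
            -- substitution: diagonal neighbor + 0 if characters match, else 1
            -- (character equality follows the database collation)
            SET @cTemp = CAST(SUBSTRING(@prev, 2 * @j - 1, 2) AS INT)
                       + CASE WHEN @sChar = SUBSTRING(@t, @j, 1) THEN 0 ELSE 1 END;
            IF @cTemp < @c SET @c = @cTemp;
            -- deletion: upper neighbor + 1
            SET @cTemp = CAST(SUBSTRING(@prev, 2 * @j + 1, 2) AS INT) + 1;
            IF @cTemp < @c SET @c = @cTemp;
            SET @cur = @cur + CAST(@c AS BINARY(2));
            SET @j = @j + 1;
        END;
        SET @prev = @cur;
        SET @i = @i + 1;
    END;

    -- Assumed formula: scale the distance by the longer string's length.
    DECLARE @maxLen INT = CASE WHEN @sLen > @tLen THEN @sLen ELSE @tLen END;
    RETURN 100 - (100 * @c) / @maxLen;
END;
GO

-- Usage on two normalized sample names.
SELECT dbo.ufn_LevenshteinPercent_Sketch(N'anna b sthesia', N'anna sthesia') AS percentage;
```

A scalar function like this is evaluated once per joined row, so on large inputs its cost is dominated by how many candidate pairs the join condition lets through.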
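Conclusion: I am getting the expected result, but as noted in the comments in the queries above, normalizing table2 takes about 2 hours. Is there any suggestion for better optimizing step 1 for table2?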
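Answer 0: (score: 0)
Have you tried looking into DQS (Data Quality Services)? Depending on your SQL Server version, it is included with the installation files. https://docs.microsoft.com/en-us/sql/data-quality-services/data-matching?view=sql-server-2017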