Grouping based on match percentage

Time: 2017-05-18 12:34:55

Tags: sql-server sql-server-2008 tsql

I need to self-join a table and display its records based on a match percentage.

Name | Village
Jones Ashley, MPK
Meyer Peter, JSK
A Jones, MPK
David, ARK
Peter M, JSK
Peter M, JSK
David, ARK

select 
x.Name, 
y.Name,
dbo.matchname(x.Name, y.Name) 'match'  
from cust x, cust y where dbo.matchname(x.Name, y.Name) >= 80
and x.village = y.village

I wrote a function that takes two names and computes a match percentage. For example, Peter M and Meyer Peter score 80%.

I now want to display related records together, ordered by match percentage. For example:

Jones Ashley, MPK
A Jones, MPK
David, ARK 
David, ARK 
Peter M, JSK
Peter M, JSK
Meyer Peter, JSK

ORDER BY does not work here, because for some names the initial comes last. I need some kind of grouping, but I don't know how to do it.

1 answer:

Answer 0 (score: 1)

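I don't know what your matchname scalar function does, so I just created a generic scalar function that compares two strings and returns a small number. It is only a stand-in, not a real match percentage.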

-- (0) Prep: a matchname function
IF OBJECT_ID('dbo.matchname') IS NOT NULL DROP FUNCTION dbo.matchname; -- check the same database the DROP targets
GO
CREATE FUNCTION dbo.matchname(@string1 varchar(40), @string2 varchar(40))
RETURNS int AS
BEGIN RETURN((ABS(ASCII(@string1)+3) - (ASCII(@string2))))*7 END;

Below is some sample data and a solution. The most notable part is how I filter my CROSS JOIN:

WHERE x.someid < y.someid

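This prevents you from evaluating the same pair of records twice, for example dbo.matchname('John Smith', 'George Washington') and dbo.matchname('George Washington', 'John Smith').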
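To make the effect of that filter concrete, here is a minimal sketch that only counts candidate pairs. The three-row table variable @t and its values are hypothetical and exist purely for illustration.

```sql
-- Hypothetical three-row table, used only to count candidate pairs.
DECLARE @t TABLE (someid int identity primary key, [Name] varchar(40));
INSERT @t ([Name]) VALUES ('A'), ('B'), ('C');

-- Unfiltered CROSS JOIN: every ordered pair, including self-pairs (3 * 3 = 9).
SELECT all_pairs = COUNT(*)
FROM @t x
CROSS JOIN @t y;

-- Filtered: each unordered pair exactly once, no self-pairs (3 * 2 / 2 = 3).
SELECT distinct_pairs = COUNT(*)
FROM @t x
CROSS JOIN @t y
WHERE x.someid < y.someid;
```

Skipping the reversed pair loses nothing only if the match function is symmetric, that is, if matchname(a, b) equals matchname(b, a). That is an assumption about your function; if it does not hold, the reversed pairs have to be evaluated as well.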

Sample data and solution

-- Sample data
DECLARE @yourtable TABLE 
(
  someid int identity primary key clustered, 
  [Name] varchar(40), 
  Village varchar(10)
  -- Inline index syntax requires SQL Server 2014 or later; drop the next line on older versions.
  ,index nc_yt nonclustered([Name] ASC)
);
INSERT @yourtable ([Name], Village) 
VALUES
('Jones Ashley', 'MPK'),
('Meyer Peter', 'JSK'),
('A Jones', 'MPK'),
('David', 'ARK'),
('Peter M', 'JSK'),
('Peter M', 'JSK'),
('David', 'ARK');

-- Solution
WITH uniqueList AS
(
  select 
    rn = ROW_NUMBER() OVER 
        (
          PARTITION BY x.Name, y.Name, dbo.matchname(x.name, y.name) 
          ORDER BY (SELECT NULL)
        ),
    Name1 = x.Name,
    Name2 = y.Name,
    id1 = x.someid, id2 = y.someid,
    dbo.matchname(x.name, y.name) AS match
  from @yourtable x 
  CROSS JOIN @yourtable y
  WHERE x.someid < y.someid 
  AND dbo.matchname(x.Name, y.Name) >= 80
)
SELECT Name1, Name2, match
FROM uniqueList
WHERE rn = 1
ORDER BY match;

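Now, about scalar-valued functions... Scalar-valued user-defined functions (scalar UDFs for short) kill performance, especially the way you are using yours: the function is invoked row by row, it appears three times in the query, and on this version of SQL Server a scalar UDF also prevents a parallel plan. You can replace the scalar UDF with an inline table-valued function (iTVF) for much better performance.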

First, the new function:

IF OBJECT_ID('dbo.itvf_matchname') IS NOT NULL DROP FUNCTION dbo.itvf_matchname; -- check the same database the DROP targets
GO
CREATE FUNCTION dbo.itvf_matchname(@string1 varchar(40), @string2 varchar(40))
RETURNS TABLE WITH SCHEMABINDING AS
RETURN(SELECT match = (ABS(ASCII(@string1)+3) - (ASCII(@string2)))*7);

Now the solution (note that I commented out the original scalar UDF code):

-- sample data
DECLARE @yourtable TABLE 
(
  someid int identity primary key clustered, 
  [Name] varchar(40), 
  Village varchar(10)
  -- Inline index syntax requires SQL Server 2014 or later; drop the next line on older versions.
  ,index nc_yt nonclustered([Name] ASC)
);
INSERT @yourtable ([Name], Village) 
VALUES
('Jones Ashley', 'MPK'),
('Meyer Peter', 'JSK'),
('A Jones', 'MPK'),
('David', 'ARK'),
('Peter M', 'JSK'),
('Peter M', 'JSK'),
('David', 'ARK');

-- solution
WITH uniqueList AS
(
  select 
    rn = ROW_NUMBER() OVER 
        (
          PARTITION BY x.Name, y.Name, /*dbo.matchname(x.name, y.name)*/ itvf.match 
          ORDER BY (SELECT NULL)
        ),
    Name1 = x.Name,
    Name2 = y.Name,
    --dbo.matchname(x.name, y.name) AS match
    itvf.match
  from @yourtable x 
  CROSS JOIN @yourtable y
  -- Below: only 1 function call with results referenced multiple times
  CROSS APPLY dbo.itvf_matchname(x.Name, y.Name) itvf
  WHERE x.someid < y.someid 
  --AND dbo.matchname(x.Name, y.Name) >= 80
  AND itvf.match >= 80
)
SELECT Name1, Name2, match
FROM uniqueList
WHERE rn = 1;

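The results are the same, but performance is noticeably better. To better understand why you should replace scalar UDFs with iTVFs, let's run a 1,500-row test (the x.someid < y.someid filter leaves 1500 * 1499 / 2 = 1,124,250 pairs to evaluate, i.e. roughly a million):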

-- (3.1) Sample Data with an ID
SET NOCOUNT ON;
IF OBJECT_ID('tempdb..#yourtable') IS NOT NULL DROP TABLE #yourtable;

CREATE TABLE #yourtable 
(
  someid int identity primary key clustered, 
  [Name] varchar(40)  NOT NULL, 
  Village varchar(10) NOT NULL
);
INSERT #yourtable
SELECT TOP (1500) LEFT(REPLACE(newid(),'-',''),10), 'xxx'
FROM sys.all_columns a 
CROSS JOIN sys.all_columns b;
GO
CREATE NONCLUSTERED INDEX nc_yt ON #yourTable([Name] ASC);
GO


PRINT 'Scalar function'+char(13)+char(10)+REPLICATE('-',50);
GO
DECLARE @x bit, @st datetime2 = getdate();
WITH uniqueList AS
(
  select 
    rn = ROW_NUMBER() OVER 
        (
          PARTITION BY x.Name, y.Name, dbo.matchname(x.name, y.name) 
          ORDER BY (SELECT NULL)
        ),
    Name1 = x.Name,
    Name2 = y.Name,
    dbo.matchname(x.name, y.name) AS match
  from #yourtable x 
  CROSS JOIN #yourtable y
  WHERE x.someid < y.someid 
  AND dbo.matchname(x.Name, y.Name) >= 80
)
SELECT @x = 1
FROM uniqueList
WHERE rn = 1;
PRINT DATEDIFF(MS, @st, getdate());
GO 5

PRINT char(13)+char(10)+'ITVF (serial)'+char(13)+char(10)+REPLICATE('-',50);
GO
DECLARE @x bit, @st datetime2 = getdate();
WITH uniqueList AS
(
  select 
    rn = ROW_NUMBER() OVER 
        (
          PARTITION BY x.Name, y.Name, /*dbo.matchname(x.name, y.name)*/ itvf.match 
          ORDER BY (SELECT NULL)
        ),
    Name1 = x.Name,
    Name2 = y.Name,
    --dbo.matchname(x.name, y.name) AS match
    itvf.match
  from #yourtable x 
  CROSS JOIN #yourtable y
  -- Below: only 1 function call with results referenced multiple times
  CROSS APPLY dbo.itvf_matchname(x.Name, y.Name) itvf
  WHERE x.someid < y.someid 
  --AND dbo.matchname(x.Name, y.Name) >= 80
  AND itvf.match >= 80
)
SELECT @x = 1
FROM uniqueList
WHERE rn = 1
OPTION (MAXDOP 1);
PRINT DATEDIFF(MS, @st, getdate());
GO 5

PRINT char(13)+char(10)+'ITVF Parallel'+char(13)+char(10)+REPLICATE('-',50);
GO
DECLARE @x bit, @st datetime2 = getdate();
WITH uniqueList AS
(
  select 
    rn = ROW_NUMBER() OVER 
        (
          PARTITION BY x.Name, y.Name, /*dbo.matchname(x.name, y.name)*/ itvf.match 
          ORDER BY (SELECT NULL)
        ),
    Name1 = x.Name,
    Name2 = y.Name,
    --dbo.matchname(x.name, y.name) AS match
    itvf.match
  from #yourtable x 
  CROSS JOIN #yourtable y
  -- Below: only 1 function call with results referenced multiple times
  CROSS APPLY dbo.itvf_matchname(x.Name, y.Name) itvf
  -- dbo.make_parallel() is not defined in this answer; it must already exist in the database.
  CROSS APPLY dbo.make_parallel()
  WHERE x.someid < y.someid 
  --AND dbo.matchname(x.Name, y.Name) >= 80
  AND itvf.match >= 80
)
SELECT @x = 1
FROM uniqueList
WHERE rn = 1;
PRINT DATEDIFF(MS, @st, getdate());
GO 5

Results:

Scalar function
--------------------------------------------------
Beginning execution loop
4627
4504
4440
4457
4550
Batch execution completed 5 times.

ITVF (serial)
--------------------------------------------------
Beginning execution loop
1623
1610
1643
1640
1713
Batch execution completed 5 times.

ITVF Parallel
--------------------------------------------------
Beginning execution loop
1306
1067
1077
1127
1047
Batch execution completed 5 times.

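The iTVF-based solution is roughly 3x faster than the scalar UDF when run with a serial plan (about 1.6 s versus 4.5 s per run) and roughly 4x faster with a parallel plan (about 1.1 s per run).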
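The queries above return matching pairs; they do not yet list each record next to its matches, which is what the question asks for. One possible way to get there, shown only as a sketch: give every row a group key equal to the lowest someid among the rows it matches, then sort by that key. The @pairs table and its scores are hypothetical stand-ins for the output of the pair query (same village, id1 < id2, match >= 80), and the sketch assumes matching is symmetric and that one hop is enough (it does not chain A-B and B-C into one group unless A-C also matches).

```sql
-- Sample rows from the question.
DECLARE @cust TABLE
(
  someid int identity primary key,
  [Name] varchar(40),
  Village varchar(10)
);
INSERT @cust ([Name], Village)
VALUES
('Jones Ashley', 'MPK'), ('Meyer Peter', 'JSK'), ('A Jones', 'MPK'),
('David', 'ARK'), ('Peter M', 'JSK'), ('Peter M', 'JSK'), ('David', 'ARK');

-- Hypothetical pair scores (assumed values, not produced by a real matchname).
DECLARE @pairs TABLE (id1 int, id2 int, match int);
INSERT @pairs (id1, id2, match)
VALUES (1, 3, 80), (2, 5, 80), (2, 6, 80), (4, 7, 100), (5, 6, 100);

-- Group key = lowest someid among the row itself and its lower-numbered matches.
SELECT c.[Name], c.Village
FROM @cust c
CROSS APPLY
(
  SELECT grp = MIN(v.id)
  FROM
  (
    SELECT c.someid AS id
    UNION ALL
    SELECT p.id1 FROM @pairs p WHERE p.id2 = c.someid
  ) v
) g
ORDER BY g.grp, c.someid;
```

With these assumed scores the related names come out adjacent to each other (the two Jones rows, then the three Peter rows, then the two David rows). The order of the groups themselves follows the lowest someid in each group, so it will not necessarily match the order shown in the question.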