Match two columns if all the words in one are contained in the other

Time: 2020-01-12 06:40:58

Tags: sql sql-server fuzzy-logic

I am trying to link 2 columns (in 2 separate tables) so that they match if every word in one column is contained in the other.

For example, the following values should match:

Paul Smith|Paul Andrew Smith
Paul Smith|Paul Andrew William Smith
Paul William Smith|Paul Andrew William Smith
Paul Andrew Smith|Paul Smith

But the following should not match:

Paul William Smith|Paul Andrew Smith
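
To pin the rule down independently of any SQL, the examples above can be restated as a set-containment test. The sketch below is in Python rather than T-SQL, purely as a reference check. The function name `names_match`, splitting on whitespace, and the case-sensitive comparison are all assumptions made for illustration; in SQL Server the comparison would follow the column collation, which may be case-insensitive.

```python
def names_match(a: str, b: str) -> bool:
    """Return True if every word of one name also appears in the other."""
    words_a = set(a.split())
    words_b = set(b.split())
    # Containment in either direction counts as a match.
    # Note: an empty name would match everything under this definition.
    return words_a <= words_b or words_b <= words_a


if __name__ == "__main__":
    pairs = [
        ("Paul Smith", "Paul Andrew Smith"),
        ("Paul Smith", "Paul Andrew William Smith"),
        ("Paul William Smith", "Paul Andrew William Smith"),
        ("Paul Andrew Smith", "Paul Smith"),
        ("Paul William Smith", "Paul Andrew Smith"),
    ]
    for a, b in pairs:
        print(f"{a} | {b} -> {names_match(a, b)}")
```

Under these assumptions the first four pairs match and the last one does not, because "William" is missing from one side and "Andrew" from the other.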

I am using SQL Server 2016.

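I would like to do this with a SELECT query. I have a rough idea: use the STRING_SPLIT function (splitting on spaces), cross apply the 2 tables, and then use the MAX function. However, that would create millions of rows even when I am only dealing with a few thousand names, so it is not efficient.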

Sample data:

DROP TABLE IF EXISTS #TEMP1
DROP TABLE IF EXISTS #TEMP2

CREATE TABLE #TEMP1 (NAME NVARCHAR(300))
CREATE TABLE #TEMP2 (NAME NVARCHAR(300))

INSERT #TEMP1 SELECT 'Paul Smith'
INSERT #TEMP1 SELECT 'Amy Nicholas Stanton'
INSERT #TEMP1 SELECT 'Andrew James Thomas'

INSERT #TEMP2 SELECT 'Paul Andrew Smith'
INSERT #TEMP2 SELECT 'Amy Stanton'
INSERT #TEMP2 SELECT 'Andrew Marcus Thomas'

So, from the sample data, the first 2 rows should match and the 3rd should not.

EDIT: I have put my rough idea into practice. The solution below works but, as I expected, it is really slow when the tables contain thousands of rows.

SELECT DISTINCT A.[FIRSTNAME],A.[SECONDNAME]
FROM (
    SELECT *
          ,MIN([FIRSTMATCH]) OVER(PARTITION BY [SRN],[FIRSTNAME]) [FM]
          ,MIN([SECONDMATCH]) OVER(PARTITION BY [FRN],[SECONDNAME]) [SM]
    FROM (
            SELECT  DISTINCT A.NAME [FIRSTNAME]
                            ,B.NAME [SECONDNAME]
                            ,A.value [FIRSTVAL]
                            ,MAX(IIF(A.VALUE=B.VALUE,1,0)) OVER(PARTITION BY A.VALUE,B.RN) [FIRSTMATCH]
                            ,B.value [SECONDVAL]
                            ,MAX(IIF(B.VALUE=A.VALUE,1,0)) OVER(PARTITION BY B.VALUE,A.RN)  [SECONDMATCH]
                            ,A.RN [FRN]
                            ,B.RN [SRN]
            FROM (
                    SELECT DISTINCT NAME, DENSE_RANK() OVER(ORDER BY NAME) [RN],value
                    FROM #TEMP1
                    CROSS APPLY STRING_SPLIT(LTRIM(RTRIM(NAME)),' ')
                    WHERE LTRIM(RTRIM(NAME)) !=''
            )A
            CROSS APPLY(
                    SELECT DISTINCT NAME, DENSE_RANK() OVER(ORDER BY NAME) [RN],value
                    FROM #TEMP2
                    CROSS APPLY STRING_SPLIT(LTRIM(RTRIM(NAME)),' ')
                    WHERE LTRIM(RTRIM(NAME)) !=''
            )B 
    )A
)A
WHERE A.SM = 1 OR A.FM = 1

2 Answers:

Answer 0: (score: 1)

You can split the strings and aggregate. This assumes that no name contains a duplicated part:

with n1 as (
      select temp1.name, value as part, count(value) over (partition by name) as num_parts
      from temp1 cross apply
           string_split(temp1.name, ' ')
     ),
     n2 as (
      select temp2.name, value as part, count(value) over (partition by name) as num_parts
      from temp2 cross apply
           string_split(temp2.name, ' ') 
     )
select n1.name, n2.name
from n1 join
     n2
     on n1.part = n2.part and n1.num_parts <= n2.num_parts
group by n1.name, n2.name, n1.num_parts
having count(*) = n1.num_parts;

Here is a db<>fiddle.
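
As a rough cross-check of what this join-and-count query computes, here is a Python sketch that mirrors it step by step. It is not the answer's code: the helper names `split_parts` and `one_way_matches` are hypothetical, and the lists stand in for the `temp1` and `temp2` tables.

```python
from collections import Counter


def split_parts(names):
    """Build (name, part, num_parts) rows, like the n1 and n2 CTEs."""
    rows = []
    for name in names:
        parts = name.split(" ")
        for part in parts:
            rows.append((name, part, len(parts)))
    return rows


def one_way_matches(names1, names2):
    """Pairs where the number of joined rows equals the part count of the first name."""
    hits = Counter()
    for name1, part1, num1 in split_parts(names1):
        for name2, part2, num2 in split_parts(names2):
            # JOIN condition: same part, and n1 has no more parts than n2.
            if part1 == part2 and num1 <= num2:
                hits[(name1, name2, num1)] += 1
    # HAVING count(*) = n1.num_parts
    return sorted((a, b) for (a, b, num1), n in hits.items() if n == num1)


if __name__ == "__main__":
    temp1 = ["Paul Smith", "Amy Nicholas Stanton", "Andrew James Thomas"]
    temp2 = ["Paul Andrew Smith", "Amy Stanton", "Andrew Marcus Thomas"]
    print(one_way_matches(temp1, temp2))
```

Two things stand out in this sketch. Because of the `num_parts <= num_parts` condition, only the direction "first name contained in second name" is tested, so 'Amy Nicholas Stanton' and 'Amy Stanton' are not paired. The no-duplicates assumption also matters: a repeated part on the second side is counted twice, so 'Paul Smith' would be paired with 'Paul Paul Andrew' even though 'Smith' is absent.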

Answer 1: (score: 0)

Building on Gordon Linoff's answer, this seems to work:

;WITH N1 AS (
      SELECT *,COUNT(*) OVER(PARTITION BY NAME) [NUM_PARTS]
      FROM (
            SELECT DISTINCT NAME, VALUE [PART]
            FROM #TEMP1 CROSS APPLY
                 STRING_SPLIT(#TEMP1.NAME, ' ')
           )A
     ),
     N2 AS (
      SELECT *,COUNT(*) OVER(PARTITION BY NAME) [NUM_PARTS]
      FROM (
            SELECT DISTINCT NAME, VALUE [PART]
            FROM #TEMP2 CROSS APPLY
                 STRING_SPLIT(#TEMP2.NAME, ' ')
           )A 
     )
SELECT N1.NAME, N2.NAME
FROM N1 JOIN N2 ON N1.PART = N2.PART
GROUP BY N1.NAME, N2.NAME, N1.NUM_PARTS, N2.NUM_PARTS
HAVING COUNT(N2.PART) = N1.NUM_PARTS
OR     COUNT(N1.PART) = N2.NUM_PARTS
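
For comparison, the logic of this query (distinct parts, a join on equal parts, and a count check in either direction) can be sketched in Python as below. This is only an illustration under assumptions: `two_way_matches` is a hypothetical name, the lists stand in for `#TEMP1` and `#TEMP2`, and the dictionary index plays the role that the join on `PART` plays in the query.

```python
from collections import defaultdict


def two_way_matches(names1, names2):
    """Pairs where all distinct parts of either name occur in the other."""
    # DISTINCT name/part pairs, as in the inner subqueries.
    parts1 = {name: set(name.split(" ")) for name in names1}
    parts2 = {name: set(name.split(" ")) for name in names2}

    # Index the second list by part so only names sharing a part get paired.
    by_part = defaultdict(set)
    for name2, parts in parts2.items():
        for part in parts:
            by_part[part].add(name2)

    # Count shared parts per (name1, name2) pair, like GROUP BY with COUNT.
    shared = defaultdict(int)
    for name1, parts in parts1.items():
        for part in parts:
            for name2 in by_part.get(part, ()):
                shared[(name1, name2)] += 1

    # HAVING: the shared count equals the part count of either side.
    return sorted(
        (a, b)
        for (a, b), n in shared.items()
        if n == len(parts1[a]) or n == len(parts2[b])
    )


if __name__ == "__main__":
    temp1 = ["Paul Smith", "Amy Nicholas Stanton", "Andrew James Thomas"]
    temp2 = ["Paul Andrew Smith", "Amy Stanton", "Andrew Marcus Thomas"]
    print(two_way_matches(temp1, temp2))
```

On the sample data this pairs the first two rows and leaves the third unmatched, which is the result the question asks for. Pairs that share no part at all never reach the count check, which is why the work is smaller than a full cross product when names have few words in common.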