Question

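I am trying to create a simple table join on columns from two tables that are equivalent but not an exact match. For example, the row value in table A might be "Georgia Production" while the corresponding row value in table B is "Georgia Independent Production Co".

I first tried a wildcard in the join, like this: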

select BOLFlatFile.*, customers.City, customers.FEIN_Registration_No, customers.ST
 from BOLFlatFile
 Left Join Customers on (customers.Name Like '%'+BOLFlatFile.Customer+'%');

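This works well for 90% of the data. However, if the string in table A does not appear exactly in table B, the join returns null. Going back to the example above, it would work if the value in table A were "Georgia Independent", but not if it were "Georgia Production".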
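To see why the wildcard join misses these rows: LIKE '%' + x + '%' only matches when x appears as one contiguous substring. A minimal check, using only the literal values from the question (no tables assumed):

```sql
-- Literal values taken from the question; the column aliases are illustrative only.
select
    case when N'Georgia Independent Production Co' like N'%' + N'Georgia Independent' + N'%'
         then 'match' else 'no match' end as contiguous_words,   -- 'match'
    case when N'Georgia Independent Production Co' like N'%' + N'Georgia Production' + N'%'
         then 'match' else 'no match' end as separated_words;    -- 'no match'
```

The second pattern fails because "Independent" sits between the two words, so the rows can only be paired by comparing word by word.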

Answer 1

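This is probably still an overly complicated way of doing it, but it works for the example I mocked up.

The first assumption is that, because you are "wildcard searching" a string from one table in another, all the words in the first table's column appear in the second table's column. That means the string in the second table's column will always be longer than the one in the first table's column.

The second assumption is that there is a unique ID on the first table. If there isn't, you can create one by using the row_number function and ordering on your string column.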
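If the first table has no key, a surrogate ID along those lines might look like the sketch below. The @source table variable and its rows are hypothetical stand-ins for your real table.

```sql
-- Hypothetical stand-in for a table that has no key column.
declare @source table (helptext nvarchar(50));

insert @source (helptext)
values (N'Text to find'), (N'Georgia Production'), (N'More to find');

-- row_number() assigns 1, 2, 3, ... in the order of the string column.
select row_number() over (order by helptext) as id, helptext
from @source;
```

Note that rows with identical strings still get distinct numbers, but which duplicate gets which number is arbitrary.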

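The approach below first creates some sample data (I have used tablea and tableb to represent your tables).

A dummy table is then created to store the unique ID and the string column of your first table.

Next, a loop walks through the strings in the dummy table and inserts the unique ID and the first part of each string, up to the first space, into a handler table. The handler table is what is used to join the two target tables together.

The last part joins the first table to the handler table on the unique ID, then joins the second table to the handler table on keywords longer than 3 letters (to skip short words such as "the"). The final condition assumes that the string in table b is longer than the one in table a, because you are looking for each word from table a in the corresponding column of table b.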

declare @tablea table (
    id int identity(1,1),
    helptext nvarchar(50)
);


declare @tableb table (
    id int identity(1,1),
    helptext nvarchar(50)
);

insert @tablea (helptext)
values
('Text to find'),
('Georgia Production'),
('More to find');

insert @tableb (helptext)
values
('Georgia Independent Production'),
('More Text to Find'),
('something Completely different'),
('Text to find');

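declare @stringtable table (
    id int,
    string nvarchar(51)  -- one wider than helptext, because a trailing space is appended below
);

declare @stringmatch table (
    id int,
    stringmatch nvarchar(50)  -- wide enough to hold any single word from helptext
);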

insert @stringtable (id, string)
select id, helptext from @tablea;

update @stringtable set string = string + ' ';

-- Peel one word per pass off the front of every remaining string.
while exists (select 1 from @stringtable)
    begin
        -- Store the leading word (everything before the first space) with its row id.
        insert @stringmatch (id, stringmatch)
        select id, rtrim(substring(string, 1, charindex(' ', string))) from @stringtable;
        -- Remove that leading word by position; replace() would also strip it from inside later words.
        update @stringtable set string = ltrim(stuff(string, 1, charindex(' ', string), ''));
        delete from @stringtable where string = '' or string is null;
    end

-- Pair each table a row with table b rows that contain any of its keywords longer than 3 letters.
-- distinct collapses the repeats produced when several keywords hit the same row.
select distinct a.*, b.* from @tablea a inner join @stringmatch m on a.id = m.id
inner join @tableb b on charindex(m.stringmatch, b.helptext) > 0
    and len(m.stringmatch) > 3
    and len(b.helptext) > len(a.helptext);
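One caveat with the join above is that it pairs rows as soon as any single keyword is found, so a common word such as "Production" can pull in unrelated rows. A rough sketch of a stricter variant, which keeps a pair only when every keyword is found, could look like this. The keywords and candidates CTEs and all of their values are hypothetical stand-ins for @stringmatch and @tableb, used here so the snippet runs on its own.

```sql
-- Hypothetical sample data; the keyword rows mimic what the loop writes to @stringmatch.
with keywords (id, word) as (
    select v.id, v.word
    from (values (1, N'Georgia'), (1, N'Production'),
                 (2, N'Acme'),    (2, N'Production')) as v (id, word)
),
candidates (helptext) as (
    select v.helptext
    from (values (N'Georgia Independent Production'),
                 (N'Acme Holdings')) as v (helptext)
)
select k.id, c.helptext
from keywords k
inner join candidates c on charindex(k.word, c.helptext) > 0
group by k.id, c.helptext
-- Keep a pair only when the number of keywords found equals the number of keywords for that id.
having count(*) = (select count(*) from keywords k2 where k2.id = k.id);
```

With this data only id 1 is paired with "Georgia Independent Production"; id 2 finds just one of its two keywords in each candidate and is dropped.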

Answer 2

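It all depends on how sophisticated you want this matching to be. There are many ways to match these strings, and some may work better than others. Below is an example of how you could split the names in your BOLFlatFile and Customers tables into separate words using string_split.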
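As a quick illustration before the full example, this is what string_split does on its own with one of the names from the question. It is a sketch that assumes SQL Server 2016 or later with database compatibility level 130 or higher, which is what string_split requires.

```sql
-- Returns one row per word in a column named value:
-- Georgia, Independent, Production, Co (row order is not guaranteed).
select value
from string_split(N'Georgia Independent Production Co', N' ');
```

Once both columns are broken into words like this, they can be compared with an ordinary equality join.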

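The example below matches any row where all of the words in the BOLFlatFile customer field appear in the Customers name field (note: it does not take the order of the words into account).

The code below matches the first two strings as expected, but not the last two sample strings.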

CREATE TABLE BOLFlatFile
(
    [customer] NVARCHAR(500)
)

CREATE TABLE Customers
(
    [name] NVARCHAR(500)
)


INSERT INTO Customers VALUES ('Georgia Independent Production Co')
INSERT INTO BOLFlatFile VALUES ('Georgia Production')
INSERT INTO Customers VALUES ('Test String 1')
INSERT INTO BOLFlatFile VALUES ('Test 1')
INSERT INTO Customers VALUES ('Test String 2')
INSERT INTO BOLFlatFile VALUES ('Test 3')

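-- STRING_SPLIT requires SQL Server 2016 or later (compatibility level 130+).
-- Split each BOLFlatFile customer into words and record how many words it contains.
;with BOLFlatFileSplit
as
(
    SELECT *,
        COUNT(*) OVER(PARTITION BY [customer]) as [WordsInName]
    FROM 
        BOLFlatFile
    CROSS APPLY 
        STRING_SPLIT([customer], ' ')
),
-- Split each Customers name into words.
CustomerSplit as 
(
    SELECT *
    FROM 
        Customers
    CROSS APPLY 
        STRING_SPLIT([name], ' ')
)
-- Join word to word, then keep the pairs where every BOLFlatFile word was found.
SELECT 
    a.Customer, 
    b.name
FROM 
    CustomerSplit b
INNER JOIN 
    BOLFlatFileSplit a
ON 
    a.value = b.value
GROUP BY 
    a.Customer, b.name
HAVING 
    COUNT(*) = MAX([WordsInName])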
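To get back to the shape of the original query (every BOLFlatFile row, with customer details where a match exists), the matched pairs can be left joined back onto the two tables. The sketch below is one way that might look. The table variables, the City values and the CTE names BolWords, CustomerWords and Matches are assumptions made so the snippet runs on its own; they are not part of your schema.

```sql
-- Hypothetical miniature versions of the two tables; City values are made up for illustration.
declare @BOLFlatFile table (customer nvarchar(500));
declare @Customers   table (name nvarchar(500), City nvarchar(100));

insert @BOLFlatFile (customer) values (N'Georgia Production'), (N'Test 3');
insert @Customers (name, City)
values (N'Georgia Independent Production Co', N'Atlanta'),
       (N'Test String 2', N'Macon');

with BolWords as (
    -- One row per word, plus the number of words in that customer string.
    select f.customer, s.value,
           count(*) over (partition by f.customer) as WordsInName
    from @BOLFlatFile f
    cross apply string_split(f.customer, N' ') s
),
CustomerWords as (
    select c.name, s.value
    from @Customers c
    cross apply string_split(c.name, N' ') s
),
Matches as (
    -- Same all-words-found rule as the query above.
    select b.customer, c.name
    from BolWords b
    inner join CustomerWords c on c.value = b.value
    group by b.customer, c.name
    having count(*) = max(b.WordsInName)
)
-- Left joins keep unmatched BOLFlatFile rows, with nulls for the customer columns.
select f.customer, c.name, c.City
from @BOLFlatFile f
left join Matches m on m.customer = f.customer
left join @Customers c on c.name = m.name;
```

Here "Georgia Production" picks up the Atlanta row, while "Test 3" comes back with nulls because only one of its two words is found.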

How do I join columns that contain strings which do not match exactly in SQL Server?
