Splitting and comparing names

Time: 2014-06-24 05:24:25

Tags: sql-server

I have a situation where I need to do the following:

Company name:

a. Split the text before and after " - "
b. Generate a report where the text before and after " - " matches exactly
c. Generate a report where the text before and after " - " is similar but not identical

I can get as far as point b. Using the following, I get the rows whose first half and second half are identical (e.g. abc, inc. - abc, inc.):

RTRIM(substring(c.companyname,0,charindex('-',c.companyname)))= LTRIM(substring(c.companyname, charindex('-',c.companyname,0)+1, len(c.companyname)))   

However, I can't work out the next report (e.g. abc. - abc OR abc, inc - abc).

Can anyone help?

1 Answer:

Answer 0: (score: 0)

Try this?

DECLARE @CompanyNames TABLE (
    CompanyName VARCHAR(512));
INSERT INTO @CompanyNames VALUES ('Walt Disney - Walt Disney');
INSERT INTO @CompanyNames VALUES ('Fun Food - Fun Food');
INSERT INTO @CompanyNames VALUES ('Fun Food, Inc. - Fun Food');
INSERT INTO @CompanyNames VALUES ('Walt Disney - Walt Disney, Inc.');

--Split names
DECLARE @SplitNames TABLE (
    MatchLeft VARCHAR(512),   -- same width as CompanyName so long names are not truncated
    MatchRight VARCHAR(512));
INSERT INTO 
    @SplitNames 
SELECT  
    RTRIM(SUBSTRING(CompanyName, 0, CHARINDEX('-', CompanyName))),
    LTRIM(SUBSTRING(CompanyName, CHARINDEX('-', CompanyName, 0) + 1, LEN(CompanyName)))
FROM
    @CompanyNames;

--Exact matches
SELECT 
    MatchLeft,
    MatchRight,
    CASE WHEN MatchLeft = MatchRight THEN 1 ELSE 0 END AS Exact
FROM 
    @SplitNames;

--Inexact matches
WITH CleansedCompanyNames AS (
    SELECT
        MatchLeft AS OriginalMatchLeft,
        MatchRight AS OriginalMatchRight,
        REPLACE(REPLACE(REPLACE(MatchLeft, '.', ''), 'Inc', ''), ',', '') AS MatchLeft,
        REPLACE(REPLACE(REPLACE(MatchRight, '.', ''), 'Inc', ''), ',', '') AS MatchRight
    FROM
        @SplitNames)
SELECT 
    OriginalMatchLeft,
    OriginalMatchRight,
    MatchLeft,
    MatchRight,
    CASE WHEN MatchLeft = MatchRight THEN 1 ELSE 0 END AS Inexact
FROM 
    CleansedCompanyNames;

--Using SOUNDEX (DIFFERENCE compares the SOUNDEX codes: 0 = least similar, 4 = most similar)
SELECT 
    MatchLeft,
    MatchRight,
    CASE WHEN DIFFERENCE(MatchLeft, MatchRight) >= 3 THEN 1 ELSE 0 END AS Score
FROM 
    @SplitNames;

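There are two ideas here for handling inexact matches:

  • Strip punctuation and unwanted words before matching (though this requires building a list of things to replace); or
  • Use SOUNDEX to test how similar the strings are.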
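For the first idea, the replacement list can live in a table instead of nested REPLACE calls. This is only a rough sketch: the @Noise table, its tokens, the @Cleansed working table and the third sample row are made up for illustration, so adjust them to your own data.

```sql
-- Sample halves, already split. The third row is a hypothetical non-match.
DECLARE @Names TABLE (
    MatchLeft VARCHAR(512),
    MatchRight VARCHAR(512));
INSERT INTO @Names VALUES ('Fun Food, Inc.', 'Fun Food');
INSERT INTO @Names VALUES ('Walt Disney', 'Walt Disney, Inc.');
INSERT INTO @Names VALUES ('Fun Food', 'Walt Disney');

-- Hypothetical replacement list: every token here is stripped before comparing.
DECLARE @Noise TABLE (
    Id INT IDENTITY(1, 1) PRIMARY KEY,
    Token VARCHAR(32) NOT NULL);
INSERT INTO @Noise (Token) VALUES ('.');
INSERT INTO @Noise (Token) VALUES (',');
INSERT INTO @Noise (Token) VALUES ('Inc');
INSERT INTO @Noise (Token) VALUES ('Ltd');

-- Working copy that keeps the originals next to the cleansed values.
DECLARE @Cleansed TABLE (
    OriginalLeft VARCHAR(512),
    OriginalRight VARCHAR(512),
    CleanLeft VARCHAR(512),
    CleanRight VARCHAR(512));
INSERT INTO @Cleansed
SELECT MatchLeft, MatchRight, MatchLeft, MatchRight
FROM @Names;

-- Apply each token in turn to every row.
DECLARE @Id INT, @MaxId INT, @Token VARCHAR(32);
SET @Id = 1;
SELECT @MaxId = MAX(Id) FROM @Noise;
WHILE @Id <= @MaxId
BEGIN
    SELECT @Token = Token FROM @Noise WHERE Id = @Id;
    UPDATE @Cleansed
    SET CleanLeft = REPLACE(CleanLeft, @Token, ''),
        CleanRight = REPLACE(CleanRight, @Token, '');
    SET @Id = @Id + 1;
END;

-- Compare the cleansed halves; trim whatever spaces the replacements left behind.
SELECT
    OriginalLeft,
    OriginalRight,
    CASE WHEN LTRIM(RTRIM(CleanLeft)) = LTRIM(RTRIM(CleanRight)) THEN 1 ELSE 0 END AS Inexact
FROM
    @Cleansed;
```

With this sample data the first two rows should come out as 1 and the third as 0. Bear in mind that REPLACE works on substrings, so under a case-insensitive collation a token like 'Inc' would also be stripped out of a name such as 'Lincoln'.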

Or, to stick with your original example, you could apply SOUNDEX to it like this:

SELECT ...
WHERE
DIFFERENCE(RTRIM(substring(c.companyname,0,charindex('-',c.companyname))), LTRIM(substring(c.companyname, charindex('-',c.companyname,0)+1, len(c.companyname)))) >= 3

Using your latest example:

DECLARE @Company TABLE (
    companyname VARCHAR(500));
INSERT INTO @Company VALUES ('Allen Limited - Allen Corporation');
INSERT INTO @Company VALUES ('Sweden Corp. - Sweden Corp.');
INSERT INTO @Company VALUES ('Alaska Limited - Alaska Limited, Inc.');
INSERT INTO @Company VALUES ('New York Inc. - New York Steel Limited');
INSERT INTO @Company VALUES ('India Plc - India Plc.');
INSERT INTO @Company VALUES ('Dubai International - Dubai International');
INSERT INTO @Company VALUES ('Nigera Falls Pvt. Ltd. - Amazing Nigeria Falls');
SELECT
    c.companyname,
    DIFFERENCE(RTRIM(SUBSTRING(c.companyname, 0, CHARINDEX('-', c.companyname))), LTRIM(SUBSTRING(c.companyname, CHARINDEX('-', c.companyname, 0) + 1, LEN(c.companyname)))) AS Similarity
FROM
    @Company c;

Results:

companyname                                      Similarity
Allen Limited - Allen Corporation                4
Sweden Corp. - Sweden Corp.                      4
Alaska Limited - Alaska Limited, Inc.            4
New York Inc. - New York Steel Limited           4
India Plc - India Plc.                           4
Dubai International - Dubai International        4
Nigera Falls Pvt. Ltd. - Amazing Nigeria Falls   1

So it doesn't work very well for your last example, but it seems to be fine for the others?
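If it helps, here is one way the three steps (split, exact match, similar match) might be combined into a single query. Treat it as a sketch under a few assumptions of mine: it splits on ' - ' (with the spaces) rather than on a bare '-', so a hyphen inside a name does not break the split; it skips rows that have no separator at all; and the MatchType labels, the LeftHalf/RightHalf names and the last two sample rows are invented for illustration. The >= 3 threshold is the same one used above.

```sql
DECLARE @Company TABLE (
    companyname VARCHAR(500));
INSERT INTO @Company VALUES ('Sweden Corp. - Sweden Corp.');
INSERT INTO @Company VALUES ('India Plc - India Plc.');
INSERT INTO @Company VALUES ('Nigera Falls Pvt. Ltd. - Amazing Nigeria Falls');
INSERT INTO @Company VALUES ('Acme-West - Acme-West');  -- hypothetical: hyphen inside the name
INSERT INTO @Company VALUES ('No Separator Here');      -- hypothetical: filtered out below

SELECT
    c.companyname,
    h.LeftHalf,
    h.RightHalf,
    CASE
        WHEN h.LeftHalf = h.RightHalf THEN 'exact'
        WHEN DIFFERENCE(h.LeftHalf, h.RightHalf) >= 3 THEN 'similar'
        ELSE 'different'
    END AS MatchType
FROM
    @Company c
    -- Position of the separator, or NULL when the row has none.
    CROSS APPLY (
        SELECT NULLIF(CHARINDEX(' - ', c.companyname), 0) AS Pos) p
    -- Compute both halves once so the CASE above stays readable.
    CROSS APPLY (
        SELECT
            RTRIM(LEFT(c.companyname, p.Pos - 1)) AS LeftHalf,
            LTRIM(SUBSTRING(c.companyname, p.Pos + 3, LEN(c.companyname))) AS RightHalf) h
WHERE
    p.Pos IS NOT NULL;
```

Filtering on MatchType = 'exact' or MatchType = 'similar' would then give reports b and c separately. The NULLIF is there so that a row without a separator produces NULL halves instead of passing a negative length to LEFT.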