SQL - Similar data in a column

Time: 2016-06-10 09:06:35

Tags: sql sql-server

Is there a way to find similar values within a column? Example:

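I want the query to return the data from the table without row 4, green tree, because nothing else is similar to green tree, whereas blue car is similar to blue cars and red doll is similar to red dolly.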

How can I do this?

I am using Microsoft SQL Server Management Studio.

3 answers:

Answer 0 (score: 4)

You can do this using SOUNDEX.

Sample data:

CREATE TABLE #SampleData (Column1 int, Column2 varchar(10))
INSERT INTO #SampleData (Column1, Column2)
VALUES
(1,'blue car')
,(2,'red doll')
,(3,'blue cars')
,(4,'green tree')
,(5,'red dolly')

The following code uses SOUNDEX on Column2 to build a list of similar entries. It then uses a second subquery to count how many times each SOUNDEX value appears:

SELECT
a.GroupingField
,a.Title
,b.SimilarFields
FROM (
        SELECT
        SOUNDEX(Column2) GroupingField
        ,MAX(Column2) Title --Just return a unique title for this soundex group
        FROM #SampleData
        GROUP BY SOUNDEX(Column2)
      ) a
LEFT JOIN   (
                SELECT
                SOUNDEX(Column2) GroupingField
                ,COUNT(Column2) SimilarFields --How many fields are in the soundex group?
                FROM #SampleData
                GROUP BY SOUNDEX(Column2)
            ) b
ON a.GroupingField = b.GroupingField
WHERE b.SimilarFields > 1

The results look like this (I have left the SOUNDEX field in to show what it looks like):

GroupingField   Title       SimilarFields
B400            blue cars   2
R300            red dolly   2
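
As a possible simplification: both subqueries group on the same SOUNDEX value, so the same result can probably be produced in one pass with GROUP BY and HAVING. This is only a sketch that assumes the #SampleData table above; it is not part of the original answer.

```sql
-- Single-pass variant: one GROUP BY, with HAVING filtering out groups of one.
SELECT
    SOUNDEX(Column2) AS GroupingField,
    MAX(Column2)     AS Title,          -- one representative value per group
    COUNT(Column2)   AS SimilarFields   -- number of rows sharing the SOUNDEX code
FROM #SampleData
GROUP BY SOUNDEX(Column2)
HAVING COUNT(Column2) > 1;
```

HAVING filters after aggregation, which is why the join and the outer WHERE are no longer needed in this form.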

Further reading on SOUNDEX: https://msdn.microsoft.com/en-gb/library/ms187384.aspx
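
One caveat worth checking before relying on this: in the results above, 'blue car' and 'blue cars' both get the code B400, which only accounts for the letters of 'blue'. That suggests the code reflects just the first word, so values that merely start with the same word may end up in the same group. A quick way to inspect this, using illustrative values ('blue truck' is a made-up value that is not in the sample data):

```sql
-- Show the SOUNDEX code next to each phrase so the grouping can be checked by eye.
-- 'blue truck' is a hypothetical value added only for comparison.
SELECT v.Phrase, SOUNDEX(v.Phrase) AS SoundexCode
FROM (VALUES ('blue car'), ('blue cars'), ('blue truck'), ('green tree')) AS v(Phrase);
```

If 'blue truck' comes back with the same code as 'blue car', SOUNDEX alone is too coarse for multi-word values and a stricter comparison is needed.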

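Edit: As per your request, to get back to the original data you can also write the results into a temp table. Change the query I gave you by putting an INTO clause before the FROM: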

SELECT
a.GroupingField
,a.Title
,b.SimilarFields
INTO #Duplicates
FROM (
        SELECT
        SOUNDEX(Column2) GroupingField
        ,MAX(Column2) Title --Just return a unique title for this soundex group
        FROM #SampleData
        GROUP BY SOUNDEX(Column2)
      ) a
LEFT JOIN   (
                SELECT
                SOUNDEX(Column2) GroupingField
                ,COUNT(Column2) SimilarFields --How many fields are in the soundex group?
                FROM #SampleData
                GROUP BY SOUNDEX(Column2)
            ) b
ON a.GroupingField = b.GroupingField
WHERE b.SimilarFields > 1

Then use the following query to link back to your original data:

SELECT
a.GroupingField
,a.Title
,a.SimilarFields
,b.Column1
,b.Column2
FROM #Duplicates a
JOIN #SampleData b
ON a.GroupingField = SOUNDEX(b.Column2)
ORDER BY a.GroupingField

This gives the following results:

GroupingField   Title       SimilarFields   Column1     Column2
B400            blue cars   2               1           blue car
B400            blue cars   2               3           blue cars
R300            red dolly   2               5           red dolly
R300            red dolly   2               2           red doll

Remember to drop the temp table when you are done:

DROP TABLE #Duplicates
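
If the temp table is not needed for anything else, a window function is one way to attach the group size to each original row directly. This is a sketch assuming the same #SampleData table; it leaves out the Title column and is not part of the original answer.

```sql
-- COUNT(*) OVER (PARTITION BY ...) puts the group size on every row,
-- so the original rows can be filtered without a temp table or a join.
SELECT s.GroupingField, s.SimilarFields, s.Column1, s.Column2
FROM (
    SELECT
        SOUNDEX(Column2) AS GroupingField,
        COUNT(*) OVER (PARTITION BY SOUNDEX(Column2)) AS SimilarFields,
        Column1,
        Column2
    FROM #SampleData
) s
WHERE s.SimilarFields > 1
ORDER BY s.GroupingField, s.Column1;
```

The derived table is needed because a window function cannot be referenced in the WHERE clause of the same SELECT.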

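Answer 1 (score: 1)

As Gar correctly commented, you have to define what you mean by "similarity". However, if you only need some fixed number of identical leading characters (8 in your example), you can do the following: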

create table myTest
(
    id int,
    name varchar(20)
);

insert into myTest values(1, 'blue car');
insert into myTest values(2, 'red doll');
insert into myTest values(3, 'blue cars');
insert into myTest values(4, 'green tree');
insert into myTest values(5, 'red dolly');

select left(name,8), count(*) 
from myTest 
group by left(name,8) 
having count(*) > 1;
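
To see the matching rows themselves rather than just the shared prefix, the grouped result can be joined back to the table. This is a sketch assuming the myTest table above; the variable @PrefixLength is a hypothetical name introduced here so the length can be changed in one place.

```sql
DECLARE @PrefixLength int = 8;  -- hypothetical parameter: number of leading characters to compare

SELECT t.id, t.name
FROM myTest t
JOIN (
    -- prefixes shared by more than one row
    SELECT LEFT(name, @PrefixLength) AS prefix
    FROM myTest
    GROUP BY LEFT(name, @PrefixLength)
    HAVING COUNT(*) > 1
) d
    ON LEFT(t.name, @PrefixLength) = d.prefix
ORDER BY d.prefix, t.id;
```

With the sample data and a length of 8, this should list the two blue car rows and the two red doll rows and leave out green tree.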

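Answer 2 (score: 0)

This approach uses a very basic notion of similarity, but it can be extended to a better definition. Note that it is not very efficient. The count(1) + 1 is there to include the base phrase itself in the count.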

create table phrases ( phrase varchar(max) )
insert phrases values( 'blue car' ), ( 'blue cars' ), ( 'green tree' ), ( 'red doll' ), ( 'red dolly' )
go

-- CREATE FUNCTION must be the first statement in its batch, hence the GO separators
create function dbo.fnSimilar( @s1 varchar(max), @s2 varchar(max) )
returns int
begin
    if @s1 = @s2 return 0 -- a phrase is not similar to itself
    if @s1 like @s2 + '%' return 1 -- @s1 starts with @s2
    if @s2 like @s1 + '%' return 2 -- @s2 starts with @s1
    return 0
end
go

select x.phrase, similar = count(1) + 1 from 
(
    select p1.phrase from phrases p1
    inner join phrases p2 on dbo.fnSimilar( p2.phrase, p1.phrase ) = 1
) x
group by x.phrase

Result:

phrase      similar
--------    -------
blue car    2
red doll    2
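
Since the answer notes that the function-based join is not very efficient, one option is to put the same prefix test directly in the join condition. This is a sketch assuming the phrases table above; like the function, it treats any %, _ or [ characters inside a phrase as LIKE wildcards.

```sql
-- Pair each base phrase with the longer phrases that start with it.
SELECT p1.phrase AS base_phrase, p2.phrase AS similar_phrase
FROM phrases p1
JOIN phrases p2
    ON  p2.phrase LIKE p1.phrase + '%'   -- p2 starts with p1
    AND p2.phrase <> p1.phrase;          -- a phrase is not similar to itself
```

Listing the pairs instead of a count also makes it easier to check whether this definition of similarity matches what is wanted.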