SQL: get counts using the most common part of a string

Date: 2014-06-18 09:45:31

Tags: sql sql-server-2008 ssms

I am able to get a count of rows for each value in a column, most frequent first, for example:

SELECT     COUNT(*) AS Count, ProjectID
FROM         Projects
GROUP BY ProjectID
ORDER BY Count DESC

Now I have a table like this:

ProjectID    ProjectUrl
1            http://www.CompanyA.com/Projects/123
2            http://www.CompanyB.com/Projects/124
3            http://www.CompanyA.com/Projects/125
4            http://www.CompanyB.com/Projects/126
5            http://www.CompanyA.com/Projects/127

The expected result, without supplying any parameters, is:

ProjectUrl = http://www.CompanyA.com Count = 3
ProjectUrl = http://www.CompanyB.com Count = 2

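Edit

Sorry, I forgot to mention what kind of URLs are in the table. The URLs are quite random, but some parts of them are common. Since we are creating project categories, a project category URL can look like: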

https://spanish.CompanyAa2342.com/portal/projectA/projectTeamA/ProjectPersonA/Task/124

But some projects have no project team and so on, so it is a bit random :?

I need the query to be more generic.

What the URLs have in common:

http://ramdomLanguage.CompanyName.com/portal/RandomName .....

2 answers:

Answer 0: (score: 2)

Please try:

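select
    Col,
    COUNT(Col) Cnt
from(
    select
        -- Keep the URL up to and including '.com'.
        -- With start position 0, SUBSTRING returns (length - 1) characters,
        -- so PATINDEX + 4 ends exactly on the 'm' of '.com'.
        SUBSTRING(ProjectUrl, 0, PATINDEX('%.com/%', ProjectUrl)+4) Col
    from tbl -- the table holding ProjectUrl (Projects in the question)
)x group by Col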

SQL Fiddle Demo
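
The query above relies on every URL containing '.com/'. As a rough sketch of a more generic variant, the following cuts each URL at the first '/' after the '://' separator, so it groups by scheme plus host whatever the top-level domain is. The table variable @Projects, its sample rows, and the names SchemeEnd, SlashPos and SiteUrl are assumptions made for illustration; they are not part of the original answer.

```sql
-- Hypothetical sample data mirroring the table in the question.
DECLARE @Projects TABLE (ProjectID int, ProjectUrl varchar(400));

INSERT INTO @Projects (ProjectID, ProjectUrl) VALUES
    (1, 'http://www.CompanyA.com/Projects/123'),
    (2, 'http://www.CompanyB.com/Projects/124'),
    (3, 'http://www.CompanyA.com/Projects/125'),
    (4, 'http://www.CompanyB.com/Projects/126'),
    (5, 'http://www.CompanyA.com/Projects/127');

SELECT      h.SiteUrl,
            COUNT(*) AS Cnt
FROM        @Projects AS p
-- Position of '://' (0 when the URL has no scheme).
CROSS APPLY (SELECT CHARINDEX('://', p.ProjectUrl) AS SchemeEnd) AS s
-- First '/' after the scheme; appending '/' guarantees a match
-- even when the URL has no path at all.
CROSS APPLY (SELECT CHARINDEX('/', p.ProjectUrl + '/',
                    CASE WHEN s.SchemeEnd > 0 THEN s.SchemeEnd + 3 ELSE 1 END) AS SlashPos) AS f
-- Everything before that slash is the scheme plus host.
CROSS APPLY (SELECT LEFT(p.ProjectUrl, f.SlashPos - 1) AS SiteUrl) AS h
GROUP BY    h.SiteUrl
ORDER BY    Cnt DESC,
            h.SiteUrl;
```

With the five sample rows this groups http://www.CompanyA.com (3) and http://www.CompanyB.com (2). Subdomains such as spanish.CompanyAa2342.com remain part of the grouped value, so counting by company name alone would need further string handling.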

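Answer 1: (score: 0)

I am not sure how this performs on a huge data set, but here is one solution. I tried to produce one row for each URL part, split on /, and then do a quick aggregation at the end to show the count for each individual part. The fiddle is here: http://www.sqlfiddle.com/#!3/742c4/12 (I added one extra row for demonstration purposes; thanks to TechT.)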

WITH cteFSPositions
AS
(
    SELECT      ProjectID,
                ProjectURL,
                1 AS CharPos,
                MAX(LEN(ProjectURL)) AS MaxLen,
                CHARINDEX('/', ProjectURL) AS FSPos
    FROM        Projects
    GROUP BY    ProjectID,
                ProjectURL

    UNION ALL

    SELECT      ProjectID,
                ProjectURL,
                CharPos + 1,
                MaxLen,
                CHARINDEX('/', ProjectURL, CharPos + 1) AS FSPos
    FROM        cteFSPositions
    WHERE       CharPos <= MaxLen
),
cteProjectURLParts
AS
(
    SELECT DISTINCT     ProjectID,
                        LEFT(ProjectURL, FSPos) AS ProjectURLPart,
                        FSPos
    FROM                cteFSPositions
    WHERE               FSPos > 0

    UNION ALL

    SELECT              ProjectID,
                        ProjectURL,
                        LEN(ProjectURL)
    FROM                Projects
),
cteFilteredProjectURLParts
AS
(
    SELECT      ProjectID,
                ProjectURLPart
    FROM        cteProjectURLParts
    WHERE       ProjectURLPart NOT IN ('http:', 'http:/', 'http://', 'https:', 'https:/', 'https://')
)

SELECT          ProjectURLPart,
                COUNT(*) AS Instances
FROM            cteFilteredProjectURLParts
GROUP BY        ProjectURLPart
ORDER BY        Instances DESC,
                ProjectURLPart;

This produces (with the extra row I added):

ProjectURLPart                                     Instances
http://www.CompanyA.com/                                   4
http://www.CompanyA.com/Projects/                          3
http://www.CompanyB.com/                                   2
http://www.CompanyB.com/Projects/                          2
http://www.CompanyA.com/BlahblahBlah/                      1
http://www.CompanyA.com/BlahblahBlah/More1/                1
http://www.CompanyA.com/BlahblahBlah/More1/More2           1
http://www.CompanyA.com/Projects/123                       1
http://www.CompanyA.com/Projects/125                       1
http://www.CompanyA.com/Projects/127                       1
http://www.CompanyB.com/Projects/124                       1
http://www.CompanyB.com/Projects/126                       1
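
One caveat with the recursive CTE above: it advances one character per recursion level, and SQL Server stops a recursive CTE after 100 levels by default, so URLs longer than roughly 100 characters would raise an error unless a MAXRECURSION hint is added. A possible alternative, sketched here under the assumption that every URL contains '://', is to jump straight from one '/' to the next so the depth equals the number of path separators. The table variable @Projects and the CTE names cteSlash and cteParts are made up for this example.

```sql
-- Hypothetical sample data mirroring the table in the question.
DECLARE @Projects TABLE (ProjectID int, ProjectURL varchar(400));

INSERT INTO @Projects (ProjectID, ProjectURL) VALUES
    (1, 'http://www.CompanyA.com/Projects/123'),
    (2, 'http://www.CompanyB.com/Projects/124'),
    (3, 'http://www.CompanyA.com/Projects/125'),
    (4, 'http://www.CompanyB.com/Projects/126'),
    (5, 'http://www.CompanyA.com/Projects/127');

WITH cteSlash AS
(
    -- Anchor: first '/' after the '://' separator (assumes a scheme is present).
    SELECT  ProjectID,
            ProjectURL,
            CHARINDEX('/', ProjectURL, CHARINDEX('://', ProjectURL) + 3) AS FSPos
    FROM    @Projects

    UNION ALL

    -- Recursive step: jump directly to the next '/', one level per separator.
    SELECT  ProjectID,
            ProjectURL,
            CHARINDEX('/', ProjectURL, FSPos + 1)
    FROM    cteSlash
    WHERE   FSPos > 0
),
cteParts AS
(
    -- Every prefix that ends on a '/'.
    SELECT  ProjectID,
            LEFT(ProjectURL, FSPos) AS ProjectURLPart
    FROM    cteSlash
    WHERE   FSPos > 0

    UNION   -- UNION (not UNION ALL) avoids double counting URLs that end in '/'

    -- The full URL, so the trailing part is not lost.
    SELECT  ProjectID,
            ProjectURL
    FROM    @Projects
)
SELECT      ProjectURLPart,
            COUNT(*) AS Instances
FROM        cteParts
GROUP BY    ProjectURLPart
ORDER BY    Instances DESC,
            ProjectURLPart;
```

Because the anchor starts searching after the scheme, the 'http:' and 'http:/' fragments never appear and no NOT IN filter is needed.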

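Edit: Oops, the original post contained work-in-progress fiddle code. I have now provided the final code and an updated fiddle link.

Edit 2: Because of the way I was cutting up the URLs, I realized I was cutting off the trailing part of each URL. For completeness, I have added those back into the final data set. Updated the fiddle.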