SQL Server: get the URL stem using only T-SQL?

Time: 2011-08-02 15:48:34

Tags: sql sql-server aggregation

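Without resorting to SQLCLR and regular expressions in C#, what is the best way to get the "stem" of a URL from a table with 500 million rows? The column is VarChar(3000), and the rows look like this: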

http://a.com
http://b.au
http://c.edu?a=3
http://d.com/?a=3
http://d.com/?a=3&b=2
http://d.com/?a=3&b=2

I need to select from the table and get this result set:

http://a.com             1 
http://b.au              1
http://c.edu             1       
http://d.com             3

Thanks.

2 answers:

Answer 0 (score: 3)

If all of your URL stems end at a '?' or a '/?', you can use this. Other cutoff patterns can be added to the CASE expression as needed:

DECLARE @test TABLE (URL varchar(3000))

INSERT INTO @test (URL) VALUES ('http://a.com')
INSERT INTO @test (URL) VALUES ('http://b.au')
INSERT INTO @test (URL) VALUES ('http://c.edu?a=3')
INSERT INTO @test (URL) VALUES ('http://d.com/?a=3')
INSERT INTO @test (URL) VALUES ('http://d.com/?a=3&b=2')
INSERT INTO @test (URL) VALUES ('http://d.com/?a=3&b=2')

SELECT SUBSTRING(URL, 0, 
    CASE
        WHEN PATINDEX('%/?%', URL) > 0 THEN PATINDEX('%/?%', URL)
        WHEN PATINDEX('%?%', URL) > 0 THEN PATINDEX('%?%', URL)
        ELSE LEN(URL) + 1
    END), COUNT(*)
FROM @test
GROUP BY SUBSTRING(URL, 0, 
    CASE
        WHEN PATINDEX('%/?%', URL) > 0 THEN PATINDEX('%/?%', URL)
        WHEN PATINDEX('%?%', URL) > 0 THEN PATINDEX('%?%', URL)
        ELSE LEN(URL) + 1
    END)
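
The query above writes the cutoff expression twice, once in SELECT and once in GROUP BY. One way to avoid the repetition is sketched below, assuming SQL Server 2005 or later (for CROSS APPLY). The aliases `cut`, `pos`, `stem` and `hits`, and the extra '#' cutoff with its sample row, are illustrative additions and not part of the original answer:

```sql
DECLARE @test TABLE (URL varchar(3000));

INSERT INTO @test (URL) VALUES ('http://a.com');
INSERT INTO @test (URL) VALUES ('http://c.edu?a=3');
INSERT INTO @test (URL) VALUES ('http://d.com/?a=3');
INSERT INTO @test (URL) VALUES ('http://d.com/?a=3&b=2');
INSERT INTO @test (URL) VALUES ('http://d.com#top');  -- hypothetical row for the extra cutoff

SELECT LEFT(t.URL, cut.pos - 1) AS stem, COUNT(*) AS hits
FROM @test AS t
CROSS APPLY (
    -- Position where the stem ends; patterns are tried in order, as in the CASE above.
    SELECT CASE
               WHEN PATINDEX('%/?%', t.URL) > 0 THEN PATINDEX('%/?%', t.URL)
               WHEN PATINDEX('%?%', t.URL) > 0 THEN PATINDEX('%?%', t.URL)
               WHEN PATINDEX('%#%', t.URL) > 0 THEN PATINDEX('%#%', t.URL)  -- assumed extra pattern
               ELSE LEN(t.URL) + 1  -- no cutoff found: keep the whole URL
           END AS pos
) AS cut
GROUP BY LEFT(t.URL, cut.pos - 1);
```

Because the CASE is evaluated top to bottom, the order of the WHEN branches decides which cutoff wins when a URL contains more than one of the patterns.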

Answer 1 (score: 2)

How about:

;with test (url) as (
    select 'http://a.com' union
    select 'http://b.au' union
    select 'http://c.edu?a=3' union
    select 'http://d.com/?a=3' union
    select 'http://d.com/?a=3&b=2' union all
    select 'http://d.com/?a=3&b=2'
)
select
    rtrim(replace(left(url, charindex('?', url + '?', 1) - 1) + ' ', '/ ', ''))
from test


>>>
http://a.com
http://b.au
http://c.edu
http://d.com
http://d.com
http://d.com

Change it to

...,COUNT(*)
from test
    group by rtrim(replace(left(url, charindex('?', url + '?', 1) - 1) + ' ', '/ ', ''))

to get the grouped counts.
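
Putting the query and the grouping change together, a complete version might look like the sketch below. The derived table `s` and the aliases `stem` and `hits` are assumptions added here so the expression is written only once. The sample rows use `union all` throughout so that every duplicate is counted:

```sql
;with test (url) as (
    select 'http://a.com' union all
    select 'http://b.au' union all
    select 'http://c.edu?a=3' union all
    select 'http://d.com/?a=3' union all
    select 'http://d.com/?a=3&b=2' union all
    select 'http://d.com/?a=3&b=2'
)
select s.stem, count(*) as hits
from (
    -- Cut at the first '?' (appending '?' guarantees one exists),
    -- then strip a single trailing '/' via the '/ ' replace trick.
    select rtrim(replace(left(url, charindex('?', url + '?', 1) - 1) + ' ', '/ ', '')) as stem
    from test
) as s
group by s.stem;
```

Note that the replace also removes any '/' followed by a space elsewhere in the value, so this approach assumes the URLs contain no embedded spaces.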