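Short of using SQLCLR to run a regular expression in C#, what is the best way to get the "stem" of a URL from a table with 500 million rows? The column is VarChar(3000), and the table has rows like the following: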
http://a.com
http://b.au
http://c.edu?a=3
http://d.com/?a=3
http://d.com/?a=3&b=2
http://d.com/?a=3&b=2
I need to select from the table and get this result set:
http://a.com 1
http://b.au 1
http://c.edu 1
http://d.com 3
Thanks.
Answer 0: (score: 3)
If all of your URL "stems" end at '?' or '/?', you can use this. Additional cutoff patterns can be added to the CASE expression as needed:
DECLARE @test TABLE (URL varchar(3000))
INSERT INTO @test (URL) VALUES ('http://a.com')
INSERT INTO @test (URL) VALUES ('http://b.au')
INSERT INTO @test (URL) VALUES ('http://c.edu?a=3')
INSERT INTO @test (URL) VALUES ('http://d.com/?a=3')
INSERT INTO @test (URL) VALUES ('http://d.com/?a=3&b=2')
INSERT INTO @test (URL) VALUES ('http://d.com/?a=3&b=2')
SELECT SUBSTRING(URL, 0,
           CASE
               WHEN PATINDEX('%/?%', URL) > 0 THEN PATINDEX('%/?%', URL)
               WHEN PATINDEX('%?%', URL) > 0 THEN PATINDEX('%?%', URL)
               ELSE LEN(URL) + 1
           END), COUNT(*)
FROM @test
GROUP BY SUBSTRING(URL, 0,
           CASE
               WHEN PATINDEX('%/?%', URL) > 0 THEN PATINDEX('%/?%', URL)
               WHEN PATINDEX('%?%', URL) > 0 THEN PATINDEX('%?%', URL)
               ELSE LEN(URL) + 1
           END)
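The query above has to repeat the same CASE expression in the SELECT list and in the GROUP BY clause. One possible variation, sketched here on the assumption that CROSS APPLY is available (SQL Server 2005 or later), computes the stem once and groups on it. The aliases t, s, Stem, and UrlCount are illustrative names and are not part of the original answer.

```sql
-- Same sample data as in the answer above.
DECLARE @test TABLE (URL varchar(3000));
INSERT INTO @test (URL) VALUES ('http://a.com');
INSERT INTO @test (URL) VALUES ('http://b.au');
INSERT INTO @test (URL) VALUES ('http://c.edu?a=3');
INSERT INTO @test (URL) VALUES ('http://d.com/?a=3');
INSERT INTO @test (URL) VALUES ('http://d.com/?a=3&b=2');
INSERT INTO @test (URL) VALUES ('http://d.com/?a=3&b=2');

-- Compute the stem once per row, then group on the derived column.
SELECT s.Stem, COUNT(*) AS UrlCount
FROM @test AS t
CROSS APPLY (
    SELECT SUBSTRING(t.URL, 0,
               CASE
                   -- Cut at '/?' first so the trailing slash is dropped too.
                   WHEN PATINDEX('%/?%', t.URL) > 0 THEN PATINDEX('%/?%', t.URL)
                   WHEN PATINDEX('%?%', t.URL) > 0 THEN PATINDEX('%?%', t.URL)
                   -- No cutoff found: keep the whole URL.
                   ELSE LEN(t.URL) + 1
               END) AS Stem
) AS s
GROUP BY s.Stem;
```

SUBSTRING with a start position of 0 and a length of n returns the first n - 1 characters, which is why the PATINDEX position can be passed directly as the length without subtracting 1.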
Answer 1: (score: 2)
How about:
;with test (url) as (
select 'http://a.com' union
select 'http://b.au' union
select 'http://c.edu?a=3' union
select 'http://d.com/?a=3' union
select 'http://d.com/?a=3&b=2' union all
select 'http://d.com/?a=3&b=2'
)
select
rtrim(replace(left(url, charindex('?', url + '?', 1) - 1) + ' ', '/ ', ''))
from test
>>>
http://a.com
http://b.au
http://c.edu
http://d.com
http://d.com
http://d.com
To get the grouped counts, change the end of the query to:
...,COUNT(*)
from test
group by rtrim(replace(left(url, charindex('?', url + '?', 1) - 1) + ' ', '/ ', ''))
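Put together, the grouped version of this second approach might look like the sketch below. It uses UNION ALL throughout so that all six sample rows, including the duplicate, are kept; the column aliases stem and url_count are illustrative and not from the original answer.

```sql
;with test (url) as (
    select 'http://a.com' union all
    select 'http://b.au' union all
    select 'http://c.edu?a=3' union all
    select 'http://d.com/?a=3' union all
    select 'http://d.com/?a=3&b=2' union all
    select 'http://d.com/?a=3&b=2'
)
select
    -- left(...) keeps everything before the first '?'; appending '?' to url
    -- guarantees charindex finds one even when the URL has no query string.
    -- Appending ' ' and replacing '/ ' strips a single trailing slash.
    rtrim(replace(left(url, charindex('?', url + '?', 1) - 1) + ' ', '/ ', '')) as stem,
    count(*) as url_count
from test
group by rtrim(replace(left(url, charindex('?', url + '?', 1) - 1) + ' ', '/ ', ''))
```

Note that the replace would also remove any '/ ' (slash followed by a space) inside the stem itself, which should not occur in well-formed URLs.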