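I'm trying to use Oracle SQL regexp_replace to mask the domain name in a list of URLs. The problem seems to be that some of the URLs have a port number and some do not.
In the following example, the-super.hosting.com should be replaced with HOSTNAME (but without hard-coding it in the regexp, since it could be anything):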
WITH strings AS (
SELECT 'http://wwww11.the-super.hosting.com:9999/aPath/servlet?config=abcLoginNr=%1' s FROM dual union all
SELECT 'http://wwww22.the-super.hosting.com:6666/aPath/servlet?config=abcLoginNr=%2' s FROM dual union all
SELECT 'http://wwww22.the-super.hosting.com:6666/aPath/servlet?config=abcLoginNr=%2' s FROM dual union all
SELECT 'http://wwww04.the-super.hosting.com/aPath/servlet?config#here' s FROM dual
)
SELECT regexp_replace(s,'([[:alpha:]]+://[[:alpha:]]{4}[[:digit:]]{2}\.)(.+)(:9999/|:6666/|/?)(.+)', '\1HOSTNAME\3\4') "MODIFIED_STRING", s "STRING"
FROM strings;
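I can't seem to treat the port as optional when a URL may also have a plain path (where the path starts directly after the domain).
Is it possible to match the domain part in a different way, so that whatever remains is always the path with an optional port?
Is there a way to do the replacement in a single statement?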
Answer 0 (score: 4)
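I think you are making this more complicated than it needs to be. You really only need three parts: the initial protocol (anything followed by ://) together with the www??. prefix (assuming that is in fact always there); the rest of the domain name, which is to be removed; and everything that is left, which may or may not include a port, but you don't really care. So: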
([^.]+\.)([^/:]+)(.*)
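where:
([^.]+\.) is the protocol and everything up to and including the first dot in the domain name;
([^/:]+) is everything up to, but not including, the first slash or colon;
(.*) is whatever is left.
For the replacement, you want to keep the first and third parts as they are and replace the second part with the fixed string HOSTNAME.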
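To make the three parts easier to see, here is a small sketch that pulls each capture group out on its own with REGEXP_SUBSTR. The two sample URLs are shortened versions of the ones in the question, and the column aliases (prefix_part, domain_part, rest_part) are names made up for this illustration. It assumes Oracle 11g or later, where REGEXP_SUBSTR accepts a subexpression number as its sixth argument.

```sql
-- Illustrative sketch only: show what each of the three groups captures.
-- Assumes Oracle 11g+ (sixth argument of REGEXP_SUBSTR = subexpression number).
WITH strings AS (
  SELECT 'http://wwww11.the-super.hosting.com:9999/aPath/servlet' s FROM dual UNION ALL
  SELECT 'http://wwww04.the-super.hosting.com/aPath' s FROM dual
)
SELECT s,
       -- group 1: protocol plus everything up to and including the first dot
       REGEXP_SUBSTR(s, '([^.]+\.)([^/:]+)(.*)', 1, 1, NULL, 1) AS prefix_part,
       -- group 2: the part of the domain that gets replaced by HOSTNAME
       REGEXP_SUBSTR(s, '([^.]+\.)([^/:]+)(.*)', 1, 1, NULL, 2) AS domain_part,
       -- group 3: optional port and path, kept unchanged
       REGEXP_SUBSTR(s, '([^.]+\.)([^/:]+)(.*)', 1, 1, NULL, 3) AS rest_part
FROM strings;
```

For the first row, domain_part should come back as the-super.hosting.com and rest_part as the port followed by the path; for the second row, rest_part should start directly with the slash.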
So you get:
WITH strings AS (
SELECT 'http://wwww11.the-super.hosting.com:9999/aPath/servlet?config=abcLoginNr=%1' s FROM dual union all
SELECT 'http://wwww22.the-super.hosting.com:6666/aPath/servlet?config=abcLoginNr=%2' s FROM dual union all
SELECT 'http://wwww22.the-super.hosting.com:6666/aPath/servlet?config=abcLoginNr=%2' s FROM dual union all
SELECT 'http://wwww04.the-super.hosting.com/aPath/servlet?config#here' s FROM dual union all
SELECT 'http://wwww04.the-super.hosting.com' s FROM dual union all
SELECT 'http://wwww04.the-super.hosting.com/' s FROM dual union all
SELECT 'http://wwww04.the-super.hosting.com/aPath' s FROM dual union all
SELECT 'http://wwww04.the-super.hosting.com:1234' s FROM dual union all
SELECT 'http://wwww04.the-super.hosting.com:1234/' s FROM dual union all
SELECT 'http://wwww04.the-super.hosting.com:1234/aPath' s FROM dual
)
SELECT regexp_replace(s, '([^.]+\.)([^/:]+)(.*)', '\1HOSTNAME\3') "MODIFIED_STRING", s "STRING"
FROM strings;
MODIFIED_STRING                                                STRING
-------------------------------------------------------------- ---------------------------------------------------------------------------
http://wwww11.HOSTNAME:9999/aPath/servlet?config=abcLoginNr=%1 http://wwww11.the-super.hosting.com:9999/aPath/servlet?config=abcLoginNr=%1
http://wwww22.HOSTNAME:6666/aPath/servlet?config=abcLoginNr=%2 http://wwww22.the-super.hosting.com:6666/aPath/servlet?config=abcLoginNr=%2
http://wwww22.HOSTNAME:6666/aPath/servlet?config=abcLoginNr=%2 http://wwww22.the-super.hosting.com:6666/aPath/servlet?config=abcLoginNr=%2
http://wwww04.HOSTNAME/aPath/servlet?config#here               http://wwww04.the-super.hosting.com/aPath/servlet?config#here
http://wwww04.HOSTNAME                                         http://wwww04.the-super.hosting.com
http://wwww04.HOSTNAME/                                        http://wwww04.the-super.hosting.com/
http://wwww04.HOSTNAME/aPath                                   http://wwww04.the-super.hosting.com/aPath
http://wwww04.HOSTNAME:1234                                    http://wwww04.the-super.hosting.com:1234
http://wwww04.HOSTNAME:1234/                                   http://wwww04.the-super.hosting.com:1234/
http://wwww04.HOSTNAME:1234/aPath                              http://wwww04.the-super.hosting.com:1234/aPath
You could be more explicit about the protocol format and so on, but I'm not sure there is much point.
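If a stricter match is wanted anyway, one possible variant (a sketch, not something the question requires) anchors the pattern and spells out the protocol part. The exact character classes used here are an assumption about what the URLs look like: an alphabetic scheme, then ://, then a first host label containing no dot, slash or colon. The extra row without a dotted host name is a made-up test value.

```sql
-- Illustrative sketch only: a more explicit, anchored version of the pattern.
-- Assumption: scheme is alphabetic and the first host label has no '.', '/' or ':'.
WITH strings AS (
  SELECT 'http://wwww04.the-super.hosting.com:1234/aPath' s FROM dual UNION ALL
  SELECT 'http://wwww04.the-super.hosting.com/aPath' s FROM dual UNION ALL
  SELECT 'http://localhost/aPath' s FROM dual  -- hypothetical row with no dot in the host
)
SELECT regexp_replace(s,
         '^([[:alpha:]]+://[^./:]+\.)([^/:]+)(.*)$',
         '\1HOSTNAME\3') "MODIFIED_STRING",
       s "STRING"
FROM strings;
```

REGEXP_REPLACE returns the source string unchanged when the pattern does not match, so a row that does not fit the stricter shape (such as the host with no dot) is left alone rather than being partially rewritten.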
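The problem with your original pattern is the combination of greediness and the optional slash (the last 'or' alternative in the port group). The greedy (.+) consumes as much as it can, and because the third group is allowed to match nothing, there is nothing to make it stop at the end of the domain. You can adjust the pattern so that it works, at least for your sample data, e.g.: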
WITH strings AS (
SELECT 'http://wwww11.the-super.hosting.com:9999/aPath/servlet?config=abcLoginNr=%1' s FROM dual union all
SELECT 'http://wwww22.the-super.hosting.com:6666/aPath/servlet?config=abcLoginNr=%2' s FROM dual union all
SELECT 'http://wwww22.the-super.hosting.com:6666/aPath/servlet?config=abcLoginNr=%2' s FROM dual union all
SELECT 'http://wwww04.the-super.hosting.com/aPath/servlet?config#here' s FROM dual
)
SELECT regexp_replace(s,'([[:alpha:]]+://[[:alpha:]]{4}[[:digit:]]{2}\.)(.+?)(:9999/|:6666/|/)(.+)$', '\1HOSTNAME\3\4') "MODIFIED_STRING", s "STRING"
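-- changed: (.+) became the lazy (.+?), the slash alternative is no longer optional, and a trailing $ anchor was added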
FROM strings;
MODIFIED_STRING                                                STRING
-------------------------------------------------------------- ---------------------------------------------------------------------------
http://wwww11.HOSTNAME:9999/aPath/servlet?config=abcLoginNr=%1 http://wwww11.the-super.hosting.com:9999/aPath/servlet?config=abcLoginNr=%1
http://wwww22.HOSTNAME:6666/aPath/servlet?config=abcLoginNr=%2 http://wwww22.the-super.hosting.com:6666/aPath/servlet?config=abcLoginNr=%2
http://wwww22.HOSTNAME:6666/aPath/servlet?config=abcLoginNr=%2 http://wwww22.the-super.hosting.com:6666/aPath/servlet?config=abcLoginNr=%2
http://wwww04.HOSTNAME/aPath/servlet?config#here               http://wwww04.the-super.hosting.com/aPath/servlet?config#here
But that seems like overkill.
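To see the greediness effect directly, the following sketch extracts only the second capture group (the part meant to be the domain) under both versions of the pattern. The column aliases greedy_group2 and lazy_group2 are made up for this illustration, and it again assumes Oracle 11g or later for the subexpression argument of REGEXP_SUBSTR.

```sql
-- Illustrative sketch only: compare what group 2 captures in the two patterns.
-- Assumes Oracle 11g+ (sixth argument of REGEXP_SUBSTR = subexpression number).
WITH strings AS (
  SELECT 'http://wwww11.the-super.hosting.com:9999/aPath/servlet?config=abcLoginNr=%1' s FROM dual
)
SELECT
  -- original pattern from the question: greedy (.+) and optional slash
  REGEXP_SUBSTR(s,
    '([[:alpha:]]+://[[:alpha:]]{4}[[:digit:]]{2}\.)(.+)(:9999/|:6666/|/?)(.+)',
    1, 1, NULL, 2) AS greedy_group2,
  -- adjusted pattern: lazy (.+?), mandatory slash, anchored with $
  REGEXP_SUBSTR(s,
    '([[:alpha:]]+://[[:alpha:]]{4}[[:digit:]]{2}\.)(.+?)(:9999/|:6666/|/)(.+)$',
    1, 1, NULL, 2) AS lazy_group2
FROM strings;
```

With the adjusted pattern, lazy_group2 should be just the-super.hosting.com, which matches the output shown above. With the original pattern, greedy_group2 is expected to run on through the port and path, which is why the replacement in the question did not leave them intact.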