Oracle SQL regexp_replace stops at an OR group

Time: 2019-04-11 17:02:10

Tags: sql regex oracle replace

I'm trying to use Oracle SQL regexp_replace to mask the domain name in a list of URLs. The problem seems to be that some of them have a port number and some don't.

In the following example, the-super.hosting.com should be replaced with HOSTNAME (but it must not be hard-coded in the regexp, because it could be anything):

WITH strings AS (   
  SELECT 'http://wwww11.the-super.hosting.com:9999/aPath/servlet?config=abcLoginNr=%1' s FROM dual union all   
  SELECT 'http://wwww22.the-super.hosting.com:6666/aPath/servlet?config=abcLoginNr=%2' s FROM dual union all   
  SELECT 'http://wwww22.the-super.hosting.com:6666/aPath/servlet?config=abcLoginNr=%2' s FROM dual union all   
  SELECT 'http://wwww04.the-super.hosting.com/aPath/servlet?config#here' s FROM dual   
)  
  SELECT regexp_replace(s,'([[:alpha:]]+://[[:alpha:]]{4}[[:digit:]]{2}\.)(.+)(:9999/|:6666/|/?)(.+)', '\1HOSTNAME\3\4') "MODIFIED_STRING", s "STRING"
  FROM strings;

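It doesn't seem possible to treat the port as optional here, because without a port the path starts immediately after the host name.
Is there a different way to match the domain part, so that whatever remains is always the path with an optional port?
Is there a way to do this replacement in a single statement?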

1 Answer:

Answer 0 (score: 4)

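I think you're making it more complicated than it needs to be. You really only need three parts: the initial protocol (anything followed by ://) together with the www??. prefix (assuming that is actually always present); the rest of the domain name, which you want to remove; and everything that's left, which may or may not include a port, but you don't really care. So: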

([^.]+\.)([^/:]+)(.*)

where

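  • ([^.]+\.) is the protocol and everything else up to and including the first dot in the domain name;
  • ([^/:]+) is everything after that up to, but not including, the first slash or colon (or the end of the string if there is neither);
  • (.*) is whatever is left.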
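
To see what each group captures, the pattern can be run with a replacement string that keeps only one group at a time. This is a minimal sketch using one sample URL from the question; the column aliases part_1 to part_3 are illustrative names, and the comments show what each column should contain given that the pattern matches the whole string.

```sql
-- Sketch: show each capture group of the three-part pattern separately.
-- The sample URL comes from the question; part_1..part_3 are illustrative aliases.
WITH strings AS (
  SELECT 'http://wwww11.the-super.hosting.com:9999/aPath/servlet?config=abcLoginNr=%1' s FROM dual
)
SELECT regexp_replace(s, '([^.]+\.)([^/:]+)(.*)', '\1') AS part_1,  -- http://wwww11.
       regexp_replace(s, '([^.]+\.)([^/:]+)(.*)', '\2') AS part_2,  -- the-super.hosting.com
       regexp_replace(s, '([^.]+\.)([^/:]+)(.*)', '\3') AS part_3   -- :9999/aPath/servlet?config=abcLoginNr=%1
FROM strings;
```

Because the second group excludes both / and :, it stops at whichever comes first, so the port (if any) always ends up in the third group along with the path.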

For the replacement, you want to keep the first and third parts unchanged and replace the second part with the fixed string HOSTNAME.

So you get:

WITH strings AS (
  SELECT 'http://wwww11.the-super.hosting.com:9999/aPath/servlet?config=abcLoginNr=%1' s FROM dual union all
  SELECT 'http://wwww22.the-super.hosting.com:6666/aPath/servlet?config=abcLoginNr=%2' s FROM dual union all
  SELECT 'http://wwww22.the-super.hosting.com:6666/aPath/servlet?config=abcLoginNr=%2' s FROM dual union all
  SELECT 'http://wwww04.the-super.hosting.com/aPath/servlet?config#here' s FROM dual union all
  SELECT 'http://wwww04.the-super.hosting.com' s FROM dual union all
  SELECT 'http://wwww04.the-super.hosting.com/' s FROM dual union all
  SELECT 'http://wwww04.the-super.hosting.com/aPath' s FROM dual union all
  SELECT 'http://wwww04.the-super.hosting.com:1234' s FROM dual union all
  SELECT 'http://wwww04.the-super.hosting.com:1234/' s FROM dual union all
  SELECT 'http://wwww04.the-super.hosting.com:1234/aPath' s FROM dual
)  
SELECT regexp_replace(s, '([^.]+\.)([^/:]+)(.*)', '\1HOSTNAME\3') "MODIFIED_STRING", s "STRING"
FROM strings;

MODIFIED_STRING                                                STRING                                                                     
-------------------------------------------------------------- ---------------------------------------------------------------------------
http://wwww11.HOSTNAME:9999/aPath/servlet?config=abcLoginNr=%1 http://wwww11.the-super.hosting.com:9999/aPath/servlet?config=abcLoginNr=%1
http://wwww22.HOSTNAME:6666/aPath/servlet?config=abcLoginNr=%2 http://wwww22.the-super.hosting.com:6666/aPath/servlet?config=abcLoginNr=%2
http://wwww22.HOSTNAME:6666/aPath/servlet?config=abcLoginNr=%2 http://wwww22.the-super.hosting.com:6666/aPath/servlet?config=abcLoginNr=%2
http://wwww04.HOSTNAME/aPath/servlet?config#here               http://wwww04.the-super.hosting.com/aPath/servlet?config#here              
http://wwww04.HOSTNAME                                         http://wwww04.the-super.hosting.com                                        
http://wwww04.HOSTNAME/                                        http://wwww04.the-super.hosting.com/                                       
http://wwww04.HOSTNAME/aPath                                   http://wwww04.the-super.hosting.com/aPath                                  
http://wwww04.HOSTNAME:1234                                    http://wwww04.the-super.hosting.com:1234                                   
http://wwww04.HOSTNAME:1234/                                   http://wwww04.the-super.hosting.com:1234/                                  
http://wwww04.HOSTNAME:1234/aPath                              http://wwww04.the-super.hosting.com:1234/aPath                             

You could be more explicit about the protocol format and so on, but I'm not sure there's much point.
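
For illustration, one way to be more explicit is to anchor the pattern and spell out the scheme and the first host label. This is only a sketch under the assumption that every URL starts with an alphabetic scheme followed by ://; the anchored pattern and the non-URL sample row are not from the original answer.

```sql
-- Sketch of a stricter variant (an assumption, not the answer's pattern):
--   ^[[:alpha:]]+://   an alphabetic scheme at the very start of the string
--   [^./:]+\.          the first host label (for example wwww04) and its dot
WITH strings AS (
  SELECT 'http://wwww04.the-super.hosting.com:1234/aPath' s FROM dual union all
  SELECT 'http://wwww04.the-super.hosting.com' s FROM dual union all
  SELECT 'see notes.txt for details' s FROM dual  -- hypothetical non-URL value
)
SELECT regexp_replace(s, '^([[:alpha:]]+://[^./:]+\.)([^/:]+)(.*)$', '\1HOSTNAME\3') "STRICT",
       regexp_replace(s, '([^.]+\.)([^/:]+)(.*)', '\1HOSTNAME\3') "LOOSE",
       s "STRING"
FROM strings;
```

The two patterns should agree on the URL rows. The difference only shows on values that are not URLs: the stricter pattern does not match the third row, so regexp_replace returns it unchanged, while the looser one treats the text before the first dot as the prefix and rewrites the rest to HOSTNAME.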


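The problem with your original pattern is a mix of greediness and the optional slash (the last 'or' alternative in the port group). You could tweak it to work, at least with your sample data, for example: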

WITH strings AS (   
  SELECT 'http://wwww11.the-super.hosting.com:9999/aPath/servlet?config=abcLoginNr=%1' s FROM dual union all   
  SELECT 'http://wwww22.the-super.hosting.com:6666/aPath/servlet?config=abcLoginNr=%2' s FROM dual union all   
  SELECT 'http://wwww22.the-super.hosting.com:6666/aPath/servlet?config=abcLoginNr=%2' s FROM dual union all   
  SELECT 'http://wwww04.the-super.hosting.com/aPath/servlet?config#here' s FROM dual   
)  
SELECT regexp_replace(s,'([[:alpha:]]+://[[:alpha:]]{4}[[:digit:]]{2}\.)(.+?)(:9999/|:6666/|/)(.+)$', '\1HOSTNAME\3\4') "MODIFIED_STRING", s "STRING"
--                                                                         ^               ^^^    ^
FROM strings;

MODIFIED_STRING                                                STRING                                                                     
-------------------------------------------------------------- ---------------------------------------------------------------------------
http://wwww11.HOSTNAME:9999/aPath/servlet?config=abcLoginNr=%1 http://wwww11.the-super.hosting.com:9999/aPath/servlet?config=abcLoginNr=%1
http://wwww22.HOSTNAME:6666/aPath/servlet?config=abcLoginNr=%2 http://wwww22.the-super.hosting.com:6666/aPath/servlet?config=abcLoginNr=%2
http://wwww22.HOSTNAME:6666/aPath/servlet?config=abcLoginNr=%2 http://wwww22.the-super.hosting.com:6666/aPath/servlet?config=abcLoginNr=%2
http://wwww04.HOSTNAME/aPath/servlet?config#here               http://wwww04.the-super.hosting.com/aPath/servlet?config#here              

But that seems like overkill.
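
To inspect the greediness issue directly, the second capture group of both patterns can be pulled out side by side. This is a sketch that assumes Oracle 11g or later, where regexp_substr accepts a sixth argument selecting a sub-expression; the aliases greedy_host and lazy_host are illustrative.

```sql
-- Sketch: compare what group 2 captures with a greedy (.+) and a non-greedy (.+?).
-- Arguments of regexp_substr: source, pattern, start position, occurrence,
-- match parameter, sub-expression number (the last one needs 11g or later).
WITH strings AS (
  SELECT 'http://wwww11.the-super.hosting.com:9999/aPath/servlet?config=abcLoginNr=%1' s FROM dual union all
  SELECT 'http://wwww04.the-super.hosting.com/aPath/servlet?config#here' s FROM dual
)
SELECT regexp_substr(s, '([[:alpha:]]+://[[:alpha:]]{4}[[:digit:]]{2}\.)(.+)(:9999/|:6666/|/?)(.+)', 1, 1, NULL, 2) AS greedy_host,
       regexp_substr(s, '([[:alpha:]]+://[[:alpha:]]{4}[[:digit:]]{2}\.)(.+?)(:9999/|:6666/|/)(.+)$', 1, 1, NULL, 2) AS lazy_host,
       s "STRING"
FROM strings;
```

With the adjusted pattern, lazy_host should be the-super.hosting.com for both rows, consistent with the output above. In the original pattern the greedy (.+) is free to run well past the host name, because the /? alternative can match an empty string and so never forces the group to stop at the port or the first slash.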