Enhancing a regular expression to match more URLs

Time: 2015-11-17 01:22:47

Tags: java regex url

Consider this regular expression:

  static String AdrPattern="(?:http://www\\.([^/&]+)\\.com/|(?!^)\\G)/?([^/]+)";

I have two small questions:

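  1. How can I make it match URLs that consist of only the domain, with no further path or segments (e.g. https://stackoverflow.com)?
  2. How can I make this regex match URLs with different domain extensions?

P.S.: The regex was taken from here and works fine, but these two shortcomings should be fixed.

Edit

With the following code, the regex from the answer to this post skips the further segments and prints only the domain name: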

        static String AdrPattern = "(?:(?!\\A)\\G(?:/([^\\s/]+))|http://www\\.([^\\s/&]+)\\.(?:com|net|gov|org)(?:/([^\\s/]+))?)";
        static Pattern WebUrlPattern = Pattern.compile(AdrPattern);

        // WebUrlMatcher, line, fop (the output stream) and prefix are declared elsewhere.
        WebUrlMatcher = WebUrlPattern.matcher(line);
        String newline = System.getProperty("line.separator");

        int cnt = 0;
        while (WebUrlMatcher.find()) {
            // This block only runs until cnt has been incremented below.
            if (cnt == 0) {
                // Group 1: a path segment that continues the previous match
                String extractedPath = WebUrlMatcher.group(1);
                if (extractedPath != null) {
                    fop.write(prefix.toLowerCase().getBytes());
                    fop.write(newline.getBytes());
                    fop.write(extractedPath.toLowerCase().getBytes());
                    fop.write(newline.getBytes());
                }

                // Group 2: the domain; group 3: the optional first path segment
                String extractedPart = WebUrlMatcher.group(2);
                String extracted2 = WebUrlMatcher.group(3);
                if (extractedPart != null) {
                    fop.write(extractedPart.toLowerCase().getBytes());
                    fop.write(newline.getBytes());
                    if (extracted2 != null) {
                        fop.write(extracted2.toLowerCase().getBytes());
                        fop.write(newline.getBytes());
                    }
                    cnt = cnt + 1;
                }
            }
        }
    

1 answer:

Answer 0: (score: 1)

Here is one way to do it: a slight manipulation of the current regex.
After each match, just test which capture groups are set.

 "(?:(?!\\A)\\G(?:/([^\\s/]+))|http://www\\.([^\\s/&]+)\\.(?:com|net)(?:/([^\\s/]+))?)"

 (?:
      (?! \A )                      # Not BOS
      \G                            # Start from last match
      (?:
           /  
           ( [^\s/]+ )                   # (1), Required Next Segment path (or fail)
      )
   |                              # or,
      http://www\.                  # New match
      ( [^\s/&]+ )                  # (2), Domain
      \.
      (?: com | net )               # Extension
      (?:
           /  
           ( [^\s/]+ )                   # (3), Optional First Segment path
      )?
 )
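
As a side note, the expanded layout above should be usable from Java more or less as written if the pattern is compiled with `Pattern.COMMENTS`, which ignores whitespace and `#` comments inside the regex. The following is only a sketch under that assumption; the class name `ExpandedPattern` and the field name `URL_PATTERN` are made up for illustration and do not come from the original post.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExpandedPattern {
    // Same regex as the one-line form, laid out with comments.
    // Pattern.COMMENTS makes Java ignore the whitespace and the # comments.
    static final Pattern URL_PATTERN = Pattern.compile(
        "(?:\n"
      + "   (?! \\A )              # not at the beginning of the input\n"
      + "   \\G                    # continue from the end of the last match\n"
      + "   / ( [^\\s/]+ )         # (1) next path segment, required\n"
      + " |\n"
      + "   http://www\\.          # start of a new URL\n"
      + "   ( [^\\s/&]+ )          # (2) domain\n"
      + "   \\. (?: com | net )    # extension\n"
      + "   (?: / ( [^\\s/]+ ) )?  # (3) optional first path segment\n"
      + ")",
        Pattern.COMMENTS);

    public static void main(String[] args) {
        Matcher m = URL_PATTERN.matcher("http://www.asfdasdf.net/first");
        if (m.find()) {
            System.out.println(m.group(2)); // asfdasdf
            System.out.println(m.group(3)); // first
        }
    }
}
```

Keeping the commented layout in the source makes it easier to see which group number belongs to which part when the alternation is extended.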

Testing the captures:

Input:

http://www.asfdasdf.net/  
http://www.asfdasdf.net/first  
http://www.asfdasdf.net/first/second  

Output:

 **  Grp 0 -  ( pos 0 , len 23 ) 
http://www.asfdasdf.net  
 **  Grp 1 -  NULL 
 **  Grp 2 -  ( pos 11 , len 8 ) 
asfdasdf  
 **  Grp 3 -  NULL 

-------------

 **  Grp 0 -  ( pos 28 , len 29 ) 
http://www.asfdasdf.net/first  
 **  Grp 1 -  NULL 
 **  Grp 2 -  ( pos 39 , len 8 ) 
asfdasdf  
 **  Grp 3 -  ( pos 52 , len 5 ) 
first  

-------------

 **  Grp 0 -  ( pos 61 , len 29 ) 
http://www.asfdasdf.net/first  
 **  Grp 1 -  NULL 
 **  Grp 2 -  ( pos 72 , len 8 ) 
asfdasdf  
 **  Grp 3 -  ( pos 85 , len 5 ) 
first  

-------------

 **  Grp 0 -  ( pos 90 , len 7 ) 
/second  
 **  Grp 1 -  ( pos 91 , len 6 ) 
second  
 **  Grp 2 -  NULL 
 **  Grp 3 -  NULL
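
The capture test above could be reproduced in Java roughly as follows. This is a minimal sketch, not the asker's actual program: the class `UrlParts` and the method `extract` are hypothetical names, and the results are collected into a list instead of being written to a file. The point is that every `find()` call is inspected: group 2 (and possibly group 3) is set when a new URL starts, and group 1 is set when the match continues with a further segment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlParts {
    // The pattern from the answer, unchanged.
    static final Pattern URL_PATTERN = Pattern.compile(
        "(?:(?!\\A)\\G(?:/([^\\s/]+))|http://www\\.([^\\s/&]+)\\.(?:com|net)(?:/([^\\s/]+))?)");

    // Returns the domain followed by each path segment found in the input.
    static List<String> extract(String input) {
        List<String> parts = new ArrayList<>();
        Matcher m = URL_PATTERN.matcher(input);
        while (m.find()) {
            if (m.group(2) != null) {
                // A new URL: group 2 is the domain, group 3 the optional first segment.
                parts.add(m.group(2));
                if (m.group(3) != null) {
                    parts.add(m.group(3));
                }
            } else if (m.group(1) != null) {
                // A continuation match: group 1 is the next path segment.
                parts.add(m.group(1));
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(extract("http://www.asfdasdf.net/"));             // [asfdasdf]
        System.out.println(extract("http://www.asfdasdf.net/first/second")); // [asfdasdf, first, second]
    }
}
```

Unlike a loop that stops handling matches after the first one, this version looks at the groups on every iteration, so segments after the first are not dropped.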