Removing URLs from text with a regular expression

Time: 2018-04-23 15:35:19

Tags: java regex

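I need a regular expression that matches only URLs such as www.example.com or https://www.example.com, but not something like example.example, so the regex has to be restricted to a fixed set of top-level domains (com | fr | org).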

Here is what I tried:

String txt = "blabla https://www.pris.com https://pris.com www.Iris.fr iris.com no.po";

        txt = txt.replaceAll("^*[a-zA-Z0-9\\-\\.]+\\.(com|org|net|mil|edu|COM|ORG|NET|MIL|EDU)$","");// delete www or http or https starting
        txt = txt.replaceAll("((http|https)://)?[a-zA-Z]\\w*(\\.\\w+)+(/\\w*(\\.\\w+)*)*(\\?.+)*","");

1 answer:

Answer 0 (score: 1)

Try this regex:

(?i)\b(?!mailto:)(?:(?:https?|ftp)://)?(?:\S+(?::\S*)?@)?(?:(?:(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?-i:com|org|net|mil|edu|fr|COM|ORG|NET|MIL|EDU|FR))))\b(?:/[^/\s]*)*/?

You can try it here: https://www.regexplanet.com/share/index.html?share=yyyyy72am6r
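
As a rough sketch of how the regex above could be plugged into the Java code from the question: the class name UrlStripper and the method stripUrls are hypothetical names chosen for illustration, and the whitespace cleanup after the replacement is an extra step that the answer does not mention. Every backslash is doubled because the pattern sits in a Java string literal.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are not from the original post.
final class UrlStripper {

    // The answer's regex, split across lines only for readability.
    // Backslashes are doubled because this is a Java string literal.
    private static final Pattern URL = Pattern.compile(
        "(?i)\\b(?!mailto:)(?:(?:https?|ftp)://)?(?:\\S+(?::\\S*)?@)?"
        + "(?:(?:(?:[1-9]\\d?|1\\d\\d|2[01]\\d|22[0-3])"
        + "(?:\\.(?:1?\\d{1,2}|2[0-4]\\d|25[0-5])){2}"
        + "(?:\\.(?:[1-9]\\d?|1\\d\\d|2[0-4]\\d|25[0-4]))"
        + "|(?:(?:[a-z\\u00a1-\\uffff0-9]+-?)*[a-z\\u00a1-\\uffff0-9]+)"
        + "(?:\\.(?:[a-z\\u00a1-\\uffff0-9]+-?)*[a-z\\u00a1-\\uffff0-9]+)*"
        + "(?:\\.(?-i:com|org|net|mil|edu|fr|COM|ORG|NET|MIL|EDU|FR))))"
        + "\\b(?:/[^/\\s]*)*/?");

    // Removes every URL match, then collapses the whitespace left behind
    // (the whitespace cleanup is an assumption, not part of the answer).
    static String stripUrls(String text) {
        String withoutUrls = URL.matcher(text).replaceAll("");
        return withoutUrls.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String txt = "blabla https://www.pris.com https://pris.com www.Iris.fr iris.com no.po";
        System.out.println(stripUrls(txt));
    }
}
```

With the sample string from the question, this should leave only the two tokens that are not URLs (blabla and no.po), since no.po does not end in one of the listed top-level domains.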

Readable version:

 (?i)
 \b 
 (?! mailto: )
 (?:
      (?: https? | ftp )
      ://
 )?
 (?:
      \S+ 
      (?: : \S* )?
      @
 )?
 (?:
      (?:
           (?: [1-9] \d? | 1 \d\d | 2 [01] \d | 22 [0-3] )
           (?:
                \.
                (?: 1? \d{1,2} | 2 [0-4] \d | 25 [0-5] )
           ){2}
           (?:
                \.
                (?: [1-9] \d? | 1 \d\d | 2 [0-4] \d | 25 [0-4] )
           )
        |  (?:
                (?: [a-z\u00a1-\uffff0-9]+ -? )*
                [a-z\u00a1-\uffff0-9]+ 
           )
           (?:
                \.
                (?: [a-z\u00a1-\uffff0-9]+ -? )*
                [a-z\u00a1-\uffff0-9]+ 
           )*
           (?:
                \.
                (?-i: com | org | net | mil | edu | fr | COM | ORG | NET | MIL | EDU | FR )
           )
      )
 )
 \b 
 (?: / [^/\s]* )*
 /?
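
The spaced-out layout above can also be used nearly as-is in Java, assuming the pattern is compiled with the Pattern.COMMENTS flag, which tells the engine to ignore whitespace inside the pattern. The following is only a sketch: the class name ReadableUrlPattern and the method stripUrls are hypothetical, and a few of the lines are merged to keep the listing short.

```java
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration.
final class ReadableUrlPattern {

    // Same pattern as the readable layout; Pattern.COMMENTS makes the
    // engine skip the spaces and line breaks between the tokens.
    static final Pattern URL = Pattern.compile(String.join("\n",
        "(?i)",
        "\\b",
        "(?! mailto: )",
        "(?: (?: https? | ftp ) :// )?",            // optional scheme
        "(?: \\S+ (?: : \\S* )? @ )?",              // optional user:password@
        "(?:",
        "  (?:",
        // dotted IPv4 address
        "    (?: [1-9] \\d? | 1 \\d\\d | 2 [01] \\d | 22 [0-3] )",
        "    (?: \\. (?: 1? \\d{1,2} | 2 [0-4] \\d | 25 [0-5] ) ){2}",
        "    (?: \\. (?: [1-9] \\d? | 1 \\d\\d | 2 [0-4] \\d | 25 [0-4] ) )",
        "  |",
        // host name labels followed by one of the allowed top-level domains
        "    (?: (?: [a-z\\u00a1-\\uffff0-9]+ -? )* [a-z\\u00a1-\\uffff0-9]+ )",
        "    (?: \\. (?: [a-z\\u00a1-\\uffff0-9]+ -? )* [a-z\\u00a1-\\uffff0-9]+ )*",
        "    (?: \\. (?-i: com | org | net | mil | edu | fr | COM | ORG | NET | MIL | EDU | FR ) )",
        "  )",
        ")",
        "\\b",
        "(?: / [^/\\s]* )*",                        // optional path segments
        "/?"),
        Pattern.COMMENTS);

    // Removes every match and tidies the leftover whitespace
    // (the tidy-up step is an assumption, not part of the answer).
    static String stripUrls(String text) {
        return URL.matcher(text).replaceAll("").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(stripUrls("blabla https://www.pris.com www.Iris.fr no.po"));
    }
}
```

Keeping the pattern in this form makes it easier to adjust the list of top-level domains later, at the cost of a longer source file.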