Remove sentences that contain a URL

Time: 2019-06-14 23:53:12

Tags: java regex string

I am looking for a way to remove sentences that contain a URL in Java. Note that I want to remove the whole sentence, not just the URL.

I found an approach that works, but I am looking for something simpler, perhaps a single regex?

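  1. Use BreakIterator to detect each sentence (a sentence may end with ., ? or !): Split string into sentences
  2. Run a regex over each sentence to detect a URL pattern: Detect and extract url from a string?. If a URL is found, remove the sentence.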
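// Assumed setup (not shown in the original post): a sentence iterator, plus
// SENT, a precompiled java.util.regex.Pattern that matches a URL.
BreakIterator iterator = BreakIterator.getSentenceInstance();
String source = "Sorry, we are closed today. Visit our website tomorrow at https://www.google.com. Thank you and have a nice day!";
iterator.setText(source);
int start = iterator.first();
int end = iterator.next();
while (end != BreakIterator.DONE) {
    if (SENT.matcher(source.substring(start, end)).find()) {
        // Cut the sentence out, then restart the scan on the shortened text.
        source = source.substring(0, start) + source.substring(end);
        iterator.setText(source);
        start = iterator.first();
    } else {
        start = end;
    }
    end = iterator.next();
}
System.out.println(source);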

This prints: Sorry, we are closed today. Thank you and have a nice day!
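
For comparison, here is a self-contained sketch of the same two steps that appends the kept sentences to a StringBuilder instead of restarting the iterator after every removal. The class name UrlSentenceFilter, the method removeUrlSentences and the loose URL pattern are assumptions made for illustration; the post does not show its own SENT pattern.

```java
import java.text.BreakIterator;
import java.util.Locale;
import java.util.regex.Pattern;

class UrlSentenceFilter {
    // Assumption: a deliberately loose URL pattern ("http://" or "https://"
    // followed by non-space characters). Swap in a stricter one if needed.
    private static final Pattern URL = Pattern.compile("https?://\\S+");

    static String removeUrlSentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        StringBuilder kept = new StringBuilder();
        int start = it.first();
        // Walk the sentence boundaries once; each [start, end) span is one sentence,
        // including any trailing whitespace.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end);
            if (!URL.matcher(sentence).find()) {
                kept.append(sentence);
            }
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        String source = "Sorry, we are closed today. Visit our website tomorrow at "
                + "https://www.google.com. Thank you and have a nice day!";
        System.out.println(removeUrlSentences(source));
    }
}
```

Because the input string is never modified, the iterator does not need to be reset, and each sentence is examined exactly once.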

2 Answers:

Answer 0 (score: 0)

  

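It is probably best to split the text into sentences first and then pass each one through an expression.

The expression can then return only those lines (sentences) that contain no URL: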

^(?!.*https?[^\s]+.*).*$

Here, a URL is defined as https?[^\s]+, that is, http or https followed by one or more non-whitespace characters.


Test

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Wrapper class added so the snippet compiles; the name is arbitrary.
public class RegexTest {
    public static void main(String[] args) {
        final String regex = "^(?!.*https?[^\\s]+.*).*$";
        final String string = "Sorry, we are closed today. Visit our website tomorrow at https://www.google.com. Thank you and have a nice day!\n\n"
             + "Sorry, we are closed today. Visit our website tomorrow at. Thank you and have a nice day!\n\n"
             + "Sorry, we are closed today. Visit our website tomorrow at https://www.goog. Thank you and have a nice day!\n";

        // MULTILINE makes ^ and $ anchor at each line, so every line is tested as a whole.
        final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
        final Matcher matcher = pattern.matcher(string);

        // Prints only the lines that contain no URL (blank lines match too).
        // The pattern has no capture groups, so group(0) is all there is to print.
        while (matcher.find()) {
            System.out.println("Full match: " + matcher.group(0));
        }
    }
}
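
The split-then-filter idea from this answer can be sketched roughly as follows. The expression is the one given above; the naive sentence splitter, the class name NoUrlLineFilter and the method keepSentencesWithoutUrl are assumptions added for illustration, and a BreakIterator would be a more robust splitter.

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class NoUrlLineFilter {
    // The expression from the answer: matches a whole line only if it contains no URL.
    private static final Pattern NO_URL = Pattern.compile("^(?!.*https?[^\\s]+.*).*$");

    // Assumption: a naive splitter that breaks on whitespace following '.', '?' or '!'.
    private static final Pattern SENTENCE_GAP = Pattern.compile("(?<=[.?!])\\s+");

    static String keepSentencesWithoutUrl(String text) {
        // Assumes no sentence contains a line break, because '.' does not match newlines.
        return Arrays.stream(SENTENCE_GAP.split(text))
                .filter(sentence -> NO_URL.matcher(sentence).matches())
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String source = "Sorry, we are closed today. Visit our website tomorrow at "
                + "https://www.google.com. Thank you and have a nice day!";
        System.out.println(keepSentencesWithoutUrl(source));
    }
}
```

Filtering one sentence at a time avoids the situation in the test above, where a line holding several sentences is discarded as a whole as soon as any of them contains a URL.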


Answer 1 (score: 0)

"(?<=^|[?!.])[^?!.]+" + urlRegex + ".*?(?:$|[?!.])"

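Depending on your definition of a sentence, this matches every whole sentence that contains a match for urlRegex; you can use replaceAll to get rid of them. (There are plenty of URL regexes around and you did not specify which one to use, so I leave the exact definition of a URL to you.)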
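
A minimal sketch of how this expression might be used with replaceAll. The URL_REGEX below is an assumption, not part of the answer: it is written so that the URL match cannot end in '.', '?' or '!', which lets the trailing [?!.] in the sentence pattern consume the sentence terminator. The class name SingleRegexRemoval and the method removeUrlSentences are also hypothetical.

```java
class SingleRegexRemoval {
    // Assumption: a loose URL pattern whose last character is never '.', '?' or '!'.
    private static final String URL_REGEX = "https?://\\S*[^\\s.?!]";

    // The sentence pattern from the answer, with URL_REGEX plugged in.
    private static final String SENTENCE_WITH_URL =
            "(?<=^|[?!.])[^?!.]+" + URL_REGEX + ".*?(?:$|[?!.])";

    static String removeUrlSentences(String text) {
        return text.replaceAll(SENTENCE_WITH_URL, "");
    }

    public static void main(String[] args) {
        String source = "Sorry, we are closed today. Visit our website tomorrow at "
                + "https://www.google.com. Thank you and have a nice day!";
        System.out.println(removeUrlSentences(source));
    }
}
```

Note that the part of the sentence before the URL must not contain '.', '?' or '!' (an abbreviation such as "e.g." would be treated as a sentence boundary), and that the whitespace following a removed sentence is left in place.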