Question

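I am looking for a way to remove sentences that contain a URL in Java. Note that I want to remove the whole sentence, not just the URL.

I found a way to do this, and it works, but I am looking for a simpler way, perhaps with a single regex?

Detect each sentence (which may end with ., ? or !) using BreakIterator: Split string into sentences
Check each sentence with a regular expression for the URL pattern: Detect and extract url from a string?. If a match is found, remove the sentence.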

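import java.text.BreakIterator;
import java.util.Locale;
import java.util.regex.Pattern;

// Assumed declarations: the original snippet did not show how these were created.
BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.US);
Pattern SENT = Pattern.compile("https?://\\S+"); // placeholder URL pattern

String source = "Sorry, we are closed today. Visit our website tomorrow at https://www.google.com. Thank you and have a nice day!";
iterator.setText(source);
int start = iterator.first();
int end = iterator.next();
while (end != BreakIterator.DONE) {
    if (SENT.matcher(source.substring(start, end)).find()) {
        // The sentence contains a URL: cut it out and rescan the shortened text.
        source = source.substring(0, start) + source.substring(end);
        iterator.setText(source);
        start = iterator.first();
    } else {
        // No URL: move on to the next sentence.
        start = end;
    }
    end = iterator.next();
}
System.out.println(source);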

This prints: Sorry, we are closed today. Thank you and have a nice day!

Answer 1

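It is better to break/split the text into sentences first and then run each one through an expression.

The expression can then return only those lines (sentences) that do not contain a URL: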

^(?!.*https?[^\s]+.*).*$

Here, we define a URL as https?[^\s]+.
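The split-then-filter idea can be sketched as follows. This is only an illustration under stated assumptions: the class name SentenceFilter, the method removeUrlSentences, and the split rule (a sentence ends at '.', '?' or '!' followed by whitespace) are hypothetical and not part of the answer. Only the filtering expression comes from it.

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class SentenceFilter {
    // The expression from the answer: it matches only text with no "http"/"https" token.
    // DOTALL is added so that a sentence spanning several lines is still checked as a whole.
    static final Pattern NO_URL =
            Pattern.compile("^(?!.*https?[^\\s]+.*).*$", Pattern.DOTALL);

    static String removeUrlSentences(String source) {
        // Assumption: a sentence ends at '.', '?' or '!' followed by whitespace.
        // Dots inside a URL are not followed by whitespace, so the URL stays in one piece.
        return Arrays.stream(source.split("(?<=[.?!])\\s+"))
                .filter(sentence -> NO_URL.matcher(sentence).matches())
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String source = "Sorry, we are closed today. Visit our website tomorrow at "
                + "https://www.google.com. Thank you and have a nice day!";
        // prints: Sorry, we are closed today. Thank you and have a nice day!
        System.out.println(removeUrlSentences(source));
    }
}
```

Note that the kept sentences are rejoined with a single space, so the original whitespace between sentences is normalized. If that matters, a BreakIterator-based split as in the question preserves it.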

Test

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Matches a whole line only if no URL (as defined above) appears anywhere in it.
final String regex = "^(?!.*https?[^\\s]+.*).*$";
final String string = "Sorry, we are closed today. Visit our website tomorrow at https://www.google.com. Thank you and have a nice day!\n\n"
     + "Sorry, we are closed today. Visit our website tomorrow at. Thank you and have a nice day!\n\n"
     + "Sorry, we are closed today. Visit our website tomorrow at https://www.goog. Thank you and have a nice day!\n";

// MULTILINE makes ^ and $ match at the start and end of every line.
final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
final Matcher matcher = pattern.matcher(string);

// Only lines without a URL are reported; blank lines also match, as empty strings.
// The expression has no capturing groups, so group(0) is the only group to print.
while (matcher.find()) {
    System.out.println("Full match: " + matcher.group(0));
}

Answer 2

"(?<=^|[?!.])[^?!.]+" + urlRegex + ".*?(?:$|[?!.])"

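Depending on your definition of a sentence, this matches every whole sentence that contains a match for urlRegex; you can use replaceAll to get rid of them. (There are many URL regexes around, and you did not specify which one you want to use, so I leave the exact definition of a URL to you.)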
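As a minimal sketch of how this expression might be used with replaceAll, here is one possible way to fill in urlRegex. The URL definition below is an assumption, not something the answer specifies, and the class name RemoveUrlSentences and the method removeUrlSentences are hypothetical. The URL pattern is written so that it ends on a character that is not whitespace or a sentence terminator. With a greedy pattern such as https?://\S+, the URL could swallow the closing period and the match could then run on into the next sentence.

```java
import java.util.regex.Pattern;

public class RemoveUrlSentences {
    // Hypothetical URL definition (an assumption): the last character must not be
    // whitespace or one of '?', '!', '.', so a trailing '.' is left to end the sentence.
    static final String URL_REGEX = "https?://\\S*[^\\s?!.]";

    // The expression from the answer, with URL_REGEX substituted for urlRegex.
    static final Pattern URL_SENTENCE = Pattern.compile(
            "(?<=^|[?!.])[^?!.]+" + URL_REGEX + ".*?(?:$|[?!.])");

    static String removeUrlSentences(String source) {
        // Every sentence that contains a URL is replaced with the empty string.
        return URL_SENTENCE.matcher(source).replaceAll("");
    }

    public static void main(String[] args) {
        String source = "Sorry, we are closed today. Visit our website tomorrow at "
                + "https://www.google.com. Thank you and have a nice day!";
        // prints: Sorry, we are closed today. Thank you and have a nice day!
        System.out.println(removeUrlSentences(source));
    }
}
```

The match starts right after the previous terminator, so the leading space of the removed sentence goes with it and the remaining sentences keep their original spacing.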
