Question

I have some text inside HTML markup in a document. The text looks like this:

I need this text &lt;ref&gt; Some unwanted text &lt;/ref&gt; I need this text too

and

I need this text &lt;ref Some random text /&gt; I need this text too

How can I remove the unwanted text along with the enclosing tags?

I tried these regular expressions, but they did not work:

&lt;ref(.*?)&gt;(.*?)&lt;/ref&gt;

and

&lt;ref(.*?)&gt;

Trying it this way in Java did not help either:

regex = "&lt;ref(.*?)&gt;(.*?)&lt;/ref&gt;";
p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE | Pattern.DOTALL | Pattern.MULTILINE); 
m = p.matcher(s);
while(m.find()){
   m.replaceAll(" ");           
}

Any idea how I can get to a solution?

Answer 1

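First, use an HTML parser. If the HTML gets complex, regular expressions will not handle this task reliably.

Second, your regular expression appears to be well-formed and works as expected on the simple examples (once I changed &lt; to <, that is, though I suspect you changed it while posting the question, thinking Stack Overflow would misinterpret it). The problem is probably in your Java code rather than in the regex itself. I am not familiar with Java's regex API, so I will leave that part to someone else :)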

Answer 2

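Using regular expressions to parse HTML should be avoided.
Since yours is a relatively simple case, let's go with it anyway. You are matching actual HTML, so you do not want &lt;, you want the actual < (and likewise > rather than &gt;).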
```
<ref[^>]*/>|<ref>[^<]*</ref>
```
I should point out that I have not used regular expressions in Java, so I do not know whether the / in it needs to be escaped.
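For what it is worth, here is a minimal sketch of how this pattern might be applied from Java, assuming the input has already been unescaped so that it contains real < and > characters. The class name RefTagStripper and the helper strip are made up for illustration. Java regex patterns are ordinary strings with no delimiters, so the / is written without escaping here.

```java
import java.util.regex.Pattern;

public class RefTagStripper {
    // Answer 2's pattern: a self-closing <ref ... /> or a paired <ref>...</ref>.
    // The slash is an ordinary character in a Java regex, so it is not escaped.
    private static final Pattern REF =
            Pattern.compile("<ref[^>]*/>|<ref>[^<]*</ref>");

    // Hypothetical helper: returns a copy of the input with the ref tags removed.
    static String strip(String input) {
        return REF.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("I need this text <ref> Some unwanted text </ref> I need this text too"));
        System.out.println(strip("I need this text <ref Some random text /> I need this text too"));
    }
}
```

Note that this pattern only handles a paired tag written exactly as <ref> with no attributes and no nested markup inside it, which is part of why a parser is the safer choice for anything more involved.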

Answer 3

Strings are immutable, so replaceAll(), like any other "string-mutating" method, returns the result as a new string.
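As a small illustration of that point, the sketch below contrasts throwing away the return value of replaceAll() with keeping it. The class and method names are invented for this example; the pattern is the first one from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReplaceAllResult {
    private static final Pattern REF = Pattern.compile(
            "&lt;ref(.*?)&gt;(.*?)&lt;/ref&gt;",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Mirrors the question's mistake: the value returned by replaceAll() is discarded.
    static String discardResult(String s) {
        Matcher m = REF.matcher(s);
        m.replaceAll(" ");
        return s; // still the original text
    }

    // Keeps the new string that replaceAll() returns.
    static String keepResult(String s) {
        return REF.matcher(s).replaceAll(" ");
    }

    public static void main(String[] args) {
        String s = "keep &lt;ref&gt; drop &lt;/ref&gt; keep";
        System.out.println(discardResult(s)); // unchanged input
        System.out.println(keepResult(s));    // tag and its content replaced by a space
    }
}
```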

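import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RemoveRefTags {
  public static void main(String[] args) {
    String[] ss = {
      "I need this text &lt;ref&gt; Some unwanted text &lt;/ref&gt; I need this text too",
      "I need this text &lt;ref Some random text /&gt; I need this text too"
    };

    // Paired form first, then the shorter self-closing form.
    String r = "&lt;ref(.*?)&gt;(.*?)&lt;/ref&gt;|&lt;ref(.*?)&gt;";

    Pattern p = Pattern.compile(r, Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    for (String s0 : ss) {
      Matcher m = p.matcher(s0);
      // replaceAll() returns a new string; s0 itself is unchanged.
      String s1 = m.replaceAll("");
      System.out.printf("%n%s%n%s%n", s0, s1);
    }
  }
}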

Output:

I need this text &lt;ref&gt; Some unwanted text &lt;/ref&gt; I need this text too
I need this text  I need this text too

I need this text &lt;ref Some random text /&gt; I need this text too
I need this text  I need this text too

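A few other notes:

When I combined your regexes, I had to put the longer one first in the alternation. It is important to try them in that order, because the shorter one (for empty/self-closing tags) can also match the opening tag of a normal element, which you do not want.
You do not need to call find(); that is the first thing replaceAll() does. If there is no match, it simply returns the original string.
The MULTILINE flag does nothing useful, because there are no line anchors (^ and $) in your regex (or mine).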
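The notes on alternation order and on MULTILINE can be checked with a short experiment. This is only a sketch: the class name, constants, and helper are made up for illustration, and the input is the first sample string from the question.

```java
import java.util.regex.Pattern;

public class AlternationOrder {
    static final String LONG_FORM = "&lt;ref(.*?)&gt;(.*?)&lt;/ref&gt;";
    static final String SHORT_FORM = "&lt;ref(.*?)&gt;";

    // Hypothetical helper: removes every match of regex from s.
    static String strip(String regex, int flags, String s) {
        return Pattern.compile(regex, flags).matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        String s = "I need this text &lt;ref&gt; Some unwanted text &lt;/ref&gt; I need this text too";
        int flags = Pattern.CASE_INSENSITIVE | Pattern.DOTALL;

        // Longer alternative first: the whole element is removed.
        System.out.println(strip(LONG_FORM + "|" + SHORT_FORM, flags, s));

        // Shorter alternative first: only the opening tag is removed,
        // leaving the unwanted text and the closing tag behind.
        System.out.println(strip(SHORT_FORM + "|" + LONG_FORM, flags, s));

        // Adding MULTILINE gives the same result, because the pattern has no ^ or $.
        String with = strip(LONG_FORM + "|" + SHORT_FORM, flags | Pattern.MULTILINE, s);
        String without = strip(LONG_FORM + "|" + SHORT_FORM, flags, s);
        System.out.println(with.equals(without));
    }
}
```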

How to remove the text between &lt;ref&gt; and &lt;/ref&gt;
