
Question

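In Java, I need to match the <a> tags in a string that have no href attribute. For example, in the following string:

text <a class="aClass" href="#">link1</a> text <a class="aClass" target="_blank">link2</a> text

it should not match <a class="aClass" href="#">link1</a> (because it contains href), but it should match <a class="aClass" target="_blank">link2</a> (because it does not contain href).

I managed to build a RegEx that matches my tags:

<a[^>]*>(.*?)</a>

but I can't figure out how to exclude the tags that have an href.

(I know I could use an HTML parser and so on, but I need to do this with RegEx.)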

Answer 1

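Description

Be careful with regular expressions like <a[^>]*, because they will also match other valid html tags that start with a, such as <abbr> or <address>. Also, simply checking for the presence of the string href is not enough, because that string may sit inside the value of another attribute, as in <a class="thishrefstuff"..., or be part of another attribute name, as in <a hreflang="en"...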
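To make the first pitfall concrete, here is a minimal sketch that runs the naive opening-tag pattern against a string containing <abbr> and <address>, and then a stricter variant that requires whitespace or > right after the tag name. The class name NaivePatternDemo, the helper findAll and the sample string are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NaivePatternDemo {
    // Hypothetical helper: collects every substring of input matched by regex.
    static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String html = "<abbr>HTML</abbr> <address>Main St</address> <a class=\"x\">link</a>";

        // Naive: accepts any tag whose name merely starts with "a".
        System.out.println(findAll("<a[^>]*>", html));
        // prints [<abbr>, <address>, <a class="x">]

        // Stricter: the character after "<a" must be whitespace or ">".
        System.out.println(findAll("<a(?=\\s|>)[^>]*>", html));
        // prints [<a class="x">]
    }
}
```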

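This expression will:

- match all anchor tags <a ... </a> that do not contain an href attribute
- force the tag name to be a, and not some other tag such as <address>
- ignore attributes that merely have the substring href embedded in the attribute name, such as the valid hreflang='en' or the made-up Attributehref="some value"
- ignore all properly formed attributes that have href inside their value, such as bogus='href=""'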

<a(?=\s|>)(?!(?:[^>=]|=(['"])(?:(?!\1).)*\1)*?\shref=['"])[^>]*>.*?<\/a>


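Explanation

<a(?=\s|>) matches the start of the opening tag and makes sure the character right after the tag name is whitespace or a closing bracket, which forces the name to be a and nothing else
(?! starts a negative lookahead: if an href is found inside this tag, then this tag is not one we are looking for
- (?: starts a non-capturing group that walks through all the characters inside the tag
- [^>=] matches any character that is neither >, which keeps the regex engine from leaving the tag, nor =, which keeps it from blindly running through attribute values
- | or
- =(['"]) matches an equals sign followed by an opening double or single quote. The quote is captured into group 1 so that it can be paired correctly later
- (?:(?!\1).)* matches all characters that are not the same as the opening quote
- \1 matches the correct closing quote
- )*? closes the non-capturing group and repeats it as few times as needed, until
- \shref=['"] matches the wanted href attribute. The \s and =['"] make sure the attribute name is exactly href
- ) closes the negative lookahead
[^>]*>.*?<\/a> matches the rest of the element, from the end of the opening tag through the closing tag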
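The breakdown above can also be kept next to the pattern itself. The following is a sketch, not part of the original answer, that writes the same expression with Pattern.COMMENTS so each piece carries its own note. The class name CommentedPatternDemo and the helper firstMatch are made up for illustration, and the closing tag is written as </a> because the slash needs no escaping in Java.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommentedPatternDemo {
    // With Pattern.COMMENTS, whitespace and "# ..." comments inside the
    // pattern text are ignored by the regex engine.
    static final Pattern NO_HREF_ANCHOR = Pattern.compile(
          "<a(?=\\s|>)          # tag name must be exactly a\n"
        + "(?!                  # reject the tag if an href attribute is found\n"
        + "  (?:                # walk through the characters inside the tag\n"
        + "    [^>=]            # any character except the tag end and equals sign\n"
        + "    |                # or\n"
        + "    =(['\"])         # equals sign plus opening quote, captured as group 1\n"
        + "    (?:(?!\\1).)*    # attribute value, up to the matching quote\n"
        + "    \\1              # the closing quote\n"
        + "  )*?                # repeat as few times as needed\n"
        + "  \\shref=['\"]      # a real href attribute\n"
        + ")                    # end of the negative lookahead\n"
        + "[^>]*>.*?</a>        # rest of the opening tag, content, closing tag\n",
        Pattern.COMMENTS | Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Hypothetical helper: returns the first matching anchor, or null if none.
    static String firstMatch(String html) {
        Matcher m = NO_HREF_ANCHOR.matcher(html);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String html = "<abbr>RADIO</abbr> text <a class=\"aClass\" href=\"#\">link1</a> text "
            + "<a bogus='href=\"\"' class=\"aClass\" target=\"_blank\">link2</a> text";
        System.out.println(firstMatch(html));
    }
}
```

The commented layout only works because this pattern contains no literal spaces or # characters; a pattern that does would need them escaped.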

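Java code example:

Input text

<abbr>RADIO</abbr> text <a class="aClass" href="#">link1</a> text <a bogus='href=""' class="aClass" target="_blank">link2</a> text

Code

If you want to use this in a replace function to remove the anchor tags that have no href, then simply replace all of the matches.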

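import java.util.regex.Pattern;
import java.util.regex.Matcher;

class Module1 {
  public static void main(String[] asd) {
    // The input text shown above, escaped as a Java string literal.
    String sourcestring = "<abbr>RADIO</abbr> text <a class=\"aClass\" href=\"#\">link1</a> text "
        + "<a bogus='href=\"\"' class=\"aClass\" target=\"_blank\">link2</a> text";
    // The pattern has to stay on one line: a string literal cannot span a line break.
    Pattern re = Pattern.compile(
        "<a(?=\\s|>)(?!(?:[^>=]|=(['\"])(?:(?!\\1).)*\\1)*?\\shref=['\"])[^>]*>.*?<\\/a>",
        Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);
    Matcher m = re.matcher(sourcestring);
    int mIdx = 0;
    while (m.find()) {
      // Group 0 is the whole match; group 1 is the quote captured inside the lookahead.
      for (int groupIdx = 0; groupIdx < m.groupCount() + 1; groupIdx++) {
        System.out.println("[" + mIdx + "][" + groupIdx + "] = " + m.group(groupIdx));
      }
      mIdx++;
    }
  }
}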

Matches

$matches Array:
(
    [0] => Array
        (
            [0] => <a bogus='href=""' class="aClass" target="_blank">link2</a>
        )

    [1] => Array
        (
            [0] => 
        )

)
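The replacement use mentioned earlier in this answer could look roughly like the sketch below, which deletes every anchor that has no href, content included. The class name RemoveNoHrefAnchors and the helper strip are made up for illustration; the pattern is the one from this answer, with the closing tag written as </a>.

```java
import java.util.regex.Pattern;

class RemoveNoHrefAnchors {
    static final Pattern NO_HREF_ANCHOR = Pattern.compile(
        "<a(?=\\s|>)(?!(?:[^>=]|=(['\"])(?:(?!\\1).)*\\1)*?\\shref=['\"])[^>]*>.*?</a>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Hypothetical helper: removes each anchor without href, including its content.
    static String strip(String html) {
        return NO_HREF_ANCHOR.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<abbr>RADIO</abbr> text <a class=\"aClass\" href=\"#\">link1</a> text "
            + "<a bogus='href=\"\"' class=\"aClass\" target=\"_blank\">link2</a> text";
        System.out.println(strip(html));
        // prints <abbr>RADIO</abbr> text <a class="aClass" href="#">link1</a> text  text
    }
}
```

Note the two spaces left behind where link2 used to be; whether to collapse them depends on what the surrounding text is used for.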

Answer 2

I find it strange that you need to do this with a regular expression, but you can use a negative lookahead.

<a(?![^>]+href).*?>(.*?)</a>
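As a quick check of what this shorter pattern accepts, here is a small sketch with four single-anchor inputs. It handles the question's two cases, but because the lookahead only searches for the letters href anywhere before the closing bracket, it also rejects anchors that carry hreflang or that mention href inside another attribute's value. The class name SimpleLookaheadDemo, the helper matches and the link3/link4 inputs are made up for illustration.

```java
import java.util.regex.Pattern;

class SimpleLookaheadDemo {
    static final Pattern SIMPLE = Pattern.compile("<a(?![^>]+href).*?>(.*?)</a>");

    // Hypothetical helper: true if the pattern finds an anchor in html.
    static boolean matches(String html) {
        return SIMPLE.matcher(html).find();
    }

    public static void main(String[] args) {
        // No href at all: matched, as wanted.
        System.out.println(matches("<a class=\"aClass\" target=\"_blank\">link2</a>")); // true
        // Real href attribute: skipped, as wanted.
        System.out.println(matches("<a class=\"aClass\" href=\"#\">link1</a>"));        // false
        // hreflang is a different attribute, yet the tag is skipped.
        System.out.println(matches("<a hreflang=\"en\">link3</a>"));                    // false
        // href only appears inside another attribute's value, yet the tag is skipped.
        System.out.println(matches("<a bogus='href=\"\"'>link4</a>"));                  // false
    }
}
```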

Answer 3

I'm not a Java expert, but you could try something like this:

String regex = "(?i)<a(?>[^h>]++|(?<! )h++|h++(?!ref\\s*+=))*>((?>[^<]++|<(?!/a>))*)</a>";
String replacement = "$1";
// Strings are immutable, so keep the value returned by replaceAll.
str = str.replaceAll(regex, replacement);
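For reference, a self-contained sketch of this replacement applied to the string from the question. Replacing with $1 keeps the link text and drops only the surrounding tags of the anchor that has no href. The class name UnwrapDemo and the helper unwrap are made up for illustration.

```java
class UnwrapDemo {
    static final String REGEX =
        "(?i)<a(?>[^h>]++|(?<! )h++|h++(?!ref\\s*+=))*>((?>[^<]++|<(?!/a>))*)</a>";

    // Hypothetical helper: replaces each anchor without href by its own content.
    static String unwrap(String html) {
        return html.replaceAll(REGEX, "$1");
    }

    public static void main(String[] args) {
        String str = "text <a class=\"aClass\" href=\"#\">link1</a> text "
            + "<a class=\"aClass\" target=\"_blank\">link2</a> text";
        System.out.println(unwrap(str));
        // prints text <a class="aClass" href="#">link1</a> text link2 text
    }
}
```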

Answer 4

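One option you have is to first match all the <a> tags, and then use a second regular expression to pick out the ones that have an href so you can ignore them. So your pseudocode would look like:

aTags = html.findAll(<a> tags);
for (String aTag : aTags) {
    if (aTag.hasHref()) continue;
    // do processing
}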
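The pseudocode above could be turned into Java along these lines. This is only a sketch under stated assumptions: the class name TwoPassDemo, the helper anchorsWithoutHref and both patterns are made up for illustration, and the href check is deliberately simple (whitespace, then href, then an equals sign), so it can still be fooled by an attribute value that contains that exact sequence.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoPassDemo {
    // Pass 1: every anchor element whose tag name is exactly "a".
    static final Pattern ANCHOR = Pattern.compile(
        "<a(?=\\s|>)[^>]*>.*?</a>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // Pass 2 (simplified assumption): an href attribute inside the opening tag.
    static final Pattern HREF = Pattern.compile(
        "^<a[^>]*?\\shref\\s*=", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns the anchors that carry no href attribute.
    static List<String> anchorsWithoutHref(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            String tag = m.group();
            if (HREF.matcher(tag).find()) {
                continue; // has an href, so ignore it
            }
            result.add(tag); // do processing here
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "text <a class=\"aClass\" href=\"#\">link1</a> text "
            + "<a class=\"aClass\" target=\"_blank\">link2</a> text";
        System.out.println(anchorsWithoutHref(html));
        // prints [<a class="aClass" target="_blank">link2</a>]
    }
}
```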

RegEx to match <a> html tags without specific attribute
