Regex - How to find HTML <a> tag content by its class?

Date: 2015-06-04 08:22:41

Tags: java regex

I need to get the content of an <a> HTML tag by a certain CSS class name. The CSS class that I need to find is: whtbigheader

What I have done so far is this:

    content = "<A HREF='/articles/0,7340,L-4664450,00.html' CLASS='whtbigheader' style='color:#FFFFFF;' HM=1>need to get this value</A>";

    Pattern p = Pattern.compile("<A.+?class\\s*?=[whtbigheader]['\"]?([^ '\"]+).*?>(.*?)</A>");
    Matcher m = p.matcher(content);

    if (m.find()) {
        System.out.println("found");
        System.out.println(m.group(1));
    }
    else {
        System.out.println("not found");
    }

The expected value is: need to get this value

More info:

  • I can use only regex
  • The content is a whole HTML string

Any ideas how to find it?

4 answers:

Answer 0 (score: 4)

I dislike using regex for HTML parsing, which is why this solution may not be what the asker was hoping for:

Use Jsoup to achieve this:

String html = "..."; // your HTML code
Document doc = Jsoup.parse(html);
Elements elements = doc.select(".whtbigheader"); // that's it: this contains all the tags with whtbigheader as their class

To make sure you only get <a> tags:

Elements elements = doc.select("a.whtbigheader"); // only <a> elements that carry the class themselves

To get the text, just iterate over the elements and read the text of each one:

for(Element element : elements){
   System.out.println(element.text());
}

Download link:

Download Jsoup 1.8.2 here :).

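Answer 1 (score: 1)

Use a non-capturing group instead of square brackets to match the word. Square brackets define a character class, so [whtbigheader] matches a single character from that set rather than the whole word.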

Pattern p = Pattern.compile("(?i)<A.+?class\\s*?=(['\"])?(?:whtbigheader)\\1[^>]*>(.*?)</A>");
Matcher m = p.matcher(content);

if (m.find()) {
    System.out.println("found");
    System.out.println(m.group(2));
}
else {
    System.out.println("not found");
}
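
To illustrate the difference between square brackets and a group, here is a minimal sketch. The class name CharClassVsGroup and the helper firstMatch are made up for this illustration; only java.util.regex is used.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
public class CharClassVsGroup {
    // Returns the first match of the regex in the input, or null if there is none.
    public static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "CLASS='whtbigheader'";
        // Character class: matches exactly one character out of the listed set.
        System.out.println(firstMatch("[whtbigheader]", input));   // prints "w"
        // Non-capturing group: matches the whole literal word.
        System.out.println(firstMatch("(?:whtbigheader)", input)); // prints "whtbigheader"
    }
}
```

Since the group contains only a literal, plain whtbigheader without any grouping would behave the same way here.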


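Answer 2 (score: 1)

A parser is a more robust way to extract information from HTML. However, in this case it is possible to get what you want with a regular expression (assuming you never have nested anchor tags; if you do, you may want to sanity-check your file, and you will definitely need a parser).

You can use the following regular expression (with the case-insensitive flag):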

"<a\\s+(?:[^>]+\\s+)?class\\s*=\\s*(?:whtbigheader(?=\\s|>)|(['\"])(?:(?:(?!\\1).)*?\\s+)*whtbigheader(?:\\s+(?:(?!\\1).)*?)*\\1)[^>]*>(.*?)</a>"

You want to extract the second group of the match, like this:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {
static final Pattern ANCHOR_PATTERN = Pattern.compile(
        "<a\\s+(?:[^>]+\\s+)?class\\s*=\\s*(?:whtbigheader(?=\\s|>)|(['\"])(?:(?:(?!\\1).)*?\\s+)*whtbigheader(?:\\s+(?:(?!\\1).)*?)*\\1)[^>]*>(.*?)</a>",
        Pattern.CASE_INSENSITIVE
);
public static String getAnchorContents( final String html ){
    final Matcher matcher = ANCHOR_PATTERN.matcher( html );
    if ( matcher.find() ){
        return matcher.group(2);
    }
    return null;
}

public static void main( final String[] args ){
    final String[] tests = {
            "<a class=whtbigheader>test</a>",
            "<a class=\"whtbigheader\">test</a>",
            "<a class='whtbigheader'>test</a>",
            "<a class =whtbigheader>test</a>",
            "<a class =\"whtbigheader\">test</a>",
            "<a class ='whtbigheader'>test</a>",
            "<a class= whtbigheader>test</a>",
            "<a class= \"whtbigheader\">test</a>",
            "<a class= 'whtbigheader'>test</a>",
            "<a class = whtbigheader>test</a>",
            "<a class\t=\r\n\"whtbigheader\">test</a>",
            "<a class =\t'whtbigheader'>test</a>",
            "<a class=\"otherclass whtbigheader\">test</a>",
            "<a class=\"whtbigheader otherclass\">test</a>",
            "<a class=\"whtbigheader2 whtbigheader\">test</a>",
            "<a class=\"otherclass whtbigheader otherotherclass\">test</a>",
            "<a class=whtbigheader href=''>test</a>",
    };
    int successes = 0;
    int failures = 0;
    for ( final String test : tests )
    {
        final String contents = getAnchorContents( test );
        if ( "test".equals( contents ) )
            successes++;
        else
        {
            System.err.println( test + " => " + contents );
            failures++;
        }
    }
    final String[] failingTests = {
            "<a class=whtbigheader2>test</a>",
            "<a class=awhtbigheader>test</a>",
            "<a class=whtbigheader-other>test</a>",
            "<a class='whtbigheader2'>test</a>",
            "<a class='awhtbigheader'>test</a>",
            "<a class='whtbigheader-other'>test</a>",
            "<a class=otherclass whtbigheader>test</a>",
            "<a class='otherclass' whtbigheader='value'>test</a>",
            "<a class='otherclass' id='whtbigheader'>test</a>",
            "<a><aclass='whtbigheader'>test</aclass></a>",
            "<a aclass='whtbigheader'>test</a>",
            "<a class='whtbigheader\"'>test</a>",
            "<ab class='whtbigheader'><a>test</a></ab>",
    };
    for ( final String test : failingTests )
    {
        final String contents = getAnchorContents( test );
        if ( contents == null )
            successes++;
        else
        {
            System.err.println( test + " => " + contents );
            failures++;
        }
    }
    System.out.println( "Successful tests: " + successes );
    System.out.println( "Failed tests: " + failures );
}
}
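
Because the question says the content is a whole HTML string, there may be more than one matching anchor. The following is a rough sketch of collecting every match by calling find() in a loop. It reuses the pattern from this answer unchanged; the class name AllAnchors and the method findAll are assumptions introduced here for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
public class AllAnchors {
    // Same pattern as in the answer above; group 2 holds the anchor content.
    static final Pattern ANCHOR_PATTERN = Pattern.compile(
            "<a\\s+(?:[^>]+\\s+)?class\\s*=\\s*(?:whtbigheader(?=\\s|>)|(['\"])(?:(?:(?!\\1).)*?\\s+)*whtbigheader(?:\\s+(?:(?!\\1).)*?)*\\1)[^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE);

    // Collects the content of every matching anchor, in document order.
    public static List<String> findAll(String html) {
        List<String> contents = new ArrayList<>();
        Matcher matcher = ANCHOR_PATTERN.matcher(html);
        while (matcher.find()) {          // each call resumes after the previous match
            contents.add(matcher.group(2));
        }
        return contents;
    }

    public static void main(String[] args) {
        String html = "<p><a class='other'>skip</a> "
                + "<A HREF='/x' CLASS='whtbigheader'>first</A> "
                + "<a class=\"big whtbigheader\">second</a></p>";
        System.out.println(findAll(html)); // prints [first, second]
    }
}
```

Note that in Java the dot does not match line terminators by default, so anchor content that spans several lines would additionally need Pattern.DOTALL.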

Answer 3 (score: 0)

You can use the following regular expression:

/<a[^>]*class=\s?['"]\s?whtbigheader\s?['"][^>]*>(.*?)<\/a>/i


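Note that if you only want the content of an a tag with a certain class, you don't need any extra regex inside the tag; a[^>]*class='whtbigheader'[^>]* is enough to do the job:

[^>]* will match anything except >

Also, you need to use the i modifier (IGNORE CASE) to ignore case!

In addition, regex is not the right way to parse (?:X|H)TML documents. You may want to consider using a proper parser.

Note that if you put the regex inside a quoted string, you need to escape the quotes next to the class name.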
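
As a sketch of how the expression above might look in Java: the /…/i delimiters are replaced by Pattern.CASE_INSENSITIVE, backslashes are doubled, and the double quote inside the character class is escaped for the string literal. The class name SimpleAnchorRegex and the method extract are hypothetical names used only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper, not part of the original answer.
public class SimpleAnchorRegex {
    // In a Java string literal, \s becomes \\s and " becomes \".
    static final Pattern ANCHOR = Pattern.compile(
            "<a[^>]*class=\\s?['\"]\\s?whtbigheader\\s?['\"][^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE); // equivalent of the i modifier

    // Returns the content of the first matching anchor, or null if none is found.
    public static String extract(String html) {
        Matcher m = ANCHOR.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String content = "<A HREF='/articles/0,7340,L-4664450,00.html' CLASS='whtbigheader' "
                + "style='color:#FFFFFF;' HM=1>need to get this value</A>";
        System.out.println(extract(content)); // prints "need to get this value"
    }
}
```

Because <a[^>]* accepts any characters after <a, this simple form can also match tags whose names merely start with a, so it is best treated as a quick approximation.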